DIP-Python tutorials for image processing and machine learning(36-42)-Pandas

Notes based on videos from the YouTube channel DigitalSreeni.

Main text


36 - Introduction to Pandas - Data reading and handling

  • Plot a histogram
python
import pandas as pd
 
df = pd.read_csv('images/grains/grain_measurements.csv')
df['Area'].plot(kind='hist', title='Area', bins=50)
<AxesSubplot:title={'center':'Area'}, ylabel='Frequency'>
  • Create a DataFrame and set its index and columns
python
data = [[10, 200, 60],
        [12, 155, 45],
        [9, 50, -45.],
        [16, 240, 90]]
 
df = pd.DataFrame(data, index=[1, 2, 3, 4], columns=['Area', 'Intensity', 'Orientation'])
df
   Area  Intensity  Orientation
1    10        200         60.0
2    12        155         45.0
3     9         50        -45.0
4    16        240         90.0
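The index and column labels can also be assigned after the DataFrame has been built. A minimal sketch, reusing the same `data` list; the step-by-step assignment is only an illustration and is not taken from the video notes:

```python
import pandas as pd

data = [[10, 200, 60],
        [12, 155, 45],
        [9, 50, -45.],
        [16, 240, 90]]

# Without index/columns arguments pandas uses 0..n-1 for both axes
df = pd.DataFrame(data)

# Overwrite the default labels in place
df.index = [1, 2, 3, 4]
df.columns = ['Area', 'Intensity', 'Orientation']

print(df)
```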

  • Check missing-value information for each column
python
import pandas as pd
 
df = pd.read_csv('data/manual_vs_auto.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  100 non-null    object 
 1   Image       100 non-null    object 
 2   Manual      94 non-null     float64
 3   Manual2     3 non-null      float64
 4   Auto_th_2   100 non-null    int64  
 5   Auto_th_3   100 non-null    int64  
 6   Auto_th_4   100 non-null    int64  
dtypes: float64(2), int64(3), object(2)
memory usage: 5.6+ KB
  • Get the number of rows and columns
python
df.shape
(100, 7)
  • Display the whole table
python
df
    Unnamed: 0     Image  Manual  Manual2  Auto_th_2  Auto_th_3  Auto_th_4
 0        Set1    Image1    92.0     93.0         70         87         82
 1        Set1    Image2    87.0     83.0         60         85         83
 2        Set1    Image3   104.0     98.0         74         99         94
 3        Set1    Image4    99.0      NaN         73        101        109
 4        Set1    Image5    89.0      NaN         59         90         67
..         ...       ...     ...      ...        ...        ...        ...
95        Set4   Image96   106.0      NaN         75        112         98
96        Set4   Image97    80.0      NaN         66         80         88
97        Set4   Image98    92.0      NaN         73         93         95
98        Set4   Image99   116.0      NaN        101        115         93
99        Set4  Image100    99.0      NaN         77        106        102

100 rows × 7 columns

  • Show the first 7 rows
python
df.head(7)
    Unnamed: 0     Image  Manual  Manual2  Auto_th_2  Auto_th_3  Auto_th_4
 0        Set1    Image1    92.0     93.0         70         87         82
 1        Set1    Image2    87.0     83.0         60         85         83
 2        Set1    Image3   104.0     98.0         74         99         94
 3        Set1    Image4    99.0      NaN         73        101        109
 4        Set1    Image5    89.0      NaN         59         90         67
 5        Set1    Image6   115.0      NaN         82        124        105
 6        Set1    Image7   102.0      NaN         68        103         93
  • Show the last 7 rows
python
df.tail(7)
    Unnamed: 0     Image  Manual  Manual2  Auto_th_2  Auto_th_3  Auto_th_4
93        Set4   Image94    81.0      NaN         65         90         70
94        Set4   Image95     NaN      NaN        104        122         88
95        Set4   Image96   106.0      NaN         75        112         98
96        Set4   Image97    80.0      NaN         66         80         88
97        Set4   Image98    92.0      NaN         73         93         95
98        Set4   Image99   116.0      NaN        101        115         93
99        Set4  Image100    99.0      NaN         77        106        102
  • Use a column as the index
python
df1 = df.set_index('Image')
df1.head()
        Unnamed: 0  Manual  Manual2  Auto_th_2  Auto_th_3  Auto_th_4
Image
Image1        Set1    92.0     93.0         70         87         82
Image2        Set1    87.0     83.0         60         85         83
Image3        Set1   104.0     98.0         74         99         94
Image4        Set1    99.0      NaN         73        101        109
Image5        Set1    89.0      NaN         59         90         67
  • List the column names
python
df1.columns
Index(['Unnamed: 0', 'Manual', 'Manual2', 'Auto_th_2', 'Auto_th_3',
       'Auto_th_4'],
      dtype='object')
  • Get the unique values of a column
python
df['Unnamed: 0'].unique()
array(['Set1', 'Set2', 'Set3', 'Set4'], dtype=object)
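To see how many rows belong to each unique value, `value_counts()` may be handy. A small sketch on a made-up Series; the labels below are hypothetical stand-ins, not values read from the CSV:

```python
import pandas as pd

# Hypothetical set labels standing in for the 'Unnamed: 0' column
sets = pd.Series(['Set1', 'Set1', 'Set2', 'Set3', 'Set3', 'Set3'])

unique_sets = sets.unique()    # distinct values, in order of first appearance
counts = sets.value_counts()   # number of rows per distinct value

print(unique_sets)
print(counts)
```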
  • Rename a column
python
df1 = df.rename(columns={'Unnamed: 0': 'Image_set'})
df1.columns
Index(['Image_set', 'Image', 'Manual', 'Manual2', 'Auto_th_2', 'Auto_th_3',
       'Auto_th_4'],
      dtype='object')
  • Show the data types
python
df.dtypes
Unnamed: 0     object
Image          object
Manual        float64
Manual2       float64
Auto_th_2       int64
Auto_th_3       int64
Auto_th_4       int64
dtype: object
  • Show summary statistics
python
df.describe()
           Manual     Manual2   Auto_th_2   Auto_th_3   Auto_th_4
count   94.000000    3.000000  100.000000  100.000000  100.000000
mean   100.021277   91.333333   76.370000   97.580000   93.210000
std     11.285140    7.637626   11.971055   12.327337   14.128769
min     80.000000   83.000000   55.000000   71.000000   63.000000
25%     90.250000   88.000000   67.750000   89.500000   83.750000
50%    101.000000   93.000000   74.500000   98.500000   93.000000
75%    108.000000   95.500000   85.000000  106.000000  103.250000
max    120.000000   98.000000  109.000000  124.000000  129.000000

37 - Introduction to Pandas - Data Manipulation

python
import pandas as pd
 
df = pd.read_csv('data/manual_vs_auto.csv')
df.head()
    Unnamed: 0     Image  Manual  Manual2  Auto_th_2  Auto_th_3  Auto_th_4
 0        Set1    Image1    92.0     93.0         70         87         82
 1        Set1    Image2    87.0     83.0         60         85         83
 2        Set1    Image3   104.0     98.0         74         99         94
 3        Set1    Image4    99.0      NaN         73        101        109
 4        Set1    Image5    89.0      NaN         59         90         67
  • Drop a column
python
df1 = df.drop('Manual2', axis=1)
df1.head()
    Unnamed: 0     Image  Manual  Auto_th_2  Auto_th_3  Auto_th_4
 0        Set1    Image1    92.0         70         87         82
 1        Set1    Image2    87.0         60         85         83
 2        Set1    Image3   104.0         74         99         94
 3        Set1    Image4    99.0         73        101        109
 4        Set1    Image5    89.0         59         90         67
  • Drop multiple columns
python
df2 = df.drop(['Manual2', 'Auto_th_2'], axis=1)
df2.head()
    Unnamed: 0     Image  Manual  Auto_th_3  Auto_th_4
 0        Set1    Image1    92.0         87         82
 1        Set1    Image2    87.0         85         83
 2        Set1    Image3   104.0         99         94
 3        Set1    Image4    99.0        101        109
 4        Set1    Image5    89.0         90         67
  • Add a column filled with a constant value
python
df['Date'] = '2019-06-24'
df.head()
    Unnamed: 0     Image  Manual  Manual2  Auto_th_2  Auto_th_3  Auto_th_4        Date
 0        Set1    Image1    92.0     93.0         70         87         82  2019-06-24
 1        Set1    Image2    87.0     83.0         60         85         83  2019-06-24
 2        Set1    Image3   104.0     98.0         74         99         94  2019-06-24
 3        Set1    Image4    99.0      NaN         73        101        109  2019-06-24
 4        Set1    Image5    89.0      NaN         59         90         67  2019-06-24
python
df.dtypes
Unnamed: 0     object
Image          object
Manual        float64
Manual2       float64
Auto_th_2       int64
Auto_th_3       int64
Auto_th_4       int64
Date           object
dtype: object
  • Convert the string to a datetime
python
df['Date'] = pd.to_datetime('2019-06-24')
df.head()
    Unnamed: 0     Image  Manual  Manual2  Auto_th_2  Auto_th_3  Auto_th_4        Date
 0        Set1    Image1    92.0     93.0         70         87         82  2019-06-24
 1        Set1    Image2    87.0     83.0         60         85         83  2019-06-24
 2        Set1    Image3   104.0     98.0         74         99         94  2019-06-24
 3        Set1    Image4    99.0      NaN         73        101        109  2019-06-24
 4        Set1    Image5    89.0      NaN         59         90         67  2019-06-24
python
df.dtypes
Unnamed: 0            object
Image                 object
Manual               float64
Manual2              float64
Auto_th_2              int64
Auto_th_3              int64
Auto_th_4              int64
Date          datetime64[ns]
dtype: object
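The call above assigns one fixed timestamp to every row. When a column already holds date strings, `pd.to_datetime` can convert the whole column at once. A minimal sketch on a hypothetical two-row frame (the dates are made up for illustration):

```python
import pandas as pd

# Hypothetical column of date strings
df = pd.DataFrame({'Date': ['2019-06-24', '2019-06-25']})

# Convert the whole column from strings to datetimes
df['Date'] = pd.to_datetime(df['Date'])

# The .dt accessor exposes date parts once the column is a datetime
years = df['Date'].dt.year.tolist()
days = df['Date'].dt.day.tolist()
print(df.dtypes)
print(years, days)
```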
  • Save to a .csv file
python
df.to_csv('data/manual_vs_auto_updated.csv')
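By default `to_csv` also writes the index as an extra first column, which shows up as `Unnamed: 0` when the file is read back. If that is not wanted, `index=False` skips it. A small sketch that writes to in-memory buffers instead of real files (the two-row frame is hypothetical):

```python
import io
import pandas as pd

df = pd.DataFrame({'Image': ['Image1', 'Image2'], 'Manual': [92.0, 87.0]})

# Default: the index is written as an unnamed first column
buf_with_index = io.StringIO()
df.to_csv(buf_with_index)
buf_with_index.seek(0)
cols_with = pd.read_csv(buf_with_index).columns.tolist()

# index=False: only the data columns are written
buf_without_index = io.StringIO()
df.to_csv(buf_without_index, index=False)
buf_without_index.seek(0)
cols_without = pd.read_csv(buf_without_index).columns.tolist()

print(cols_with)     # ['Unnamed: 0', 'Image', 'Manual']
print(cols_without)  # ['Image', 'Manual']
```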
  • Drop a row
python
df1 = df.drop(df.index[1])
df1.head()
    Unnamed: 0     Image  Manual  Manual2  Auto_th_2  Auto_th_3  Auto_th_4        Date
 0        Set1    Image1    92.0     93.0         70         87         82  2019-06-24
 2        Set1    Image3   104.0     98.0         74         99         94  2019-06-24
 3        Set1    Image4    99.0      NaN         73        101        109  2019-06-24
 4        Set1    Image5    89.0      NaN         59         90         67  2019-06-24
 5        Set1    Image6   115.0      NaN         82        124        105  2019-06-24
  • Drop the first 10 rows
python
df1 = df.iloc[10:]
df1.head()
    Unnamed: 0     Image  Manual  Manual2  Auto_th_2  Auto_th_3  Auto_th_4        Date
10        Set1   Image11    91.0      NaN         61         87         77  2019-06-24
11        Set1   Image12   119.0      NaN         79        105        111  2019-06-24
12        Set1   Image13     NaN      NaN         65         90         84  2019-06-24
13        Set1   Image14   117.0      NaN         94        115        105  2019-06-24
14        Set1   Image15    91.0      NaN         66         99         70  2019-06-24
  • Select rows by condition
python
df1 = df[df['Unnamed: 0'] != 'Set1']
df1.head()
    Unnamed: 0     Image  Manual  Manual2  Auto_th_2  Auto_th_3  Auto_th_4        Date
25        Set2   Image26   102.0      NaN         85        103        105  2019-06-24
26        Set2   Image27    93.0      NaN         76         84         98  2019-06-24
27        Set2   Image28    83.0      NaN         62         71         87  2019-06-24
28        Set2   Image29   110.0      NaN         92        117         85  2019-06-24
29        Set2   Image30    89.0      NaN         70         96         81  2019-06-24

38 - Introduction to Pandas - Data Sorting

  • Sort
python
import pandas as pd
 
df = pd.read_csv('data/manual_vs_auto.csv')
df2 = df.sort_values('Manual', ascending=True)  # ascending=True: sort in ascending order
  • Select specific rows/columns
python
df2[['Manual', 'Auto_th_2']]
    Manual  Auto_th_2
34    80.0         58
96    80.0         66
93    81.0         65
66    81.0         65
44    82.0         67
..     ...        ...
32     NaN         66
59     NaN         74
79     NaN         69
82     NaN         64
94     NaN        104

100 rows × 2 columns

python
df[20: 30]
    Unnamed: 0     Image  Manual  Manual2  Auto_th_2  Auto_th_3  Auto_th_4
20        Set1   Image21    89.0      NaN         65         94         86
21        Set1   Image22    88.0      NaN         66         96         83
22        Set1   Image23   106.0      NaN         71        112        105
23        Set1   Image24   107.0      NaN         92         91        111
24        Set1   Image25   108.0      NaN         93        113        115
25        Set2   Image26   102.0      NaN         85        103        105
26        Set2   Image27    93.0      NaN         76         84         98
27        Set2   Image28    83.0      NaN         62         71         87
28        Set2   Image29   110.0      NaN         92        117         85
29        Set2   Image30    89.0      NaN         70         96         81
  • The loc method selects values by row and column labels (names).

Reference: "Reading specific columns and rows in Pandas: a summary of loc and iloc usage" (blog of 子木同学, CSDN)

python
df.loc[20: 30, ['Manual', 'Auto_th_2']]
    Manual  Auto_th_2
20    89.0         65
21    88.0         66
22   106.0         71
23   107.0         92
24   108.0         93
25   102.0         85
26    93.0         76
27    83.0         62
28   110.0         92
29    89.0         70
30   115.0         77
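Note that the `loc` slice above returns 11 rows (labels 20 through 30), while `df[20: 30]` returned 10. Label-based slices in `loc` include the end label, whereas position-based slices in `iloc` exclude the end position. A minimal sketch on a hypothetical 40-row frame with a default integer index:

```python
import pandas as pd

# Hypothetical frame; the values are placeholders
df = pd.DataFrame({'Manual': [float(v) for v in range(40)],
                   'Auto_th_2': list(range(40))})

# loc: slices by label, end label 30 is included
by_label = df.loc[20:30, ['Manual', 'Auto_th_2']]

# iloc: slices by integer position, end position 30 is excluded
by_position = df.iloc[20:30, [0, 1]]

print(len(by_label), by_label.index[-1])        # 11 30
print(len(by_position), by_position.index[-1])  # 10 29
```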
python
set2_df = df[df['Unnamed: 0'] == 'Set2']
set2_df.head()
    Unnamed: 0     Image  Manual  Manual2  Auto_th_2  Auto_th_3  Auto_th_4
25        Set2   Image26   102.0      NaN         85        103        105
26        Set2   Image27    93.0      NaN         76         84         98
27        Set2   Image28    83.0      NaN         62         71         87
28        Set2   Image29   110.0      NaN         92        117         85
29        Set2   Image30    89.0      NaN         70         96         81
  • Get the maximum value
python
set2_df['Manual'].max()  # Series.max() skips NaN values by default
120.0
  • Select values by condition
python
df['Manual'] > 100
0     False
1     False
2      True
3     False
4     False
      ...  
95     True
96    False
97    False
98     True
99    False
Name: Manual, Length: 100, dtype: bool
python
df[df['Manual'] > 100].head()
    Unnamed: 0     Image  Manual  Manual2  Auto_th_2  Auto_th_3  Auto_th_4
 2        Set1    Image3   104.0     98.0         74         99         94
 5        Set1    Image6   115.0      NaN         82        124        105
 6        Set1    Image7   102.0      NaN         68        103         93
 7        Set1    Image8   117.0      NaN         77        122         88
 8        Set1    Image9   104.0      NaN         88         99        112
  • Combine conditions
python
df[(df['Manual'] > 100) & (df['Auto_th_2'] < 100)].head()
    Unnamed: 0     Image  Manual  Manual2  Auto_th_2  Auto_th_3  Auto_th_4
 2        Set1    Image3   104.0     98.0         74         99         94
 5        Set1    Image6   115.0      NaN         82        124        105
 6        Set1    Image7   102.0      NaN         68        103         93
 7        Set1    Image8   117.0      NaN         77        122         88
 8        Set1    Image9   104.0      NaN         88         99        112
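Besides `&` (and), boolean masks can be combined with `|` (or) and negated with `~` (not). Each comparison needs its own parentheses because these operators bind more tightly than `>` and `<`. A small sketch on a hypothetical four-row frame:

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen to exercise each operator
df = pd.DataFrame({'Manual': [104.0, 90.0, 115.0, np.nan],
                   'Auto_th_2': [74, 60, 105, 80]})

both = df[(df['Manual'] > 100) & (df['Auto_th_2'] < 100)]    # both conditions hold
either = df[(df['Manual'] > 100) | (df['Auto_th_2'] < 100)]  # at least one holds
not_high = df[~(df['Manual'] > 100)]                         # negation; NaN > 100 is False

print(both.index.tolist())      # [0]
print(either.index.tolist())    # [0, 1, 2, 3]
print(not_high.index.tolist())  # [1, 3]
```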
  • Iterate over rows
python
for index, row in df.iterrows():
    average_auto = (row['Auto_th_2'] + row['Auto_th_3'] + row['Auto_th_4']) / 3
    print(round(average_auto), row['Manual'])
80 92.0
76 87.0
89 104.0
94 99.0
72 89.0
104 115.0
88 102.0
96 117.0
100 104.0
87 103.0
75 91.0
98 119.0
80 nan
...
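`iterrows()` is convenient for inspection but tends to be slow on large tables. The same per-row average can be computed in one vectorized step with `mean(axis=1)`. A minimal sketch using the first two rows of the table as sample data; the column name `average_auto` is chosen here for illustration:

```python
import pandas as pd

# First two rows of the dataset, typed in by hand
df = pd.DataFrame({'Manual': [92.0, 87.0],
                   'Auto_th_2': [70, 60],
                   'Auto_th_3': [87, 85],
                   'Auto_th_4': [82, 83]})

auto_cols = ['Auto_th_2', 'Auto_th_3', 'Auto_th_4']

# axis=1 averages across the columns of each row
df['average_auto'] = df[auto_cols].mean(axis=1)

rounded = df['average_auto'].round().tolist()
print(rounded)  # [80.0, 76.0]
```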

39 - Introduction to Pandas - Grouping Data

python
import pandas as pd
 
df = pd.read_csv('data/manual_vs_auto.csv')
df = df.rename(columns = {'Unnamed: 0': 'Image_set'})
df.head()
     Image_set     Image  Manual  Manual2  Auto_th_2  Auto_th_3  Auto_th_4
 0        Set1    Image1    92.0     93.0         70         87         82
 1        Set1    Image2    87.0     83.0         60         85         83
 2        Set1    Image3   104.0     98.0         74         99         94
 3        Set1    Image4    99.0      NaN         73        101        109
 4        Set1    Image5    89.0      NaN         59         90         67
python
df = df.drop('Manual2', axis=1)
df.head()
     Image_set     Image  Manual  Auto_th_2  Auto_th_3  Auto_th_4
 0        Set1    Image1    92.0         70         87         82
 1        Set1    Image2    87.0         60         85         83
 2        Set1    Image3   104.0         74         99         94
 3        Set1    Image4    99.0         73        101        109
 4        Set1    Image5    89.0         59         90         67
  • Group by Image_set and compute statistics
python
group_by_file = df.groupby(by=['Image_set'])
set_data_count = group_by_file.count()
set_data_avg = group_by_file.mean(numeric_only=True)  # skip the non-numeric Image column (required in pandas 2.x)
python
set_data_count
           Image  Manual  Auto_th_2  Auto_th_3  Auto_th_4
Image_set
Set1          25      24         25         25         25
Set2          25      24         25         25         25
Set3          25      24         25         25         25
Set4          25      22         25         25         25
python
set_data_avg
               Manual  Auto_th_2  Auto_th_3  Auto_th_4
Image_set
Set1       100.666667      72.84      98.04      92.36
Set2        98.666667      75.40      98.00      93.44
Set3       100.000000      78.48      95.52      94.40
Set4       100.818182      78.76      98.76      92.64
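Several statistics for one column can also be collected in a single table with `agg`. A minimal sketch on a hypothetical four-row frame; note that `count` ignores missing values, which is why the Manual counts above are lower than 25:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing Manual value in Set1
df = pd.DataFrame({
    'Image_set': ['Set1', 'Set1', 'Set2', 'Set2'],
    'Manual': [92.0, np.nan, 100.0, 110.0],
})

# One row per group, one column per requested statistic
summary = df.groupby('Image_set')['Manual'].agg(['count', 'mean'])
print(summary)
```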
  • Compute the correlation (manual count vs. automatic threshold count)
python
df['Manual'].corr(df['Auto_th_2'])
0.7381233054217538

40 - Introduction to Pandas - Dealing with missing -null- data

python
import pandas as pd
 
df = pd.read_csv('data/manual_vs_auto.csv')
df.head(8)
    Unnamed: 0     Image  Manual  Manual2  Auto_th_2  Auto_th_3  Auto_th_4
 0        Set1    Image1    92.0     93.0         70         87         82
 1        Set1    Image2    87.0     83.0         60         85         83
 2        Set1    Image3   104.0     98.0         74         99         94
 3        Set1    Image4    99.0      NaN         73        101        109
 4        Set1    Image5    89.0      NaN         59         90         67
 5        Set1    Image6   115.0      NaN         82        124        105
 6        Set1    Image7   102.0      NaN         68        103         93
 7        Set1    Image8   117.0      NaN         77        122         88
  • Find missing values
python
df.isnull()
    Unnamed: 0  Image  Manual  Manual2  Auto_th_2  Auto_th_3  Auto_th_4
 0       False  False   False    False      False      False      False
 1       False  False   False    False      False      False      False
 2       False  False   False    False      False      False      False
 3       False  False   False     True      False      False      False
 4       False  False   False     True      False      False      False
..         ...    ...     ...      ...        ...        ...        ...
95       False  False   False     True      False      False      False
96       False  False   False     True      False      False      False
97       False  False   False     True      False      False      False
98       False  False   False     True      False      False      False
99       False  False   False     True      False      False      False

100 rows × 7 columns

python
df.isnull().sum()
Unnamed: 0     0
Image          0
Manual         6
Manual2       97
Auto_th_2      0
Auto_th_3      0
Auto_th_4      0
dtype: int64
  • Drop missing values
python
df = df.drop('Manual2', axis=1)
df2 = df.dropna()
df2.head(10)
    Unnamed: 0     Image  Manual  Auto_th_2  Auto_th_3  Auto_th_4
 0        Set1    Image1    92.0         70         87         82
 1        Set1    Image2    87.0         60         85         83
 2        Set1    Image3   104.0         74         99         94
 3        Set1    Image4    99.0         73        101        109
 4        Set1    Image5    89.0         59         90         67
 5        Set1    Image6   115.0         82        124        105
 6        Set1    Image7   102.0         68        103         93
 7        Set1    Image8   117.0         77        122         88
 8        Set1    Image9   104.0         88         99        112
 9        Set1   Image10   103.0         69         98         94
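The Manual2 column is dropped first because `dropna()` removes every row that has a NaN in any column, and Manual2 is almost entirely empty. An alternative that keeps the column is to restrict the check with the `subset` parameter. A minimal sketch on a hypothetical three-row frame:

```python
import numpy as np
import pandas as pd

# Hypothetical data: Manual2 is missing far more often than Manual
df = pd.DataFrame({'Manual': [92.0, np.nan, 104.0],
                   'Manual2': [93.0, np.nan, np.nan]})

dropped_any = df.dropna()                      # drops rows with NaN in any column
dropped_manual = df.dropna(subset=['Manual'])  # only checks the Manual column

print(len(dropped_any))     # 1
print(len(dropped_manual))  # 2
```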
python
df = pd.read_csv('data/manual_vs_auto.csv')
df.describe()
           Manual     Manual2   Auto_th_2   Auto_th_3   Auto_th_4
count   94.000000    3.000000  100.000000  100.000000  100.000000
mean   100.021277   91.333333   76.370000   97.580000   93.210000
std     11.285140    7.637626   11.971055   12.327337   14.128769
min     80.000000   83.000000   55.000000   71.000000   63.000000
25%     90.250000   88.000000   67.750000   89.500000   83.750000
50%    101.000000   93.000000   74.500000   98.500000   93.000000
75%    108.000000   95.500000   85.000000  106.000000  103.250000
max    120.000000   98.000000  109.000000  124.000000  129.000000
  • Fill missing values
python
df['Manual'] = df['Manual'].fillna(100)  # assign back; in-place fillna on a selected column is deprecated in recent pandas
df.head(10)
    Unnamed: 0     Image  Manual  Manual2  Auto_th_2  Auto_th_3  Auto_th_4
 0        Set1    Image1    92.0     93.0         70         87         82
 1        Set1    Image2    87.0     83.0         60         85         83
 2        Set1    Image3   104.0     98.0         74         99         94
 3        Set1    Image4    99.0      NaN         73        101        109
 4        Set1    Image5    89.0      NaN         59         90         67
 5        Set1    Image6   115.0      NaN         82        124        105
 6        Set1    Image7   102.0      NaN         68        103         93
 7        Set1    Image8   117.0      NaN         77        122         88
 8        Set1    Image9   104.0      NaN         88         99        112
 9        Set1   Image10   103.0      NaN         69         98         94
  • Fill missing values with the row average
python
import numpy as np
 
df = pd.read_csv('data/manual_vs_auto.csv')
df['Manual'] = df.apply(
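    lambda row: (round((row['Auto_th_2'] + row['Auto_th_3'] + row['Auto_th_4']) / 3))  # average of the three automatic counts
    if np.isnan(row['Manual'])  # used only when Manual is missing
    else row['Manual'], axis=1)  # applied row by row; the result is written back to the Manual column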
python
df.head(10)
    Unnamed: 0     Image  Manual  Manual2  Auto_th_2  Auto_th_3  Auto_th_4
 0        Set1    Image1    92.0     93.0         70         87         82
 1        Set1    Image2    87.0     83.0         60         85         83
 2        Set1    Image3   104.0     98.0         74         99         94
 3        Set1    Image4    99.0      NaN         73        101        109
 4        Set1    Image5    89.0      NaN         59         90         67
 5        Set1    Image6   115.0      NaN         82        124        105
 6        Set1    Image7   102.0      NaN         68        103         93
 7        Set1    Image8   117.0      NaN         77        122         88
 8        Set1    Image9   104.0      NaN         88         99        112
 9        Set1   Image10   103.0      NaN         69         98         94
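The same fill can be written without a row-wise `apply`: `fillna` accepts a Series and only uses its values where the original is missing. A minimal sketch on a hypothetical two-row frame; this is an alternative formulation, not the approach used in the video:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second row has no manual count
df = pd.DataFrame({'Manual': [92.0, np.nan],
                   'Auto_th_2': [70, 65],
                   'Auto_th_3': [87, 90],
                   'Auto_th_4': [82, 84]})

auto_cols = ['Auto_th_2', 'Auto_th_3', 'Auto_th_4']

# Rounded per-row average of the three automatic counts
row_avg = df[auto_cols].mean(axis=1).round()

# Existing Manual values are kept; only NaN entries take the row average
df['Manual'] = df['Manual'].fillna(row_avg)

print(df['Manual'].tolist())  # [92.0, 80.0]
```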

41 - Introduction to Pandas - Plotting

python
import pandas as pd
 
df = pd.read_csv('data/manual_vs_auto.csv')
df = df.rename(columns={'Unnamed: 0': 'Image_set'})
df.head()
     Image_set     Image  Manual  Manual2  Auto_th_2  Auto_th_3  Auto_th_4
 0        Set1    Image1    92.0     93.0         70         87         82
 1        Set1    Image2    87.0     83.0         60         85         83
 2        Set1    Image3   104.0     98.0         74         99         94
 3        Set1    Image4    99.0      NaN         73        101        109
 4        Set1    Image5    89.0      NaN         59         90         67
  • Plot a line chart
python
df['Manual'].plot()
<AxesSubplot:>
  • Plot a histogram
python
# kind 'hist', 30 bins, title 'Manual Count', figure size 12 x 10
df['Manual'].plot(kind='hist', bins=30, title='Manual Count', figsize=(12, 10))
<AxesSubplot:title={'center':'Manual Count'}, ylabel='Frequency'>
python
df['Manual'].rolling(3).mean().plot()
<AxesSubplot:>
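`rolling(3).mean()` averages each value with the two before it, so the first two entries have no full window and come out as NaN. A small sketch on a hypothetical five-element Series; `min_periods=1` is shown as one optional way to get values for the partial windows:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# Window of 3: positions 0 and 1 do not have three values yet
smoothed = s.rolling(3).mean()

# min_periods=1 averages whatever is available in the window
smoothed_min1 = s.rolling(3, min_periods=1).mean()

print(smoothed.tolist())       # [nan, nan, 2.0, 3.0, 4.0]
print(smoothed_min1.tolist())  # [1.0, 1.5, 2.0, 3.0, 4.0]
```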
python
df['Manual'].describe()
count     94.000000
mean     100.021277
std       11.285140
min       80.000000
25%       90.250000
50%      101.000000
75%      108.000000
max      120.000000
Name: Manual, dtype: float64
python
df['Manual'].plot(kind='box', figsize=(8, 6))
<AxesSubplot:>
  • Scatter plot
python
df.plot(kind='scatter', x='Manual', y='Auto_th_2', title='Manual vs Auto 2')
<AxesSubplot:title={'center':'Manual vs Auto 2'}, xlabel='Manual', ylabel='Auto_th_2'>
python
def cell_count(x):
    if x <= 100.0:
        return 'low'
    else:
        return 'high'
python
df['cell_count_index'] = df['Manual'].apply(cell_count)
df.head()
     Image_set     Image  Manual  Manual2  Auto_th_2  Auto_th_3  Auto_th_4  cell_count_index
 0        Set1    Image1    92.0     93.0         70         87         82               low
 1        Set1    Image2    87.0     83.0         60         85         83               low
 2        Set1    Image3   104.0     98.0         74         99         94              high
 3        Set1    Image4    99.0      NaN         73        101        109               low
 4        Set1    Image5    89.0      NaN         59         90         67               low
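For a simple two-way split like `cell_count`, `np.where` gives the same labels without calling a Python function per row. A minimal sketch on hypothetical values; as with the function above, a missing Manual value fails the `<= 100` test and is therefore labelled 'high':

```python
import numpy as np
import pandas as pd

# Hypothetical counts, including the boundary value and a missing value
df = pd.DataFrame({'Manual': [92.0, 104.0, 100.0, np.nan]})

# 'low' where the condition holds, 'high' everywhere else (including NaN)
df['cell_count_index'] = np.where(df['Manual'] <= 100.0, 'low', 'high')

labels = df['cell_count_index'].tolist()
print(labels)  # ['low', 'high', 'low', 'high']
```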
python
df.to_csv('data/manual_vs_auto2.csv')
python
df.boxplot(column='Manual', by='cell_count_index')
<AxesSubplot:title={'center':'Manual'}, xlabel='cell_count_index'>

42 - Introduction to Seaborn Plotting in Python

python
import pandas as pd
 
df = pd.read_csv('data/manual_vs_auto.csv')
df['Manual'] = df['Manual'].fillna(100)
df = df.rename(columns={'Unnamed: 0': 'Image_Set'})
python
import seaborn as sns
 
sns.distplot(df['Manual'])
C:\Users\gzjzx\anaconda3\lib\site-packages\seaborn\distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)



<AxesSubplot:xlabel='Manual', ylabel='Density'>
  • sns.kdeplot(): kernel density estimate plot
python
# fill=True replaces the deprecated shade=True in newer seaborn versions
sns.kdeplot(df['Manual'], fill=True)
sns.kdeplot(df['Auto_th_2'], fill=True)
sns.kdeplot(df['Auto_th_3'], fill=True)
sns.kdeplot(df['Auto_th_4'], fill=True)
<AxesSubplot:xlabel='Manual', ylabel='Density'>
  • sns.jointplot(): bivariate relationship plot
python
sns.jointplot(x='Manual', y='Auto_th_2', data=df, kind='kde')
<seaborn.axisgrid.JointGrid at 0x212f9ad23d0>
  • sns.pairplot(): shows pairwise relationships between features
python
sns.pairplot(df, x_vars=['Auto_th_2', 'Auto_th_3', 'Auto_th_4'], y_vars='Manual', height=6)
<seaborn.axisgrid.PairGrid at 0x212f9bd0fd0>
  • sns.lmplot(): shows a linear relationship
python
sns.lmplot(x='Manual', y='Auto_th_2', data=df, order=1, hue='Image_Set')
<seaborn.axisgrid.FacetGrid at 0x212fa457f70>
python
from scipy import stats
 
slope, intercept, r_value, p_value, std_err = stats.linregress(df['Manual'], df['Auto_th_2'])
slope, intercept, r_value, p_value, std_err
(0.772483189743971,
 -0.8937686381919718,
 0.7058094587729904,
 2.396963973676236e-16,
 0.07831918096230937)
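The fitted values can be read as a line, Auto_th_2 ≈ slope × Manual + intercept, and squaring `r_value` gives the coefficient of determination. A small arithmetic sketch using the numbers printed above; the names `r_squared` and `predicted_auto` and the choice of Manual = 100 are only illustrative:

```python
# Values copied from the linregress output above
slope = 0.772483189743971
intercept = -0.8937686381919718
r_value = 0.7058094587729904

# Fraction of the variance in Auto_th_2 explained by the linear fit
r_squared = r_value ** 2

# Predicted automatic count for a manual count of 100
predicted_auto = slope * 100 + intercept

print(round(r_squared, 3))
print(round(predicted_auto, 2))
```

An R² of roughly 0.5 suggests that the linear fit explains about half of the variation in Auto_th_2.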
  • sns.swarmplot(): swarm plot (categorical scatter plot with non-overlapping points)
python
df = pd.read_csv('data/manual_vs_auto2.csv')
df['Manual'] = df['Manual'].fillna(100)
df = df.rename(columns={'Unnamed: 0': 'Image_Set'})
 
sns.swarmplot(x='Image_Set', y='Manual', data=df, hue='cell_count_index', dodge=True)
<AxesSubplot:xlabel='Image_Set', ylabel='Manual'>
  • sns.heatmap(): heat map
python
corr = df.loc[:, df.dtypes == 'int64'].corr()  # correlate all int64 columns
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, cmap=sns.diverging_palette(220, 10, as_cmap=True))
<AxesSubplot:>