正文
3.1 pandas 基本介绍
如果把 numpy 比作列表,pandas 可以比作字典
1 2 import pandas as pdimport numpy as np
创建 pandas 序列
1 pd.Series([1 , 3 , 6 , np.nan, 44 , 1 ])
0 1.0
1 3.0
2 6.0
3 NaN
4 44.0
5 1.0
dtype: float64
创建一个日期序列
1 2 dates = pd.date_range('20160101' , periods=6 ) dates
DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
'2016-01-05', '2016-01-06'],
dtype='datetime64[ns]', freq='D')
生成数据表格, pd.DataFrame(数据, 索引, 标题列表)
1 pd.DataFrame(np.random.randn(6 , 4 ), index=dates, columns=['a' , 'b' , 'c' , 'd' ])
a
b
c
d
2016-01-01
0.014513
0.490584
-1.985363
-1.734158
2016-01-02
1.694216
0.383375
-1.260541
-0.126581
2016-01-03
0.475547
1.050239
0.897093
1.155942
2016-01-04
0.126726
-1.169920
-1.876652
-1.245558
2016-01-05
0.606119
0.648469
-1.367697
-0.822617
2016-01-06
-0.615075
0.557680
-2.104794
0.114070
默认生成的数据表格
1 pd.DataFrame(np.arange(12 ).reshape((3 , 4 )))
0
1
2
3
0
0
1
2
3
1
4
5
6
7
2
8
9
10
11
用字典代替要输入的值
1 2 3 4 5 6 7 df = pd.DataFrame({'A' : 1 , 'B' : pd.Timestamp('20130102' ), 'C' : pd.Series(1 , index=list (range (4 )), dtype='float32' ), 'D' : np.array([3 ] * 4 , dtype='int32' ), 'E' : pd.Categorical(["test" , "train" , "test" , "train" ]), 'F' : 'foo' }) df
A
B
C
D
E
F
0
1
2013-01-02
1.0
3
test
foo
1
1
2013-01-02
1.0
3
train
foo
2
1
2013-01-02
1.0
3
test
foo
3
1
2013-01-02
1.0
3
train
foo
输出表格的类型
A int64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
输出表格的索引
Int64Index([0, 1, 2, 3], dtype='int64')
输出表格的行名
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
输出表格每行的数据
array([[1, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
[1, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
dtype=object)
输出表格每行的数据特征
A
C
D
count
4.0
4.0
4.0
mean
1.0
1.0
3.0
std
0.0
0.0
0.0
min
1.0
1.0
3.0
25%
1.0
1.0
3.0
50%
1.0
1.0
3.0
75%
1.0
1.0
3.0
max
1.0
1.0
3.0
转置表格
0
1
2
3
A
1
1
1
1
B
2013-01-02 00:00:00
2013-01-02 00:00:00
2013-01-02 00:00:00
2013-01-02 00:00:00
C
1.0
1.0
1.0
1.0
D
3
3
3
3
E
test
train
test
train
F
foo
foo
foo
foo
排序表格
1 df.sort_index(axis=1 , ascending=False )
F
E
D
C
B
A
0
foo
test
3
1.0
2013-01-02
1
1
foo
train
3
1.0
2013-01-02
1
2
foo
test
3
1.0
2013-01-02
1
3
foo
train
3
1.0
2013-01-02
1
1 df.sort_index(axis=0 , ascending=False )
A
B
C
D
E
F
3
1
2013-01-02
1.0
3
train
foo
2
1
2013-01-02
1.0
3
test
foo
1
1
2013-01-02
1.0
3
train
foo
0
1
2013-01-02
1.0
3
test
foo
A
B
C
D
E
F
0
1
2013-01-02
1.0
3
test
foo
2
1
2013-01-02
1.0
3
test
foo
1
1
2013-01-02
1.0
3
train
foo
3
1
2013-01-02
1.0
3
train
foo
3.2 pandas 选择数据
1 2 3 4 5 6 import pandas as pdimport numpy as np dates = pd.date_range('20130101' , periods=6 ) df = pd.DataFrame(np.arange(24 ).reshape((6 , 4 )), index=dates, columns=['A' , 'B' , 'C' , 'D' ]) df
A
B
C
D
2013-01-01
0
1
2
3
2013-01-02
4
5
6
7
2013-01-03
8
9
10
11
2013-01-04
12
13
14
15
2013-01-05
16
17
18
19
2013-01-06
20
21
22
23
选择表格中 A 列的所有数据
2013-01-01 0
2013-01-02 4
2013-01-03 8
2013-01-04 12
2013-01-05 16
2013-01-06 20
Freq: D, Name: A, dtype: int32
2013-01-01 0
2013-01-02 4
2013-01-03 8
2013-01-04 12
2013-01-05 16
2013-01-06 20
Freq: D, Name: A, dtype: int32
选择表格中某些行中所有的数据
A
B
C
D
2013-01-01
0
1
2
3
2013-01-02
4
5
6
7
2013-01-03
8
9
10
11
1 df['20130102' :'20130104' ]
A
B
C
D
2013-01-02
4
5
6
7
2013-01-03
8
9
10
11
2013-01-04
12
13
14
15
根据标签选择数据
A 4
B 5
C 6
D 7
Name: 2013-01-02 00:00:00, dtype: int32
A
B
2013-01-01
0
1
2013-01-02
4
5
2013-01-03
8
9
2013-01-04
12
13
2013-01-05
16
17
2013-01-06
20
21
1 df.loc['20130102' , ['A' , 'B' ]]
A 4
B 5
Name: 2013-01-02 00:00:00, dtype: int32
根据位置选择数据
A 12
B 13
C 14
D 15
Name: 2013-01-04 00:00:00, dtype: int32
13
B
C
2013-01-04
13
14
2013-01-05
17
18
B
C
2013-01-02
5
6
2013-01-04
13
14
2013-01-06
21
22
混合选择表格中的数据
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Input In [27], in <cell line: 1>()
----> 1 df.ix[:3, ['A', 'C']]
File ~\anaconda3\lib\site-packages\pandas\core\generic.py:5575, in NDFrame.__getattr__(self, name)
5568 if (
5569 name not in self._internal_names_set
5570 and name not in self._metadata
5571 and name not in self._accessors
5572 and self._info_axis._can_hold_identifiers_and_holds_name(name)
5573 ):
5574 return self[name]
-> 5575 return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'ix'
是或否的筛选
A
B
C
D
2013-01-04
12
13
14
15
2013-01-05
16
17
18
19
2013-01-06
20
21
22
23
3.3pandas 设置值
1 2 3 4 5 import pandas as pdimport numpy as np dates = pd.date_range('20130101' , periods=6 ) df = pd.DataFrame(np.arange(24 ).reshape((6 , 4 )), index=dates, columns=['A' , 'B' , 'C' , 'D' ])
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
A
B
C
D
2013-01-01
0
1
2
3
2013-01-02
4
5
6
7
2013-01-03
8
9
10
11
2013-01-04
12
13
14
15
2013-01-05
16
17
18
19
2013-01-06
20
21
22
23
修改表格中的值
1 2 3 4 5 df.iloc[2 , 2 ] = 1111 df.loc['20130101' , 'B' ] = 2222 df[df.A > 4 ] = 0 df.B[df.A > 4 ] = 0 df
A
B
C
D
2013-01-01
0
2222
2
3
2013-01-02
4
5
6
7
2013-01-03
0
0
1111
0
2013-01-04
0
0
0
0
2013-01-05
0
0
0
0
2013-01-06
0
0
0
0
给表格中添加空的行
A
B
C
D
F
2013-01-01
0
2222
2
3
NaN
2013-01-02
4
5
6
7
NaN
2013-01-03
0
0
1111
0
NaN
2013-01-04
0
0
0
0
NaN
2013-01-05
0
0
0
0
NaN
2013-01-06
0
0
0
0
NaN
1 2 df['E' ] = pd.Series([1 , 2 , 3 , 4 , 5 , 6 ], index=pd.date_range('20130101' , periods=6 )) df
A
B
C
D
F
E
2013-01-01
0
2222
2
3
NaN
1
2013-01-02
4
5
6
7
NaN
2
2013-01-03
0
0
1111
0
NaN
3
2013-01-04
0
0
0
0
NaN
4
2013-01-05
0
0
0
0
NaN
5
2013-01-06
0
0
0
0
NaN
6
3.4 pandas 处理丢失数据
1 2 3 4 5 6 7 8 import pandas as pdimport numpy as np dates = pd.date_range('20130101' , periods=6 ) df = pd.DataFrame(np.arange(24 ).reshape((6 , 4 )), index=dates, columns=['A' , 'B' , 'C' , 'D' ]) df.iloc[0 , 1 ] = np.nan df.iloc[1 , 2 ] = np.nan df
A
B
C
D
2013-01-01
0
NaN
2.0
3
2013-01-02
4
5.0
NaN
7
2013-01-03
8
9.0
10.0
11
2013-01-04
12
13.0
14.0
15
2013-01-05
16
17.0
18.0
19
2013-01-06
20
21.0
22.0
23
丢弃 nan 数据
1 df.dropna(axis=0 , how='any' )
A
B
C
D
2013-01-03
8
9.0
10.0
11
2013-01-04
12
13.0
14.0
15
2013-01-05
16
17.0
18.0
19
2013-01-06
20
21.0
22.0
23
将 nan 数据更改为其他值
A
B
C
D
2013-01-01
0
0.0
2.0
3
2013-01-02
4
5.0
0.0
7
2013-01-03
8
9.0
10.0
11
2013-01-04
12
13.0
14.0
15
2013-01-05
16
17.0
18.0
19
2013-01-06
20
21.0
22.0
23
判断表格中是否有缺失数据
A
B
C
D
2013-01-01
False
True
False
False
2013-01-02
False
False
True
False
2013-01-03
False
False
False
False
2013-01-04
False
False
False
False
2013-01-05
False
False
False
False
2013-01-06
False
False
False
False
A
B
C
D
2013-01-01
False
True
False
False
2013-01-02
False
False
True
False
2013-01-03
False
False
False
False
2013-01-04
False
False
False
False
2013-01-05
False
False
False
False
2013-01-06
False
False
False
False
1 np.any (df.isnull() is True )
False
3.5 3D 数据
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 import numpy as npimport matplotlib.pyplot as pltfrom mpl_toolkits.mplot3d import Axes3D fig = plt.figure() ax = Axes3D(fig, auto_add_to_figure=False ) fig.add_axes(ax) X = np.arange(-4 , 4 , 0.25 ) Y = np.arange(-4 , 4 , 0.25 ) X, Y = np.meshgrid(X, Y) R = np.sqrt(X ** 2 + Y ** 2 ) Z = np.sin(R) ax.plot_surface(X, Y, Z, rstride=1 , cstride=1 , cmap=plt.get_cmap('rainbow' )) ax.contourf(X, Y, Z, zdir='z' , offset=-2 , cmap='rainbow' ) ax.set_zlim(-2 , 2 ) plt.show()
3.6 pandas 合并 concat
1 2 import pandas as pd import numpy as np
concatenating
1 2 3 df1 = pd.DataFrame(np.ones((3 , 4 )) * 0 , columns=['a' , 'b' , 'c' , 'd' ]) df2 = pd.DataFrame(np.ones((3 , 4 )) * 1 , columns=['a' , 'b' , 'c' , 'd' ]) df3 = pd.DataFrame(np.ones((3 , 4 )) * 2 , columns=['a' , 'b' , 'c' , 'd' ])
a
b
c
d
0
0.0
0.0
0.0
0.0
1
0.0
0.0
0.0
0.0
2
0.0
0.0
0.0
0.0
a
b
c
d
0
1.0
1.0
1.0
1.0
1
1.0
1.0
1.0
1.0
2
1.0
1.0
1.0
1.0
a
b
c
d
0
2.0
2.0
2.0
2.0
1
2.0
2.0
2.0
2.0
2
2.0
2.0
2.0
2.0
上下合并
1 pd.concat([df1, df2, df3], axis=0 )
a
b
c
d
0
0.0
0.0
0.0
0.0
1
0.0
0.0
0.0
0.0
2
0.0
0.0
0.0
0.0
0
1.0
1.0
1.0
1.0
1
1.0
1.0
1.0
1.0
2
1.0
1.0
1.0
1.0
0
2.0
2.0
2.0
2.0
1
2.0
2.0
2.0
2.0
2
2.0
2.0
2.0
2.0
1 pd.concat([df1, df2, df3], axis=0 , ignore_index=True )
a
b
c
d
0
0.0
0.0
0.0
0.0
1
0.0
0.0
0.0
0.0
2
0.0
0.0
0.0
0.0
3
1.0
1.0
1.0
1.0
4
1.0
1.0
1.0
1.0
5
1.0
1.0
1.0
1.0
6
2.0
2.0
2.0
2.0
7
2.0
2.0
2.0
2.0
8
2.0
2.0
2.0
2.0
join, [‘inner’, ‘outer’]
1 2 3 df1 = pd.DataFrame(np.ones((3 , 4 )) * 0 , columns=['a' , 'b' , 'c' , 'd' ]) df2 = pd.DataFrame(np.ones((3 , 4 )) * 1 , columns=['b' , 'c' , 'd' , 'e' ]) df1
a
b
c
d
0
0.0
0.0
0.0
0.0
1
0.0
0.0
0.0
0.0
2
0.0
0.0
0.0
0.0
b
c
d
e
0
1.0
1.0
1.0
1.0
1
1.0
1.0
1.0
1.0
2
1.0
1.0
1.0
1.0
直接合并出现的结果,会用 nan 填充
1 pd.concat([df1, df2], join='outer' )
a
b
c
d
e
0
0.0
0.0
0.0
0.0
NaN
1
0.0
0.0
0.0
0.0
NaN
2
0.0
0.0
0.0
0.0
NaN
0
NaN
1.0
1.0
1.0
1.0
1
NaN
1.0
1.0
1.0
1.0
2
NaN
1.0
1.0
1.0
1.0
1 pd.concat([df1, df2], join='inner' )
b
c
d
0
0.0
0.0
0.0
1
0.0
0.0
0.0
2
0.0
0.0
0.0
0
1.0
1.0
1.0
1
1.0
1.0
1.0
2
1.0
1.0
1.0
1 pd.concat([df1, df2], join='inner' , ignore_index=True )
b
c
d
0
0.0
0.0
0.0
1
0.0
0.0
0.0
2
0.0
0.0
0.0
3
1.0
1.0
1.0
4
1.0
1.0
1.0
5
1.0
1.0
1.0
按照索引进行合并
1 pd.concat([df1, df2], axis=1 , join_axes=[df1.index])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Input In [23], in <cell line: 1>()
----> 1 pd.concat([df1, df2], axis=1, join_axes=[df1.index])
File ~\anaconda3\lib\site-packages\pandas\util\_decorators.py:311, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
305 if len(args) > num_allow_args:
306 warnings.warn(
307msg.format(arguments=arguments),
308 FutureWarning,
309stacklevel=stacklevel,
310 )
--> 311 return func(*args, **kwargs)
TypeError: concat() got an unexpected keyword argument 'join_axes'
append 将被移除,不推荐使用
C:\Users\gzjzx\AppData\Local\Temp\ipykernel_10380\266511466.py:1: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
df1.append([df2, df3])
a
b
c
d
e
0
0.0
0.0
0.0
0.0
NaN
1
0.0
0.0
0.0
0.0
NaN
2
0.0
0.0
0.0
0.0
NaN
0
NaN
1.0
1.0
1.0
1.0
1
NaN
1.0
1.0
1.0
1.0
2
NaN
1.0
1.0
1.0
1.0
0
2.0
2.0
2.0
2.0
NaN
1
2.0
2.0
2.0
2.0
NaN
2
2.0
2.0
2.0
2.0
NaN
3.7pandas 合并 merge
1 2 3 4 5 6 7 8 import pandas as pd left = pd.DataFrame({'key' : ['K0' , 'K1' , 'K2' , 'K3' ], 'A' : ['A0' , 'A1' , 'A2' , 'A3' ], 'B' : ['B0' , 'B1' , 'B2' , 'B3' ]}) right = pd.DataFrame({'key' : ['K0' , 'K1' , 'K2' , 'K3' ], 'C' : ['C0' , 'C1' , 'C2' , 'C3' ], 'D' : ['D0' , 'D1' , 'D2' , 'D3' ]})
key
A
B
0
K0
A0
B0
1
K1
A1
B1
2
K2
A2
B2
3
K3
A3
B3
key
C
D
0
K0
C0
D0
1
K1
C1
D1
2
K2
C2
D2
3
K3
C3
D3
1 pd.merge(left, right, on='key' )
key
A
B
C
D
0
K0
A0
B0
C0
D0
1
K1
A1
B1
C1
D1
2
K2
A2
B2
C2
D2
3
K3
A3
B3
C3
D3
在有两个 key 的情况下
1 2 3 4 5 6 7 8 9 left = pd.DataFrame({'key1' : ['K0' , 'K0' , 'K1' , 'K2' ], 'key2' : ['K0' , 'K1' , 'K0' , 'K1' ], 'A' : ['A0' , 'A1' , 'A2' , 'A3' ], 'B' : ['B0' , 'B1' , 'B2' , 'B3' ]}) right = pd.DataFrame({'key1' : ['K0' , 'K1' , 'K1' , 'K2' ], 'key2' : ['K0' , 'K0' , 'K0' , 'K0' ], 'C' : ['C0' , 'C1' , 'C2' , 'C3' ], 'D' : ['D0' , 'D1' , 'D2' , 'D3' ]}) left
key1
key2
A
B
0
K0
K0
A0
B0
1
K0
K1
A1
B1
2
K1
K0
A2
B2
3
K2
K1
A3
B3
key1
key2
C
D
0
K0
K0
C0
D0
1
K1
K0
C1
D1
2
K1
K0
C2
D2
3
K2
K0
C3
D3
1 pd.merge(left, right, on=['key1' , 'key2' ], how='inner' )
key1
key2
A
B
C
D
0
K0
K0
A0
B0
C0
D0
1
K1
K0
A2
B2
C1
D1
2
K1
K0
A2
B2
C2
D2
1 pd.merge(left, right, on=['key1' , 'key2' ], how='outer' )
key1
key2
A
B
C
D
0
K0
K0
A0
B0
C0
D0
1
K0
K1
A1
B1
NaN
NaN
2
K1
K0
A2
B2
C1
D1
3
K1
K0
A2
B2
C2
D2
4
K2
K1
A3
B3
NaN
NaN
5
K2
K0
NaN
NaN
C3
D3
1 pd.merge(left, right, on=['key1' , 'key2' ], how='left' )
key1
key2
A
B
C
D
0
K0
K0
A0
B0
C0
D0
1
K0
K1
A1
B1
NaN
NaN
2
K1
K0
A2
B2
C1
D1
3
K1
K0
A2
B2
C2
D2
4
K2
K1
A3
B3
NaN
NaN
1 pd.merge(left, right, on=['key1' , 'key2' ], how='right' , indicator=True )
key1
key2
A
B
C
D
_merge
0
K0
K0
A0
B0
C0
D0
both
1
K1
K0
A2
B2
C1
D1
both
2
K1
K0
A2
B2
C2
D2
both
3
K2
K0
NaN
NaN
C3
D3
right_only
1 pd.merge(left, right, on=['key1' , 'key2' ], how='right' , indicator="indicator_column" )
key1
key2
A
B
C
D
indicator_column
0
K0
K0
A0
B0
C0
D0
both
1
K1
K0
A2
B2
C1
D1
both
2
K1
K0
A2
B2
C2
D2
both
3
K2
K0
NaN
NaN
C3
D3
right_only
1 pd.merge(left, right, left_index=True , right_index=True , how='outer' )
key1_x
key2_x
A
B
key1_y
key2_y
C
D
0
K0
K0
A0
B0
K0
K0
C0
D0
1
K0
K1
A1
B1
K1
K0
C1
D1
2
K1
K0
A2
B2
K1
K0
C2
D2
3
K2
K1
A3
B3
K2
K0
C3
D3
处理合并造成的重复问题
1 2 3 boys = pd.DataFrame({'k' : ['K0' , 'K1' , 'K2' ], 'age' : [1 , 2 , 3 ]}) girls = pd.DataFrame({'k' : ['K0' , 'K1' , 'K2' ], 'age' : [4 , 5 , 6 ]}) boys
k
age
0
K0
1
1
K1
2
2
K2
3
k
age
0
K0
4
1
K1
5
2
K2
6
1 pd.merge(boys, girls, on='k' , suffixes=['_boy' , '_girl' ], how='inner' )
k
age_boy
age_girl
0
K0
1
4
1
K1
2
5
2
K2
3
6
3.8 pandas plot 画图
1 2 3 import pandas as pdimport numpy as npimport matplotlib.pyplot as plt
0 0.682404
1 1.833947
2 2.639884
3 1.843226
4 3.675730
...
995 72.472419
996 70.507220
997 70.897704
998 71.912970
999 71.542112
Length: 1000, dtype: float64
Series 折线图
1 2 data = pd.Series(np.random.randn(1000 ), index=np.arange(1000 )).cumsum() data
0 -1.287330
1 -1.293898
2 -0.166798
3 -1.177414
4 -1.890751
...
995 10.204522
996 10.308410
997 7.887580
998 8.327458
999 9.105820
Length: 1000, dtype: float64
DataFrame 折线图
1 2 data = pd.DataFrame(np.random.randn(1000 , 4 ), index=np.arange(1000 ), columns=list ('ABCD' )).cumsum() data
A
B
C
D
0
0.020454
-1.172689
1.843716
1.544707
1
-0.097744
0.989506
1.068073
2.404901
2
-0.273196
0.758510
-1.905761
3.836764
3
0.401259
1.408281
-2.043875
3.762166
4
1.821346
0.941836
-1.549852
4.347704
...
...
...
...
...
995
-4.775217
-18.948179
-47.410079
-53.719370
996
-3.830409
-19.661865
-46.040664
-54.201339
997
-3.061518
-20.368645
-46.068266
-54.682348
998
-3.447858
-21.113475
-46.808006
-54.889304
999
-2.270221
-21.544267
-46.512128
-54.953465
1000 rows × 4 columns
plot 的方法:bar 条形图,hist,box,kde,area,scatter,hexbin,pie……
1 2 3 ax = data.plot.scatter(x='A' , y='B' , color='DarkBlue' , label='Class1' ) data.plot.scatter(x='A' , y='C' , color='DarkGreen' , label='Class2' , ax=ax) plt.show()