Python-莫烦python学习笔记(pandas)

Python pandas 库的初尝试。学习自莫烦python。

正文

3.1 pandas 基本介绍

如果把 numpy 比作列表,pandas 可以比作字典

python
import pandas as pd
import numpy as np

创建 pandas 序列

python
pd.Series([1, 3, 6, np.nan, 44, 1]) # np.nan 相当于 None
0     1.0
1     3.0
2     6.0
3     NaN
4    44.0
5     1.0
dtype: float64

创建一个日期序列

python
dates = pd.date_range('20160101', periods=6)
dates
DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
               '2016-01-05', '2016-01-06'],
              dtype='datetime64[ns]', freq='D')

生成数据表格, pd.DataFrame(数据, 索引, 标题列表)

python
pd.DataFrame(np.random.randn(6, 4), index=dates, columns=['a', 'b', 'c', 'd'])
abcd
2016-01-010.0145130.490584-1.985363-1.734158
2016-01-021.6942160.383375-1.260541-0.126581
2016-01-030.4755471.0502390.8970931.155942
2016-01-040.126726-1.169920-1.876652-1.245558
2016-01-050.6061190.648469-1.367697-0.822617
2016-01-06-0.6150750.557680-2.1047940.114070

默认生成的数据表格

python
pd.DataFrame(np.arange(12).reshape((3, 4)))
0123
00123
14567
2891011

用字典代替要输入的值

python
df = pd.DataFrame({'A': 1,
                   'B': pd.Timestamp('20130102'),
                   'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                   'D': np.array([3] * 4, dtype='int32'),
                   'E': pd.Categorical(["test", "train", "test", "train"]),
                   'F': 'foo'})
df
ABCDEF
012013-01-021.03testfoo
112013-01-021.03trainfoo
212013-01-021.03testfoo
312013-01-021.03trainfoo

输出表格的类型

python
df.dtypes
A             int64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

输出表格的索引

python
df.index
Int64Index([0, 1, 2, 3], dtype='int64')

输出表格的行名

python
df.columns
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')

输出表格每行的数据

python
df.values
array([[1, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)

输出表格每行的数据特征

python
df.describe()
ACD
count4.04.04.0
mean1.01.03.0
std0.00.00.0
min1.01.03.0
25%1.01.03.0
50%1.01.03.0
75%1.01.03.0
max1.01.03.0

转置表格

python
df.T
0123
A1111
B2013-01-02 00:00:002013-01-02 00:00:002013-01-02 00:00:002013-01-02 00:00:00
C1.01.01.01.0
D3333
Etesttraintesttrain
Ffoofoofoofoo

排序表格

python
df.sort_index(axis=1, ascending=False) # 以行排序,倒序
FEDCBA
0footest31.02013-01-021
1footrain31.02013-01-021
2footest31.02013-01-021
3footrain31.02013-01-021
python
df.sort_index(axis=0, ascending=False)
ABCDEF
312013-01-021.03trainfoo
212013-01-021.03testfoo
112013-01-021.03trainfoo
012013-01-021.03testfoo
python
df.sort_values(by='E')
ABCDEF
012013-01-021.03testfoo
212013-01-021.03testfoo
112013-01-021.03trainfoo
312013-01-021.03trainfoo

3.2 pandas 选择数据

python
import pandas as pd
import numpy as np
 
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=['A', 'B', 'C', 'D'])
df
ABCD
2013-01-010123
2013-01-024567
2013-01-03891011
2013-01-0412131415
2013-01-0516171819
2013-01-0620212223

选择表格中 A 列的所有数据

python
df.A
2013-01-01     0
2013-01-02     4
2013-01-03     8
2013-01-04    12
2013-01-05    16
2013-01-06    20
Freq: D, Name: A, dtype: int32
python
df['A']
2013-01-01     0
2013-01-02     4
2013-01-03     8
2013-01-04    12
2013-01-05    16
2013-01-06    20
Freq: D, Name: A, dtype: int32

选择表格中某些行中所有的数据

python
df[0:3]
ABCD
2013-01-010123
2013-01-024567
2013-01-03891011
python
df['20130102':'20130104']
ABCD
2013-01-024567
2013-01-03891011
2013-01-0412131415

根据标签选择数据

python
df.loc['20130102']
A    4
B    5
C    6
D    7
Name: 2013-01-02 00:00:00, dtype: int32
python
df.loc[:, ['A', 'B']]
AB
2013-01-0101
2013-01-0245
2013-01-0389
2013-01-041213
2013-01-051617
2013-01-062021
python
df.loc['20130102', ['A', 'B']]
A    4
B    5
Name: 2013-01-02 00:00:00, dtype: int32

根据位置选择数据

python
df.iloc[3]
A    12
B    13
C    14
D    15
Name: 2013-01-04 00:00:00, dtype: int32
python
df.iloc[3, 1]
13
python
df.iloc[3:5, 1:3]
BC
2013-01-041314
2013-01-051718
python
df.iloc[[1, 3, 5], 1:3]
BC
2013-01-0256
2013-01-041314
2013-01-062122

混合选择表格中的数据

python
df.ix[:3, ['A', 'C']] # 高版本中已不推荐使用
---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

Input In [27], in <cell line: 1>()
----> 1 df.ix[:3, ['A', 'C']]


File ~\anaconda3\lib\site-packages\pandas\core\generic.py:5575, in NDFrame.__getattr__(self, name)
   5568 if (
   5569     name not in self._internal_names_set
   5570     and name not in self._metadata
   5571     and name not in self._accessors
   5572     and self._info_axis._can_hold_identifiers_and_holds_name(name)
   5573 ):
   5574     return self[name]
-> 5575 return object.__getattribute__(self, name)


AttributeError: 'DataFrame' object has no attribute 'ix'

是或否的筛选

python
df[df.A > 8] # 只对比 A,但是 ABCD 都会显示
ABCD
2013-01-0412131415
2013-01-0516171819
2013-01-0620212223

3.3pandas 设置值

python
import pandas as pd
import numpy as np
 
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=['A', 'B', 'C', 'D'])
python
dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
python
df
ABCD
2013-01-010123
2013-01-024567
2013-01-03891011
2013-01-0412131415
2013-01-0516171819
2013-01-0620212223

修改表格中的值

python
df.iloc[2, 2] = 1111
df.loc['20130101', 'B'] = 2222
df[df.A > 4] = 0
df.B[df.A > 4] = 0
df
ABCD
2013-01-010222223
2013-01-024567
2013-01-030011110
2013-01-040000
2013-01-050000
2013-01-060000

给表格中添加空的行

python
df['F'] = np.nan
df
ABCDF
2013-01-010222223NaN
2013-01-024567NaN
2013-01-030011110NaN
2013-01-040000NaN
2013-01-050000NaN
2013-01-060000NaN
python
df['E'] = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130101', periods=6))
df
ABCDFE
2013-01-010222223NaN1
2013-01-024567NaN2
2013-01-030011110NaN3
2013-01-040000NaN4
2013-01-050000NaN5
2013-01-060000NaN6

3.4 pandas 处理丢失数据

python
import pandas as pd
import numpy as np
 
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=['A', 'B', 'C', 'D'])
df.iloc[0, 1] = np.nan
df.iloc[1, 2] = np.nan
df
ABCD
2013-01-010NaN2.03
2013-01-0245.0NaN7
2013-01-0389.010.011
2013-01-041213.014.015
2013-01-051617.018.019
2013-01-062021.022.023

丢弃 nan 数据

python
df.dropna(axis=0, how='any') # how = {'any', 'all'} 出现 nan/所有数据都是 nan 是才处理
ABCD
2013-01-0389.010.011
2013-01-041213.014.015
2013-01-051617.018.019
2013-01-062021.022.023

将 nan 数据更改为其他值

python
df.fillna(value=0)
ABCD
2013-01-0100.02.03
2013-01-0245.00.07
2013-01-0389.010.011
2013-01-041213.014.015
2013-01-051617.018.019
2013-01-062021.022.023

判断表格中是否有缺失数据

python
df.isna()
ABCD
2013-01-01FalseTrueFalseFalse
2013-01-02FalseFalseTrueFalse
2013-01-03FalseFalseFalseFalse
2013-01-04FalseFalseFalseFalse
2013-01-05FalseFalseFalseFalse
2013-01-06FalseFalseFalseFalse
python
df.isnull()
ABCD
2013-01-01FalseTrueFalseFalse
2013-01-02FalseFalseTrueFalse
2013-01-03FalseFalseFalseFalse
2013-01-04FalseFalseFalseFalse
2013-01-05FalseFalseFalseFalse
2013-01-06FalseFalseFalseFalse
python
np.any(df.isnull() is True) # 若返回 True,则说明表格中的确缺失数据
False

3.5 3D 数据

python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
 
fig = plt.figure()  # 新建一个 figure 窗口
ax = Axes3D(fig, auto_add_to_figure=False)  # 添加三维坐标轴
fig.add_axes(ax)
 
# 输入数据
X = np.arange(-4, 4, 0.25)
Y = np.arange(-4, 4, 0.25)
X, Y = np.meshgrid(X, Y)
R = np.sqrt(X ** 2 + Y ** 2)
Z = np.sin(R)
 
ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap=plt.get_cmap('rainbow'))
ax.contourf(X, Y, Z, zdir='z', offset=-2, cmap='rainbow')
ax.set_zlim(-2, 2)
 
plt.show()
png

3.6 pandas 合并 concat

python
import pandas as pd 
import numpy as np

concatenating

python
df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.ones((3, 4)) * 1, columns=['a', 'b', 'c', 'd'])
df3 = pd.DataFrame(np.ones((3, 4)) * 2, columns=['a', 'b', 'c', 'd'])
python
df1
abcd
00.00.00.00.0
10.00.00.00.0
20.00.00.00.0
python
df2
abcd
01.01.01.01.0
11.01.01.01.0
21.01.01.01.0
python
df3
abcd
02.02.02.02.0
12.02.02.02.0
22.02.02.02.0

上下合并

python
pd.concat([df1, df2, df3], axis=0) # axis=0 竖向划分操作行,1 横向划分操作列
abcd
00.00.00.00.0
10.00.00.00.0
20.00.00.00.0
01.01.01.01.0
11.01.01.01.0
21.01.01.01.0
02.02.02.02.0
12.02.02.02.0
22.02.02.02.0
python
pd.concat([df1, df2, df3], axis=0, ignore_index=True) # 重新排序索引
abcd
00.00.00.00.0
10.00.00.00.0
20.00.00.00.0
31.01.01.01.0
41.01.01.01.0
51.01.01.01.0
62.02.02.02.0
72.02.02.02.0
82.02.02.02.0

join, ['inner', 'outer']

python
df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.ones((3, 4)) * 1, columns=['b', 'c', 'd', 'e'])
df1
abcd
00.00.00.00.0
10.00.00.00.0
20.00.00.00.0
python
df2
bcde
01.01.01.01.0
11.01.01.01.0
21.01.01.01.0

直接合并出现的结果,会用 nan 填充

python
pd.concat([df1, df2], join='outer')
abcde
00.00.00.00.0NaN
10.00.00.00.0NaN
20.00.00.00.0NaN
0NaN1.01.01.01.0
1NaN1.01.01.01.0
2NaN1.01.01.01.0
python
pd.concat([df1, df2], join='inner') # 只会合并相同的部分
bcd
00.00.00.0
10.00.00.0
20.00.00.0
01.01.01.0
11.01.01.0
21.01.01.0
python
pd.concat([df1, df2], join='inner', ignore_index=True)
bcd
00.00.00.0
10.00.00.0
20.00.00.0
31.01.01.0
41.01.01.0
51.01.01.0

按照索引进行合并

python
pd.concat([df1, df2], axis=1, join_axes=[df1.index])
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

Input In [23], in <cell line: 1>()
----> 1 pd.concat([df1, df2], axis=1, join_axes=[df1.index])


File ~\anaconda3\lib\site-packages\pandas\util\_decorators.py:311, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    305 if len(args) > num_allow_args:
    306     warnings.warn(
    307msg.format(arguments=arguments),
    308         FutureWarning,
    309stacklevel=stacklevel,
    310     )
--> 311 return func(*args, **kwargs)


TypeError: concat() got an unexpected keyword argument 'join_axes'

append 将被移除,不推荐使用

python
df1.append([df2, df3])
C:\Users\gzjzx\AppData\Local\Temp\ipykernel_10380\266511466.py:1: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  df1.append([df2, df3])
abcde
00.00.00.00.0NaN
10.00.00.00.0NaN
20.00.00.00.0NaN
0NaN1.01.01.01.0
1NaN1.01.01.01.0
2NaN1.01.01.01.0
02.02.02.02.0NaN
12.02.02.02.0NaN
22.02.02.02.0NaN

3.7pandas 合并 merge

python
import pandas as pd
 
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A':   ['A0', 'A1', 'A2', 'A3'],
                     'B':   ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'C':   ['C0', 'C1', 'C2', 'C3'],
                      'D':   ['D0', 'D1', 'D2', 'D3']})
python
left
keyAB
0K0A0B0
1K1A1B1
2K2A2B2
3K3A3B3
python
right
keyCD
0K0C0D0
1K1C1D1
2K2C2D2
3K3C3D3
python
pd.merge(left, right, on='key')
keyABCD
0K0A0B0C0D0
1K1A1B1C1D1
2K2A2B2C2D2
3K3A3B3C3D3

在有两个 key 的情况下

python
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A':   ['A0', 'A1', 'A2', 'A3'],
                     'B':   ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C':   ['C0', 'C1', 'C2', 'C3'],
                      'D':   ['D0', 'D1', 'D2', 'D3']})
left
key1key2AB
0K0K0A0B0
1K0K1A1B1
2K1K0A2B2
3K2K1A3B3
python
right
key1key2CD
0K0K0C0D0
1K1K0C1D1
2K1K0C2D2
3K2K0C3D3
python
pd.merge(left, right, on=['key1', 'key2'], how='inner')
key1key2ABCD
0K0K0A0B0C0D0
1K1K0A2B2C1D1
2K1K0A2B2C2D2
python
pd.merge(left, right, on=['key1', 'key2'], how='outer')
key1key2ABCD
0K0K0A0B0C0D0
1K0K1A1B1NaNNaN
2K1K0A2B2C1D1
3K1K0A2B2C2D2
4K2K1A3B3NaNNaN
5K2K0NaNNaNC3D3
python
pd.merge(left, right, on=['key1', 'key2'], how='left')
key1key2ABCD
0K0K0A0B0C0D0
1K0K1A1B1NaNNaN
2K1K0A2B2C1D1
3K1K0A2B2C2D2
4K2K1A3B3NaNNaN
python
pd.merge(left, right, on=['key1', 'key2'], how='right', indicator=True) # indicator 显示合并的方式
key1key2ABCD_merge
0K0K0A0B0C0D0both
1K1K0A2B2C1D1both
2K1K0A2B2C2D2both
3K2K0NaNNaNC3D3right_only
python
pd.merge(left, right, on=['key1', 'key2'], how='right', indicator="indicator_column") # indicator 显示合并的方式
key1key2ABCDindicator_column
0K0K0A0B0C0D0both
1K1K0A2B2C1D1both
2K1K0A2B2C2D2both
3K2K0NaNNaNC3D3right_only
python
pd.merge(left, right, left_index=True, right_index=True, how='outer')
key1_xkey2_xABkey1_ykey2_yCD
0K0K0A0B0K0K0C0D0
1K0K1A1B1K1K0C1D1
2K1K0A2B2K1K0C2D2
3K2K1A3B3K2K0C3D3

处理合并造成的重复问题

python
boys = pd.DataFrame({'k': ['K0', 'K1', 'K2'], 'age': [1, 2, 3]})
girls = pd.DataFrame({'k': ['K0', 'K1', 'K2'], 'age': [4, 5, 6]})
boys
kage
0K01
1K12
2K23
python
girls
kage
0K04
1K15
2K26
python
pd.merge(boys, girls, on='k', suffixes=['_boy', '_girl'], how='inner')
kage_boyage_girl
0K014
1K125
2K236

3.8 pandas plot 画图

python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
0       0.682404
1       1.833947
2       2.639884
3       1.843226
4       3.675730
         ...    
995    72.472419
996    70.507220
997    70.897704
998    71.912970
999    71.542112
Length: 1000, dtype: float64

Series 折线图

python
data = pd.Series(np.random.randn(1000), index=np.arange(1000)).cumsum()
data
0      -1.287330
1      -1.293898
2      -0.166798
3      -1.177414
4      -1.890751
         ...    
995    10.204522
996    10.308410
997     7.887580
998     8.327458
999     9.105820
Length: 1000, dtype: float64
python
data.plot()
plt.show()
png

DataFrame 折线图

python
data = pd.DataFrame(np.random.randn(1000, 4), index=np.arange(1000), columns=list('ABCD')).cumsum()
data
ABCD
00.020454-1.1726891.8437161.544707
1-0.0977440.9895061.0680732.404901
2-0.2731960.758510-1.9057613.836764
30.4012591.408281-2.0438753.762166
41.8213460.941836-1.5498524.347704
...............
995-4.775217-18.948179-47.410079-53.719370
996-3.830409-19.661865-46.040664-54.201339
997-3.061518-20.368645-46.068266-54.682348
998-3.447858-21.113475-46.808006-54.889304
999-2.270221-21.544267-46.512128-54.953465

1000 rows × 4 columns

python
data.plot()
plt.show()

png

plot 的方法:bar 条形图,hist,box,kde,area,scatter,hexbin,pie……

python
ax = data.plot.scatter(x='A', y='B', color='DarkBlue', label='Class1')
data.plot.scatter(x='A', y='C', color='DarkGreen', label='Class2', ax=ax)
plt.show()
png