Common DataFrame Methods (Part 1)
Some common methods
- head: returns the first five rows by default; pass the number of rows you want as an argument
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'animal': ['alligator', 'bee', 'falcon', 'lion',
...                               'monkey', 'parrot', 'shark', 'whale', 'zebra']})
>>> df
      animal
0  alligator
1        bee
2     falcon
3       lion
4     monkey
5     parrot
6      shark
7      whale
8      zebra
>>> df.head()
      animal
0  alligator
1        bee
2     falcon
3       lion
4     monkey
- tail: returns the last five rows by default; pass the number of rows you want as an argument
>>> df.tail()
   animal
4  monkey
5  parrot
6   shark
7   whale
8   zebra
>>> df.tail(3)
  animal
6  shark
7  whale
8  zebra
- shape: an attribute holding the number of rows and columns of the DataFrame as a tuple
>>> df.shape
(9, 1)
- info: prints the index, the column data types, and the memory usage
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 1 columns):
animal    9 non-null object
dtypes: object(1)
memory usage: 152.0+ bytes
- mean: the mean of each column (the example uses random data, so your numbers will differ)
>>> df = pd.DataFrame(np.random.rand(5, 5))
>>> df
          0         1         2         3         4
0  0.987926  0.556055  0.774863  0.926501  0.029973
1  0.635812  0.698311  0.402425  0.727675  0.048129
2  0.001094  0.329329  0.364231  0.754038  0.405464
3  0.975270  0.388988  0.598047  0.355597  0.189753
4  0.171976  0.334893  0.931219  0.967504  0.323952
>>> df.mean()
0    0.554416
1    0.461516
2    0.614157
3    0.746263
4    0.199454
dtype: float64
- count: the number of non-null values in each column
>>> df.count()
0    5
1    5
2    5
3    5
4    5
dtype: int64
- max: the maximum of each column
>>> df.max()
0    0.987926
1    0.698311
2    0.931219
3    0.967504
4    0.405464
dtype: float64
- min: the minimum of each column
>>> df.min()
0    0.001094
1    0.329329
2    0.364231
3    0.355597
4    0.029973
dtype: float64
- median: the median of each column
>>> df.median()
0    0.635812
1    0.388988
2    0.598047
3    0.754038
4    0.189753
dtype: float64
- std: the standard deviation of each column
>>> df.std()
0    0.453900
1    0.161072
2    0.241820
3    0.242105
4    0.165573
dtype: float64
- corr: the pairwise correlation coefficients between columns
>>> df.corr()
          0         1         2         3         4
0  1.000000  0.517374  0.142778 -0.401999 -0.836540
1  0.517374  1.000000 -0.262423  0.076483 -0.882558
2  0.142778 -0.262423  1.000000  0.458608 -0.044045
3 -0.401999  0.076483  0.458608  1.000000  0.032442
4 -0.836540 -0.882558 -0.044045  0.032442  1.000000
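Because the statistics above are computed on random numbers, they cannot be reproduced exactly. The following is a minimal sketch on a small fixed DataFrame; the `scores` data and its column names are made up for illustration and are not part of the original examples.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed data (an assumption, not from the article) so the results are reproducible.
scores = pd.DataFrame({
    "math": [90.0, 80.0, np.nan, 70.0],
    "physics": [60.0, 70.0, 80.0, 90.0],
})

n_rows, n_cols = scores.shape   # shape is an attribute, not a method
non_null = scores.count()       # NaN is excluded from the count
means = scores.mean()           # NaN is skipped by default (skipna=True)
medians = scores.median()
first_two = scores.head(2)      # first two rows
last_two = scores.tail(2)       # last two rows

print(n_rows, n_cols)
print(non_null)
print(means)
```

Note that the missing value in `math` lowers its count but does not turn the mean into NaN, because the reductions skip missing values by default.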
Selecting data
Slicing can be used with the indexers listed below.
df[column]: selects a single column and returns it as a Series
>>> df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
...                   index=['cobra', 'viper', 'sidewinder'],
...                   columns=['max_speed', 'shield'])
>>> df
            max_speed  shield
cobra               1       2
viper               4       5
sidewinder          7       8
>>> df['max_speed']
cobra         1
viper         4
sidewinder    7
Name: max_speed, dtype: int64
df[[column1, column2]]: selects several columns (note the list inside the brackets) and returns a DataFrame
>>> df[['max_speed', 'shield']]
            max_speed  shield
cobra               1       2
viper               4       5
sidewinder          7       8
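df.loc[row_label, column_label]: selects by index label. Given both a row and a column label it returns a single element; given only a row label or only a column label it returns that row or column as a Series
>>> df.loc["cobra", "max_speed"]
1
>>> df.loc["cobra":, "max_speed":]
            max_speed  shield
cobra               1       2
viper               4       5
sidewinder          7       8
>>> df.loc["cobra":,]
            max_speed  shield
cobra               1       2
viper               4       5
sidewinder          7       8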
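df.iloc[row_position, column_position]: selects by integer position. Given both a row and a column position it returns a single element; given only a row position or only a column position it returns that row or column as a Series
>>> df.iloc[0, 0]
1
>>> df.iloc[0:, 0:]
            max_speed  shield
cobra               1       2
viper               4       5
sidewinder          7       8
>>> df.iloc[0:,]
            max_speed  shield
cobra               1       2
viper               4       5
sidewinder          7       8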
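One difference between the two indexers is easy to miss: a label slice with loc includes its end label, while a position slice with iloc excludes its stop position. A small sketch using the same snake DataFrame as above (the variable names are only for illustration):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=["cobra", "viper", "sidewinder"],
                  columns=["max_speed", "shield"])

row = df.loc["viper"]                # only a row label: that row as a Series
col = df.loc[:, "shield"]            # only a column label: that column as a Series
by_label = df.loc["cobra":"viper"]   # label slice: both endpoints are included
by_position = df.iloc[0:1]           # position slice: the stop position is excluded

print(by_label)
print(by_position)
```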
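df.ix[row, column]: combines loc and iloc, accepting either positions or index labels. Given only a row or only a column it returns that row or column as a Series. Note that .ix was deprecated in pandas 0.20 and removed in pandas 1.0, so this example only runs on older versions; use loc or iloc in new code
>>> df.ix['viper', 1:2]
shield    5
Name: viper, dtype: int64
>>> df.ix['viper', 0:1]
max_speed    4
Name: viper, dtype: int64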
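On current pandas versions the mixed label and position lookup from the .ix example has to be spelled out with loc and iloc. Two possible spellings are sketched below; they are one way to get the same result, not the only one.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=["cobra", "viper", "sidewinder"],
                  columns=["max_speed", "shield"])

# Label for the row, then positions within that row.
mixed_a = df.loc["viper"].iloc[1:2]

# Translate the row label to its position first, then use iloc only.
mixed_b = df.iloc[df.index.get_loc("viper"), 1:2]

print(mixed_a)
```

The second form avoids chained indexing, which matters mostly when you intend to assign to the selection.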
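df.values[:, :]: df.values is the underlying NumPy array, indexed by position. Indexing a single row or a single column returns it as a one-dimensional array
>>> df.values[0:, :]
array([[1, 2],
       [4, 5],
       [7, 8]], dtype=int64)
>>> df.values[:1, :]
array([[1, 2]], dtype=int64)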
df[df[column] > 10]: Boolean indexing; returns the rows for which the condition is True
>>> df[df['max_speed'] > 0]
            max_speed  shield
cobra               1       2
viper               4       5
sidewinder          7       8
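The condition in the example above is true for every row, so nothing is filtered out. The sketch below uses thresholds that actually drop rows and shows how to combine two conditions; the thresholds are arbitrary values chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=["cobra", "viper", "sidewinder"],
                  columns=["max_speed", "shield"])

# One condition: keep the rows where max_speed is greater than 3.
fast = df[df["max_speed"] > 3]

# Two conditions: combine with & (and) or | (or); each condition needs its own parentheses.
fast_and_light = df[(df["max_speed"] > 3) & (df["shield"] < 8)]

print(fast)
print(fast_and_light)
```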
df.sort_values([column1, column2], ascending=[False, True]): sorts by one or more columns. With several columns, rows are sorted by the first column and ties are broken by the following ones; ascending can be a single Boolean or a list with one Boolean per column
>>> df.sort_values('max_speed')
            max_speed  shield
cobra               1       2
viper               4       5
sidewinder          7       8
>>> df.sort_values('max_speed', ascending=False)
            max_speed  shield
sidewinder          7       8
viper               4       5
cobra               1       2
>>> df.sort_values(['max_speed', 'shield'], ascending=False)
            max_speed  shield
sidewinder          7       8
viper               4       5
cobra               1       2
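The snake data has no ties, so the second sort column never comes into play. Here is a minimal sketch with a made-up `results` DataFrame (hypothetical data, not from the original) where the per-column ascending list matters:

```python
import pandas as pd

# Hypothetical data with repeated team values so the second sort key is used.
results = pd.DataFrame({
    "team": ["a", "b", "a", "b"],
    "score": [3, 1, 2, 4],
})

# team descending first, then score ascending within each team.
ranked = results.sort_values(["team", "score"], ascending=[False, True])

print(ranked)
```

sort_values returns a new DataFrame and keeps the original row labels, which is why the index of `ranked` is no longer in order.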
df.groupby([column1, column2]): groups the rows by one or more columns and returns a groupby object; call an aggregation such as mean on it to get a result
>>> df.groupby('max_speed')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001DF64584588>
>>> df.groupby('max_speed').mean()
           shield
max_speed        
1               2
4               5
7               8
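In the example above every group contains exactly one row, so the aggregation is hard to see. A short sketch with a made-up `animals` DataFrame (hypothetical values) where each group has two rows:

```python
import pandas as pd

# Hypothetical data: two rows per class so the aggregation has something to combine.
animals = pd.DataFrame({
    "class": ["bird", "bird", "mammal", "mammal"],
    "max_speed": [389.0, 24.0, 80.5, 58.0],
})

# Select one column of the grouped object and compute several aggregates at once.
summary = animals.groupby("class")["max_speed"].agg(["mean", "max"])

print(summary)
```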
Data cleaning
- df.columns = ['a', 'b', 'c', 'd']: renames all columns at once; the list must have one name per column
>>> df.columns = ['a', 'b']
>>> df
            a  b
cobra       1  2
viper       4  5
sidewinder  7  8
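- df.rename(mapper, axis): renames row or column labels using a dict or a function; axis selects whether the mapper applies to the index or to the columns. You can also pass index= and columns= directly
>>> df.rename(index=str, columns={'a': 'A', 'b': 'B'})
            A  B
cobra       1  2
viper       4  5
sidewinder  7  8
>>> df.rename(str.lower, axis='columns')
            a  b
cobra       1  2
viper       4  5
sidewinder  7  8
>>> df.rename({'cobra': 'A', 'viper': 'B', 'sidewinder': 'C'}, axis='index')
   a  b
A  1  2
B  4  5
C  7  8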
- df.set_index('column_one'): uses the given column as the index
>>> df = pd.DataFrame({'month': [1, 4, 7, 10],
...                    'year': [2012, 2014, 2013, 2014],
...                    'sale': [55, 40, 84, 31]})
>>> df
   month  year  sale
0      1  2012    55
1      4  2014    40
2      7  2013    84
3     10  2014    31
>>> df.set_index('month')
       year  sale
month            
1      2012    55
4      2014    40
7      2013    84
10     2014    31
- df.reset_index: resets the row index to the default integer index; the old index is inserted as a column
>>> df = pd.DataFrame([('bird', 389.0),
...                    ('bird', 24.0),
...                    ('mammal', 80.5),
...                    ('mammal', np.nan)],
...                   index=['falcon', 'parrot', 'lion', 'monkey'],
...                   columns=('class', 'max_speed'))
>>> df
         class  max_speed
falcon    bird      389.0
parrot    bird       24.0
lion    mammal       80.5
monkey  mammal        NaN
>>> df.reset_index()
    index   class  max_speed
0  falcon    bird      389.0
1  parrot    bird       24.0
2    lion  mammal       80.5
3  monkey  mammal        NaN
- df.isnull: checks every element and returns a Boolean DataFrame of the same shape, True where the value is missing (NaN, NaT, or None) and False otherwise. An empty string does not count as missing. df.isna is an alias for the same method
>>> df = pd.DataFrame({'age': [5, 6, np.nan],
...                    'born': [pd.NaT, pd.Timestamp('1939-05-27'),
...                             pd.Timestamp('1940-04-25')],
...                    'name': ['Alfred', 'Batman', ''],
...                    'toy': [None, 'Batmobile', 'Joker']})
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
>>> df.isnull()
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False
>>> df.isna()
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False
- df.notnull: the inverse of isnull; returns True where the value is not missing and False where it is. df.notna is an alias for the same method
>>> df.notna()
     age   born  name    toy
0   True  False  True  False
1   True   True  True   True
2  False   True  True   True
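- df.dropna(axis): drops every row or column that contains a missing value; axis=0 (the default) drops rows and axis=1 drops columns. It returns a new DataFrame and leaves df unchanged
>>> df.dropna()
   age       born    name        toy
1  6.0 1939-05-27  Batman  Batmobile
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker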
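The example above only shows the default row-wise behaviour. The sketch below, on a reduced version of the same data (the `born` column is left out to keep it short), also drops by column and restricts the check to one column with the subset parameter.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({
    "age": [5.0, 6.0, np.nan],
    "name": ["Alfred", "Batman", ""],
    "toy": [None, "Batmobile", "Joker"],
})

rows_kept = people.dropna()                  # axis=0: drop rows containing any missing value
cols_kept = people.dropna(axis=1)            # axis=1: drop columns containing any missing value
age_known = people.dropna(subset=["age"])    # only look at the age column when deciding

print(rows_kept)
print(cols_kept)
print(age_known)
```

Only `name` survives the column-wise drop, because the empty string in its last row is not treated as missing.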
- df.fillna(n): replaces every missing value in the DataFrame with n
>>> df.fillna('hahaha')
      age                 born    name        toy
0       5               hahaha  Alfred     hahaha
1       6  1939-05-27 00:00:00  Batman  Batmobile
2  hahaha  1940-04-25 00:00:00              Joker
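Filling every column with the same string, as above, turns numeric and datetime columns into mixed-type columns. A common alternative is to pass a dict with one fill value per column. A minimal sketch, with fill values chosen arbitrarily for illustration:

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({
    "age": [5.0, 6.0, np.nan],
    "toy": [None, "Batmobile", "Joker"],
})

# One fill value per column; columns not listed in the dict are left untouched.
filled = people.fillna({"age": 0, "toy": "unknown"})

print(filled)
```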
- df.replace('a', 'b'): replaces every occurrence of 'a' in the DataFrame with 'b'
>>> df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
...                    'B': [5, 6, 7, 8, 9],
...                    'C': ['a', 'b', 'c', 'd', 'e']})
>>> df.replace(0, 5)
   A  B  C
0  5  5  a
1  1  6  b
2  2  7  c
3  3  8  d
4  4  9  e
- df[['a', 'b']].astype(type): converts the selected columns to the given data type, that is, changes the dtype of each selected Series. astype returns a new object, so assign the result back if you want to keep it
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> df.rename(index=str, columns={"A": "a", "B": "c"})
   a  c
0  1  4
1  2  5
2  3  6
>>> df = pd.DataFrame([('bird', 389.0), ('bird', 24.0), ('mammal', 80.5), ('mammal', np.nan)],
...                   index=['falcon', 'parrot', 'lion', 'monkey'],
...                   columns=('class', 'max_speed'))
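The listing above does not actually call astype, so here is a minimal sketch of the conversion on a small made-up DataFrame (the column names and values are assumptions for illustration only):

```python
import pandas as pd

table = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": ["x", "y", "z"]})

# Convert a selection of columns; the result is a new DataFrame with only those columns.
as_float = table[["a", "b"]].astype("float64")

# Convert some columns in place of the originals by passing a dict of column -> dtype.
partly_converted = table.astype({"a": "float64"})

print(as_float.dtypes)
print(partly_converted.dtypes)
```

The dict form keeps the other columns in the result, which is usually what you want when cleaning a whole table.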
Combining data
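- df.append(df2): appends the rows of df2 to the end of df, aligning on column names. If the two DataFrames have different columns, the result contains all of them and cells with no value are filled with NaN. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so this example only runs on older versions; pd.concat is the replacement
>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
>>> df
   A  B
0  1  2
1  3  4
>>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
>>> df2
   A  B
0  5  6
1  7  8
>>> df.append(df2)
   A  B
0  1  2
1  3  4
0  5  6
1  7  8
>>> df3 = pd.DataFrame([[5, 6], [7, 8]], columns=list('BC'))
>>> df3
   B  C
0  5  6
1  7  8
>>> df.append(df3, sort=True)  # sort=True sorts the column labels of the result
     A  B    C
0  1.0  2  NaN
1  3.0  4  NaN
0  NaN  5  6.0
1  NaN  7  8.0
>>> df.append(df3, sort=True, ignore_index=True)  # ignore_index=True renumbers the rows from 0
     A  B    C
0  1.0  2  NaN
1  3.0  4  NaN
2  NaN  5  6.0
3  NaN  7  8.0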
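On pandas 2.0 and later, where DataFrame.append no longer exists, the same result can be obtained with pd.concat. A minimal sketch reusing the two DataFrames from the example above:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list("AB"))
df3 = pd.DataFrame([[5, 6], [7, 8]], columns=list("BC"))

# Stack the rows of df3 under df; columns missing on one side are filled with NaN.
# sort=True orders the column labels, ignore_index=True renumbers the rows from 0.
combined = pd.concat([df, df3], sort=True, ignore_index=True)

print(combined)
```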
- pd.concat([df1, df2], axis=1): concatenates the DataFrames along the chosen axis; axis=0 (the default) stacks the rows of df2 under df1, and axis=1 places the columns of df2 to the right of df1
>>> df1 = pd.DataFrame([['a', 1], ['b', 2]], columns=['letter', 'number'])
>>> df1
  letter  number
0      a       1
1      b       2
>>> df2 = pd.DataFrame([['c', 3], ['d', 4]], columns=['letter', 'number'])
>>> df2
  letter  number
0      c       3
1      d       4
>>> pd.concat([df1, df2])
  letter  number
0      a       1
1      b       2
0      c       3
1      d       4
>>> pd.concat([df1, df2], axis=1)
  letter  number letter  number
0      a       1      c       3
1      b       2      d       4
- df1.join(df2, on=column, how='outer'): joins the columns of df2 onto df1. By default the join aligns on the index and keeps every row of df1 (how='left'); on names a column of df1 to match against the index of df2
>>> df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
...                    'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
>>> df
  key   A
0  K0  A0
1  K1  A1
2  K2  A2
3  K3  A3
4  K4  A4
5  K5  A5
>>> df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
...                     'B': ['B0', 'B1', 'B2']})
>>> df2
  key   B
0  K0  B0
1  K1  B1
2  K2  B2
>>> df.join(df2, lsuffix='_caller', rsuffix='_other')  # lsuffix is appended to overlapping column names from df, rsuffix to those from df2
  key_caller   A key_other    B
0         K0  A0        K0   B0
1         K1  A1        K1   B1
2         K2  A2        K2   B2
3         K3  A3       NaN  NaN
4         K4  A4       NaN  NaN
5         K5  A5       NaN  NaN
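The example above joins on the row index, which is why both key columns appear in the result. To join on the key values instead, one option is to make key the index of the right-hand DataFrame and pass on='key'. A minimal sketch with the same data:

```python
import pandas as pd

df = pd.DataFrame({"key": ["K0", "K1", "K2", "K3", "K4", "K5"],
                   "A": ["A0", "A1", "A2", "A3", "A4", "A5"]})
df2 = pd.DataFrame({"key": ["K0", "K1", "K2"],
                    "B": ["B0", "B1", "B2"]})

# Match df's key column against df2's index; the default how='left' keeps all rows of df.
joined = df.join(df2.set_index("key"), on="key")

print(joined)
```

Rows of df whose key has no match in df2 get NaN in column B.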
- df1.merge(df2, left_on='column1', right_on='column2'): a database-style join that matches column1 of df1 against column2 of df2 (an inner join by default). Overlapping column names other than the keys get the suffixes _x and _y by default
>>> df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
...                     'value': [1, 2, 3, 5]})
>>> df1
  lkey  value
0  foo      1
1  bar      2
2  baz      3
3  foo      5
>>> df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
...                     'value': [5, 6, 7, 8]})
>>> df2
  rkey  value
0  foo      5
1  bar      6
2  baz      7
3  foo      8
>>> df1.merge(df2, left_on='lkey', right_on='rkey')
  lkey  value_x rkey  value_y
0  foo        1  foo        5
1  foo        1  foo        8
2  foo        5  foo        5
3  foo        5  foo        8
4  bar        2  bar        6
5  baz        3  baz        7
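The default _x and _y suffixes are not very descriptive. The sketch below repeats the merge above with explicit suffixes; the suffix names are arbitrary choices for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"lkey": ["foo", "bar", "baz", "foo"],
                    "value": [1, 2, 3, 5]})
df2 = pd.DataFrame({"rkey": ["foo", "bar", "baz", "foo"],
                    "value": [5, 6, 7, 8]})

# Every 'foo' row on the left pairs with every 'foo' row on the right (2 x 2 = 4 rows),
# plus one row each for 'bar' and 'baz'.
merged = df1.merge(df2, left_on="lkey", right_on="rkey",
                   suffixes=("_left", "_right"))

print(merged)
```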