Common DataFrame Methods (Part 1)

  • Some common methods

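    • head: returns the first five rows by default; pass a number to choose how many rows to return
      >>> import numpy as np
      >>> import pandas as pd
      >>> df = pd.DataFrame({'animal':['alligator', 'bee', 'falcon', 'lion',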
      ...                    'monkey', 'parrot', 'shark', 'whale', 'zebra']})
      >>> df
            animal
      0  alligator
      1        bee
      2     falcon
      3       lion
      4     monkey
      5     parrot
      6      shark
      7      whale
      8      zebra
      >>> df.head()
            animal
      0  alligator
      1        bee
      2     falcon
      3       lion
      4     monkey
      >>> df.head(3)
            animal
      0  alligator
      1        bee
      2     falcon
    • tail: returns the last five rows by default; pass a number to choose how many rows to return
      >>> df.tail()
         animal
      4  monkey
      5  parrot
      6   shark
      7   whale
      8   zebra
      >>> df.tail(3)
        animal
      6  shark
      7  whale
      8  zebra
    • shape: the number of rows and columns of the DataFrame
      >>> df.shape
      (9, 1)
    • info: shows the index, the column data types, and the memory usage
      >>> df.info()
      <class 'pandas.core.frame.DataFrame'>
      RangeIndex: 9 entries, 0 to 8
      Data columns (total 1 columns):
      animal    9 non-null object
      dtypes: object(1)
      memory usage: 152.0+ bytes
    • mean: the mean of each column
      >>> df = pd.DataFrame(np.random.rand(5,5))
      >>> df
                0         1         2         3         4
      0  0.987926  0.556055  0.774863  0.926501  0.029973
      1  0.635812  0.698311  0.402425  0.727675  0.048129
      2  0.001094  0.329329  0.364231  0.754038  0.405464
      3  0.975270  0.388988  0.598047  0.355597  0.189753
      4  0.171976  0.334893  0.931219  0.967504  0.323952
      >>> df.mean()
      0    0.554416
      1    0.461516
      2    0.614157
      3    0.746263
      4    0.199454
      dtype: float64
    • count: the number of non-null values in each column
      >>> df.count()
      0    5
      1    5
      2    5
      3    5
      4    5
      dtype: int64
    • max: the maximum of each column
      >>> df.max()
      0    0.987926
      1    0.698311
      2    0.931219
      3    0.967504
      4    0.405464
      dtype: float64
    • min: the minimum of each column
      >>> df.min()
      0    0.001094
      1    0.329329
      2    0.364231
      3    0.355597
      4    0.029973
      dtype: float64
    • median: the median of each column
      >>> df.median()
      0    0.635812
      1    0.388988
      2    0.598047
      3    0.754038
      4    0.189753
      dtype: float64
    • std: the standard deviation of each column
      >>> df.std()
      0    0.453900
      1    0.161072
      2    0.241820
      3    0.242105
      4    0.165573
      dtype: float64
    • corr: the pairwise correlation coefficients between columns
      >>> df.corr()
                0         1         2         3         4
      0  1.000000  0.517374  0.142778 -0.401999 -0.836540
      1  0.517374  1.000000 -0.262423  0.076483 -0.882558
      2  0.142778 -0.262423  1.000000  0.458608 -0.044045
      3 -0.401999  0.076483  0.458608  1.000000  0.032442
      4 -0.836540 -0.882558 -0.044045  0.032442  1.000000
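
The statistics above are computed on random data, so the numbers differ on every run. As a minimal sketch, the same methods can be applied to a small fixed frame whose results are easy to check by hand. The column names `x` and `y` and their values are made up for illustration and do not come from the examples above.

```python
import pandas as pd

# Hypothetical fixed data so every statistic can be verified by hand.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 4.0, 6.0, 8.0]})

col_mean = df.mean()      # per-column mean
col_median = df.median()  # per-column median
col_std = df.std()        # per-column sample standard deviation (ddof=1)
col_count = df.count()    # per-column number of non-null values
corr = df.corr()          # pairwise Pearson correlation between columns

print(col_mean["x"], col_median["y"], col_count["x"])
print(corr.loc["x", "y"])
```

Because `y` is exactly twice `x`, the correlation between the two columns comes out as 1 (up to floating-point rounding).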
  • Selecting data

    • Slicing can be used with all of the methods below

    • df[column]: selects a single column and returns it as a Series

      >>> df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],index=['cobra', 'viper', 'sidewinder'],columns=['max_speed', 'shield'])
      >>> df
                  max_speed  shield
      cobra               1       2
      viper               4       5
      sidewinder          7       8
      >>> df['max_speed']
      cobra         1
      viper         4
      sidewinder    7
      Name: max_speed, dtype: int64
    • df[[column1,column2]]: selects several columns and returns them as a DataFrame; note the inner list

      >>> df[['max_speed','shield']]
                  max_speed  shield
      cobra               1       2
      viper               4       5
      sidewinder          7       8
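    • df.loc[row_label,column_label]: selects by index label. Giving both a row label and a column label returns a single element; giving only a row label or only a column label returns that row or column as a Series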

      >>> df.loc["cobra","max_speed"]
      1
      >>> df.loc["cobra":,"max_speed":]
                  max_speed  shield
      cobra               1       2
      viper               4       5
      sidewinder          7       8
      >>> df.loc["cobra":,]
                  max_speed  shield
      cobra               1       2
      viper               4       5
      sidewinder          7       8
      
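    • df.iloc[0,0]: selects by integer position. Giving both a row and a column position returns a single element; giving only a row or only a column position returns that row or column as a Series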

      >>> df.iloc[0,0]
      1
      >>> df.iloc[0:,0:]
                  max_speed  shield
      cobra               1       2
      viper               4       5
      sidewinder          7       8
      >>> df.iloc[0:,]
                  max_speed  shield
      cobra               1       2
      viper               4       5
      sidewinder          7       8
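    • df.ix[0,0]: combines loc and iloc, accepting either labels or integer positions; giving only a row or only a column returns that row or column as a Series. Note that ix was deprecated in pandas 0.20 and removed in pandas 1.0, so the calls below only run on older versions; use loc or iloc in current code

      >>> df.ix['viper',1:2]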
      shield    5
      Name: viper, dtype: int64
      >>> df.ix['viper',0:1]
      max_speed    4
      Name: viper, dtype: int64
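
On pandas versions without `ix`, the two mixed label/position lookups above can be rewritten with `loc` or `iloc`. The following is one possible translation, not the only one: it converts positions to labels through `df.columns`, or the label to a position through `df.index.get_loc`. The variable names `by_label` and `by_position` are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=["cobra", "viper", "sidewinder"],
                  columns=["max_speed", "shield"])

# Row given by label, columns given by position: translate the column
# positions to labels via df.columns, then select with loc.
by_label = df.loc["viper", df.columns[1:2]]

# Alternatively, translate the row label to a position and select with iloc.
by_position = df.iloc[df.index.get_loc("viper"), 0:1]

print(by_label)
print(by_position)
```

Both results are Series named `viper`, matching what the `ix` calls return on older versions.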
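    • df.values[:,:]: returns the data as a NumPy array indexed by position; giving only a single row or a single column returns that row or column as an array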

      >>> df.values[0:,:]
      array([[1, 2],
             [4, 5],
             [7, 8]], dtype=int64)
      >>> df.values[:1,:]
      array([[1, 2]], dtype=int64)
    • df[df[column]>10]: selects the rows that satisfy the condition

      >>> df[df['max_speed']>0]
                  max_speed  shield
      cobra               1       2
      viper               4       5
      sidewinder          7       8
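    • df.sort_values([column1,column2],ascending=[False,True]): sorts by a column in ascending or descending order; when two or more columns are given, rows are sorted by the first column and ties are broken by the following columns in order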

      >>> df.sort_values('max_speed')
                  max_speed  shield
      cobra               1       2
      viper               4       5
      sidewinder          7       8
      >>> df.sort_values('max_speed',ascending=False)
                  max_speed  shield
      sidewinder          7       8
      viper               4       5
      cobra               1       2
      >>> df.sort_values(['max_speed','shield'],ascending=False)
                  max_speed  shield
      sidewinder          7       8
      viper               4       5
      cobra               1       2
    • df.groupby([column1,column2]): groups by one or more columns and returns a groupby object

      >>> df.groupby('max_speed')
      <pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001DF64584588>
      >>> df.groupby('max_speed').mean()
                 shield
      max_speed
      1               2
      4               5
      7               8
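
The selection examples above use a frame in which every row passes the filter and every group has a single member, so the effect of each step is hard to see. The sketch below chains a boolean filter, a two-column sort, and a grouped mean on a frame with repeated group keys. The data is adapted from the animal examples in this article; the speed of 58.0 for `monkey` is a made-up value for illustration.

```python
import pandas as pd

df = pd.DataFrame({"class": ["bird", "bird", "mammal", "mammal"],
                   "max_speed": [389.0, 24.0, 80.5, 58.0]},
                  index=["falcon", "parrot", "lion", "monkey"])

# Boolean mask: keep only the rows where max_speed exceeds 50.
fast = df[df["max_speed"] > 50]

# Sort by class ascending, then by max_speed descending within each class.
ordered = fast.sort_values(["class", "max_speed"], ascending=[True, False])

# Average speed per class, computed on the filtered rows.
per_class = ordered.groupby("class")["max_speed"].mean()

print(ordered)
print(per_class)
```

Here `parrot` is removed by the filter, so the `bird` group contains only `falcon`, while the `mammal` group averages `lion` and `monkey`.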
  • Data cleaning

    • df.columns = ['a','b','c','d']: renames the columns
      >>> df.columns=['a','b']
      >>> df
                  a  b
      cobra       1  2
      viper       4  5
      sidewinder  7  8
    • df.rename(mapper,axis): changes the row labels or the column labels; axis chooses between rows ('index') and columns ('columns')
      >>> df.rename(index=str,columns={'a':'A','b':'B'})
                  A  B
      cobra       1  2
      viper       4  5
      sidewinder  7  8
      >>> df.rename(str.lower,axis='columns')
                  a  b
      cobra       1  2
      viper       4  5
      sidewinder  7  8
      >>> df.rename({'cobra':'A','viper':'B','sidewinder':'C'},axis='index')
         a  b
      A  1  2
      B  4  5
      C  7  8
    • df.set_index('column_one'): uses the given column as the index
      >>> df = pd.DataFrame({'month': [1, 4, 7, 10],
      ...                    'year': [2012, 2014, 2013, 2014],
      ...                    'sale': [55, 40, 84, 31]})
      >>> df
         month  year  sale
      0      1  2012    55
      1      4  2014    40
      2      7  2013    84
      3     10  2014    31
      >>> df.set_index('month')
             year  sale
      month
      1      2012    55
      4      2014    40
      7      2013    84
      10     2014    31
    • df.reset_index(): resets the row index to the default integer index; the old index becomes a regular column
      >>> df = pd.DataFrame([('bird', 389.0),
      ...                    ('bird', 24.0),
      ...                    ('mammal', 80.5),
      ...                    ('mammal', np.nan)],
      ...                   index=['falcon', 'parrot', 'lion', 'monkey'],
      ...                   columns=('class', 'max_speed'))
      >>> df
               class  max_speed
      falcon    bird      389.0
      parrot    bird       24.0
      lion    mammal       80.5
      monkey  mammal        NaN
      >>> df.reset_index()
          index   class  max_speed
      0  falcon    bird      389.0
      1  parrot    bird       24.0
      2    lion  mammal       80.5
      3  monkey  mammal        NaN
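    • df.isnull(): checks every element for a missing value and returns a boolean DataFrame of the same shape, True where the value is missing and False otherwise; df.isna() is an alias
      >>> df = pd.DataFrame({'age': [5, 6, np.nan],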
      ...                    'born': [pd.NaT, pd.Timestamp('1939-05-27'),
      ...                             pd.Timestamp('1940-04-25')],
      ...                    'name': ['Alfred', 'Batman', ''],
      ...                    'toy': [None, 'Batmobile', 'Joker']})
      >>> df
         age       born    name        toy
      0  5.0        NaT  Alfred       None
      1  6.0 1939-05-27  Batman  Batmobile
      2  NaN 1940-04-25              Joker
      >>> df.isnull()
           age   born   name    toy
      0  False   True  False   True
      1  False  False  False  False
      2   True  False  False  False
      >>> df.isna()
           age   born   name    toy
      0  False   True  False   True
      1  False  False  False  False
      2   True  False  False  False
    • df.notnull(): the inverse of isnull, True where the value is present and False where it is missing; df.notna() is an alias
      >>> df.notna()
           age   born  name    toy
      0   True  False  True  False
      1   True   True  True   True
      2  False   True  True   True
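    • df.dropna(axis): drops every row or column that contains a missing value; axis=0 drops rows, axis=1 drops columns. A new DataFrame is returned and df itself is unchanged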
      >>> df.dropna() 
         age       born    name        toy
      1  6.0 1939-05-27  Batman  Batmobile
      >>> df 
         age       born    name        toy
      0  5.0        NaT  Alfred       None
      1  6.0 1939-05-27  Batman  Batmobile
      2  NaN 1940-04-25              Joker
    • df.fillna(n): replaces every missing value in the DataFrame with n
      >>> df.fillna('hahaha')
            age                 born    name        toy
      0       5               hahaha  Alfred     hahaha
      1       6  1939-05-27 00:00:00  Batman  Batmobile
      2  hahaha  1940-04-25 00:00:00              Joker
    • df.replace('a','b'): replaces every 'a' in the DataFrame with 'b'
      >>> df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
      ...                    'B': [5, 6, 7, 8, 9],
      ...                    'C': ['a', 'b', 'c', 'd', 'e']})
      >>> df.replace(0, 5)
         A  B  C
      0  5  5  a
      1  1  6  b
      2  2  7  c
      3  3  8  d
      4  4  9  e
    • df[['a','b']].astype(type): changes the data type of the selected columns, that is, the dtype of each underlying Series
      >>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
      >>> df[['A', 'B']].astype(float)
           A    B
      0  1.0  4.0
      1  2.0  5.0
      2  3.0  6.0
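
The cleaning methods in this section are usually combined. The sketch below counts missing values, drops them along either axis, fills them with a per-column default, and then converts a column's dtype. The frame is a reduced version of the one used above, and the fill values `0` and `"unknown"` are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [5, 6, np.nan],
                   "toy": [None, "Batmobile", "Joker"]})

missing_per_column = df.isna().sum()  # number of missing values per column
rows_kept = df.dropna()               # drop rows with any missing value
cols_kept = df.dropna(axis=1)         # drop columns with any missing value

# Fill each column with its own default, then convert age to an integer dtype.
filled = df.fillna({"age": 0, "toy": "unknown"})
filled["age"] = filled["age"].astype("int64")

print(missing_per_column)
print(filled)
```

The `age` column starts out as float because NaN cannot be stored in a plain integer column; once the missing value is filled, `astype` can convert it to integers.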
  • Combining data

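    • df.append(df2): appends the rows of df2 to the end of df, aligning them by column name; if the two frames have different column names, all columns are shown and the cells with no value are filled with NaN. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the calls below only run on older versions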
      >>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
      >>> df
         A  B
      0  1  2
      1  3  4
      >>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
      >>> df2
         A  B
      0  5  6
      1  7  8
      >>> df.append(df2)
         A  B
      0  1  2
      1  3  4
      0  5  6
      1  7  8
      >>> df3 = pd.DataFrame([[5, 6], [7, 8]], columns=list('BC'))
      >>> df3
         B  C
      0  5  6
      1  7  8
      >>> df.append(df3,sort=True) # sort=True sorts the columns of the result
           A  B    C
      0  1.0  2  NaN
      1  3.0  4  NaN
      0  NaN  5  6.0
      1  NaN  7  8.0
      >>> df.append(df3,sort=True,ignore_index=True) # ignore_index=True renumbers the index
           A  B    C
      0  1.0  2  NaN
      1  3.0  4  NaN
      2  NaN  5  6.0
      3  NaN  7  8.0
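
Where `DataFrame.append` is not available, the last result above can be reproduced with `pd.concat`, which accepts the same `sort` and `ignore_index` arguments. A minimal sketch using the same `df` and `df3` (the name `combined` is illustrative):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list("AB"))
df3 = pd.DataFrame([[5, 6], [7, 8]], columns=list("BC"))

# Stack the rows of df3 under df. Columns are aligned by name, and cells
# with no source value become NaN.
combined = pd.concat([df, df3], ignore_index=True, sort=True)

print(combined)
```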
    • pd.concat([df1,df2],axis=1): appends df2 to df1 along the chosen axis; axis=0 (the default) stacks the rows, axis=1 places the columns side by side
      >>> df1 = pd.DataFrame([['a', 1], ['b', 2]],columns=['letter', 'number'])
      >>> df1
        letter  number
      0      a       1
      1      b       2
      >>> df2 = pd.DataFrame([['c', 3], ['d', 4]],columns=['letter', 'number'])
      >>> df2
        letter  number
      0      c       3
      1      d       4
      >>> pd.concat([df1,df2])
        letter  number
      0      a       1
      1      b       2
      0      c       3
      1      d       4
      >>> pd.concat([df1,df2],axis=1)
        letter  number letter  number
      0      a       1      c       3
      1      b       2      d       4
    • df1.join(df2,on=columns,how='outer'): joins the columns of df2 onto df1, matching on the index by default
      >>> df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
      >>> df
        key   A
      0  K0  A0
      1  K1  A1
      2  K2  A2
      3  K3  A3
      4  K4  A4
      5  K5  A5
      >>> df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2'],'B': ['B0', 'B1', 'B2']})
      >>> df2
        key   B
      0  K0  B0
      1  K1  B1
      2  K2  B2
      >>> df.join(df2,lsuffix='_caller',rsuffix='_other') # lsuffix is appended to overlapping column names from the left frame, rsuffix to those from the right
        key_caller   A key_other    B
      0         K0  A0        K0   B0
      1         K1  A1        K1   B1
      2         K2  A2        K2   B2
      3         K3  A3       NaN  NaN
      4         K4  A4       NaN  NaN
      5         K5  A5       NaN  NaN
    • df1.merge(df2,left_on='column1',right_on='column2'): merges df1 and df2 by matching column1 of df1 against column2 of df2
      >>> df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'], 'value': [1, 2, 3, 5]})
      >>> df1
        lkey  value
      0  foo      1
      1  bar      2
      2  baz      3
      3  foo      5
      >>> df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'], 'value': [5, 6, 7, 8]})
      >>> df2
        rkey  value
      0  foo      5
      1  bar      6
      2  baz      7
      3  foo      8
      >>> df1.merge(df2,left_on='lkey',right_on='rkey')
        lkey  value_x rkey  value_y
      0  foo        1  foo        5
      1  foo        1  foo        8
      2  foo        5  foo        5
      3  foo        5  foo        8
      4  bar        2  bar        6
      5  baz        3  baz        7
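
The join example above matches rows by index position and keeps both `key` columns. When the intent is to match on the values of `key` instead, one option is to index the right-hand frame by `key` and pass `on`, or to use `merge` with `how='left'`. This is a sketch on a shortened version of the same frames; the names `joined` and `merged` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"key": ["K0", "K1", "K2", "K3"],
                   "A": ["A0", "A1", "A2", "A3"]})
df2 = pd.DataFrame({"key": ["K0", "K1", "K2"],
                    "B": ["B0", "B1", "B2"]})

# join matches df's "key" column against df2's index, so index df2 by key.
joined = df.join(df2.set_index("key"), on="key")

# merge on the shared column; how="left" keeps every row of df.
merged = df.merge(df2, on="key", how="left")

print(joined)
print(merged)
```

Both calls produce a single `key` column, and the row for `K3` has a missing value in `B` because `df2` contains no matching key.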

This article was first published on LearnKu.com.
