Pandas Study Notes for Python 3
Published: 2019-05-26

Pandas is an open-source, BSD-licensed Python library that provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It is widely used in academic and commercial fields such as finance, economics, statistics and analytics. Its data structures also convert easily to common file formats such as CSV, Excel, JSON and HTML.
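
To make the format conversions above concrete, here is a minimal sketch; the frame, the file name users.csv and the JSON string shown are illustrative examples rather than part of the original post:

>>> import pandas as pd
>>> df = pd.DataFrame({'age': [18, 30], 'city': ['BeiJing', 'ShangHai']},
...                   index=pd.Index(['Tom', 'Bob'], name='name'))
>>> df.to_csv('users.csv')              # plain comma-separated text
>>> df.to_json()                        # JSON keyed by column, then by index label
'{"age":{"Tom":18,"Bob":30},"city":{"Tom":"BeiJing","Bob":"ShangHai"}}'
>>> html = df.to_html()                 # an HTML <table> fragment (string, output omitted)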

Series: basic functionality

Creating data

>>> import pandas as pd
# Create a Series from a dict
>>> data = {'a':0, 'b':1, 'c':2}
>>> data
{'a': 0, 'b': 1, 'c': 2}
>>> pd.Series(data)
a    0
b    1
c    2
# Create a Series from a list
>>> data = [18, 30, 25, 40]
>>> user_age = pd.Series(data)
>>> user_age
0    18
1    30
2    25
3    40
dtype: int64

Adding an index

>>> user_age.index = ["Tom", "Bob", "Mary", "James"]
>>> user_age
Tom      18
Bob      30
Mary     25
James    40
dtype: int64

Naming the data

>>> user_age.name = 'user_age'
>>> user_age
Tom      18
Bob      30
Mary     25
James    40
Name: user_age, dtype: int64
>>> user_age.index.name = 'name'
>>> user_age
name
Tom      18
Bob      30
Mary     25
James    40
Name: user_age, dtype: int64

Accessing data

>>> user_age.ndim
1
>>> user_age.size
4
>>> user_age.index
Index(['Tom', 'Bob', 'Mary', 'James'], dtype='object')
>>> user_age.values
array([18, 30, 25, 40])
>>> user_age.head(2)
name
Tom    18
Bob    30
Name: user_age, dtype: int64
>>> user_age.tail(2)
name
Mary     25
James    40
Name: user_age, dtype: int64
>>> user_age[-2:]
name
Mary     25
James    40
Name: user_age, dtype: int64
>>> user_age.axes
[Index(['Tom', 'Bob', 'Mary', 'James'], dtype='object', name='name')]
>>> user_age.empty
False
>>> user_age + 1
Tom      19
Bob      31
Mary     26
James    41
dtype: int64

Appending data

>>> user_age.append(pd.Series({'Looking': 26}))
Tom        18
Bob        30
Mary       25
James      40
Looking    26
dtype: int64
>>> user_age
Tom      18
Bob      30
Mary     25
James    40
dtype: int64
>>> user_age = user_age.append(pd.Series({'Looking': 26}))

Updating data

>>> user_age.at['Bob'] = 31
>>> user_age
Tom        18
Bob        31
Mary       25
James      40
Looking    26
dtype: int64

Deleting data

>>> user_age.drop('Looking')
Tom      18
Bob      31
Mary     25
James    40
dtype: int64
>>> user_age = user_age.drop('Looking')

Sorting data

>>> user_age
Tom      18
Bob      31
Mary     25
James    40
dtype: int64
>>> user_age.sort_values()
Tom      18
Mary     25
Bob      31
James    40
dtype: int64

DataFrame: basic functionality

Creating data

# Create a DataFrame from a dict
>>> import pandas as pd
>>> index = pd.Index(data=["Tom", "Bob", "Mary", "James"], name="name")
>>> data = {
...     "age": [18, 30, 25, 40],
...     "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"]
... }
>>> user_info = pd.DataFrame(data=data, index=index)
>>> user_info
       age       city
name
Tom     18    BeiJing
Bob     30   ShangHai
Mary    25  GuangZhou
James   40   ShenZhen
# Create a DataFrame from a list of lists
>>> data = [[18, "BeiJing"],
...         [30, "ShangHai"],
...         [25, "GuangZhou"],
...         [40, "ShenZhen"]]
>>> columns = ["age", "city"]
>>> index = pd.Index(data=["Tom", "Bob", "Mary", "James"], name="name")
>>> user_info = pd.DataFrame(data=data, index=index, columns=columns)
>>> user_info
       age       city
name
Tom     18    BeiJing
Bob     30   ShangHai
Mary    25  GuangZhou
James   40   ShenZhen

Accessing data

>>> user_info
       age       city
name
Tom     18    BeiJing
Bob     30   ShangHai
Mary    25  GuangZhou
James   40   ShenZhen
>>> user_info.info()
<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, Tom to James
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   age     4 non-null      int64
 1   city    4 non-null      object
dtypes: int64(1), object(1)
memory usage: 96.0+ bytes
>>> user_info.shape
(4, 2)
>>> user_info.ndim
2
>>> user_info.size
8
>>> user_info.empty
False
>>> user_info.index
Index(['Tom', 'Bob', 'Mary', 'James'], dtype='object', name='name')
>>> user_info.columns
Index(['age', 'city'], dtype='object')

Querying data

>>> user_info
       age       city
name
Tom     18    BeiJing
Bob     30   ShangHai
Mary    25  GuangZhou
James   40   ShenZhen
# Select a row
>>> user_info.loc['Bob']
age           30
city    ShangHai
Name: Bob, dtype: object
# Select several rows
>>> user_info[1:3]
      age       city
name
Bob    30   ShangHai
Mary   25  GuangZhou
# Select a column
>>> user_info['city']
name
Tom        BeiJing
Bob       ShangHai
Mary     GuangZhou
James     ShenZhen
Name: city, dtype: object
# Select several columns
>>> user_info[['city', 'age']]
            city  age
name
Tom      BeiJing   18
Bob     ShangHai   30
Mary   GuangZhou   25
James   ShenZhen   40
>>> user_info.head(2)
      age      city
name
Tom    18   BeiJing
Bob    30  ShangHai
>>> user_info.tail(2)
       age       city
name
Mary    25  GuangZhou
James   40   ShenZhen
>>> user_info.values
array([[18, 'BeiJing'],
       [30, 'ShangHai'],
       [25, 'GuangZhou'],
       [40, 'ShenZhen']], dtype=object)
>>> user_info.T
name      Tom       Bob       Mary     James
age        18        30         25        40
city  BeiJing  ShangHai  GuangZhou  ShenZhen

Adding data

# Add a column (a DataFrame is column-oriented by default)
>>> user_info.loc[:, 'sex'] = ["male", "male", "female", "male"]
# or: user_info['sex'] = ["male", "male", "female", "male"]
>>> user_info
       age       city     sex
name
Tom     18    BeiJing    male
Bob     30   ShangHai    male
Mary    25  GuangZhou  female
James   40   ShenZhen    male
# Add a row
>>> user_info.loc['Lookng', :] = [26, 'Xian', 'male']
>>> user_info
         age       city     sex
name
Tom     18.0    BeiJing    male
Bob     30.0   ShangHai    male
Mary    25.0  GuangZhou  female
James   40.0   ShenZhen    male
Lookng  26.0       Xian    male
>>> user_info['age'] = user_info['age'].astype("int")
>>> user_info
        age       city     sex
name
Tom      18    BeiJing    male
Bob      30   ShangHai    male
Mary     25  GuangZhou  female
James    40   ShenZhen    male
Lookng   26       Xian    male

Deleting data

The difference between drop and pop is that pop returns the part that was removed, while drop returns what is left after the removal. In addition, pop modifies the DataFrame itself.

>>> user_info
       age       city
name
Tom     18    BeiJing
Bob     30   ShangHai
Mary    25  GuangZhou
James   40   ShenZhen
>>> user_info.drop('city', axis=1)
       age
name
Tom     18
Bob     30
Mary    25
James   40
>>> user_info.drop('Tom', axis=0)
       age       city
name
Bob     30   ShangHai
Mary    25  GuangZhou
James   40   ShenZhen
>>> user_info.pop('city')
name
Tom        BeiJing
Bob       ShangHai
Mary     GuangZhou
James     ShenZhen
Name: city, dtype: object
>>> user_info
       age
name
Tom     18
Bob     30
Mary    25
James   40

Reading and writing CSV

Of course, you can always read and write CSV files directly with the csv module; that is not covered here. This section shows how to read and write CSV with pandas. Excel works much the same way via .to_excel() and pd.read_excel() (see the sketch after the CSV example below).

>>> import pandas as pd
>>> data = {
...     "age": [18, 30, 25, 40],
...     "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"]
... }
>>> index = pd.Index(data=["Tom", "Bob", "Mary", "James"], name="name")
>>> user_info = pd.DataFrame(data=data, index=index)
>>> user_info
       age       city
name
Tom     18    BeiJing
Bob     30   ShangHai
Mary    25  GuangZhou
James   40   ShenZhen
>>> user_info.to_csv('test.csv')
>>> pd.read_csv('test.csv')
    name  age       city
0    Tom   18    BeiJing
1    Bob   30   ShangHai
2   Mary   25  GuangZhou
3  James   40   ShenZhen
# With index=False the index is no longer written to the csv file.
>>> user_info
       age       city
name
Tom     18    BeiJing
Bob     30   ShangHai
Mary    25  GuangZhou
James   40   ShenZhen
>>> user_info.to_csv('test.csv', index=False)
>>> pd.read_csv('test.csv')
   age       city
0   18    BeiJing
1   30   ShangHai
2   25  GuangZhou
3   40   ShenZhen
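
The Excel round trip mentioned above is almost identical. A minimal sketch, assuming an Excel engine such as openpyxl is installed and using an arbitrary file name test.xlsx:

>>> user_info.to_excel('test.xlsx', sheet_name='users')
>>> pd.read_excel('test.xlsx', sheet_name='users', index_col=0)
       age       city
name
Tom     18    BeiJing
Bob     30   ShangHai
Mary    25  GuangZhou
James   40   ShenZhen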

DataFrame statistical functions

>>> import pandas as pd>>> import numpy as np>>> d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack']),   'Age':pd.Series([25,26,25,23,30,29,23]),   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}>>> df = pd.DataFrame(d)>>> df   Age   Name  Rating0   25    Tom    4.231   26  James    3.242   25  Ricky    3.983   23    Vin    2.564   30  Steve    3.205   29  Minsu    4.606   23   Jack    3.80>>> df.sum()>>> df.sum(0)>>> df.sum(axis=0)Age                                  181Name      TomJamesRickyVinSteveMinsuJackRating                             25.61dtype: object>>> df.sum(1)>>> df.sum(axis=1)  # 横向按行求和0    29.231    29.242    28.983    25.564    33.205    33.606    26.80dtype: float64>>> df.mean()Age       25.857143Rating     3.658571dtype: float64>>> df.mean(1)0    14.6151    14.6202    14.4903    12.7804    16.6005    16.8006    13.400dtype: float64>>> df.std()Age       2.734262Rating    0.698628dtype: float64>>> df.std(1)0    14.6866081    16.0937502    14.8633853    14.4532634    18.9504625    17.2534056    13.576450dtype: float64数据汇总>>> d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack',   'Lee','David','Gasper','Betina','Andres']),   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}>>> df = pd.DataFrame(d)>>> df    Age    Name  Rating0    25     Tom    4.231    26   James    3.242    25   Ricky    3.983    23     Vin    2.564    30   Steve    3.205    29   Minsu    4.606    23    Jack    3.807    34     Lee    3.788    40   David    2.989    30  Gasper    4.8010   51  Betina    4.1011   46  Andres    3.65>>> df.describe()             Age     Ratingcount  12.000000  12.000000mean   31.833333   3.743333std     9.232682   0.661628min    23.000000   2.56000025%    25.000000   3.23000050%    29.500000   3.79000075%    35.500000   4.132500max    51.000000   4.800000>>> df.describe(include=['object'])          Namecount       12unique      12top     Andresfreq         1>>> df.describe(include='all')              Age    Name     Ratingcount   12.000000      12  12.000000unique        NaN      12        NaNtop           NaN  Andres        NaNfreq          NaN       1        NaNmean    31.833333     NaN   3.743333std      9.232682     NaN   0.661628min     23.000000     NaN   2.56000025%     25.000000     NaN   3.23000050%     29.500000     NaN   3.79000075%     35.500000     NaN   4.132500max     51.000000     NaN   4.800000>>> df.count()Age       12Name      12Rating    12dtype: int64>>> df.median()Age       29.50Rating     3.79dtype: float64>>> df.min()Age           23Name      AndresRating      2.56dtype: object>>> df.max()Age        51Name      VinRating    4.8dtype: object>>> df       col1      col2      col30 -1.330380 -0.033002 -1.0889041  0.505186  0.561735  0.4937532 -0.370248 -0.064979 -0.8992653 -0.292870  0.070235 -0.4397144  0.842472 -1.402209 -1.085047>>> df + 2       col1      col2      col30  0.669620  1.966998  0.9110961  2.505186  2.561735  2.4937532  1.629752  1.935021  1.1007353  1.707130  2.070235  1.5602864  2.842472  0.597791  0.914953>>> df.apply(np.mean)col1   -0.129168col2   -0.173644col3   -0.603835dtype: float64>>> df.mean()col1   -0.129168col2   -0.173644col3   -0.603835dtype: float64>>> df.apply(np.sum)col1   -0.645839col2   -0.868221col3   -3.019177dtype: float64>>> df.sum()col1   -0.645839col2   -0.868221col3   -3.019177dtype: float64

Reindexing

>>> df = pd.DataFrame({   'A': pd.date_range(start='2016-01-01',periods=N,freq='D'),   'x': np.linspace(0,stop=N-1,num=N),   'y': np.random.rand(N),   'C': np.random.choice(['Low','Medium','High'],N).tolist(),   'D': np.random.normal(100, 10, size=(N)).tolist()})>>> df_reindexed = df.reindex(index=[0, 2, 5], columns=['A', 'C', 'B'])>>> df_reindexed           A       C   B0 2016-01-01  Medium NaN2 2016-01-03    High NaN5 2016-01-06  Medium NaN>>> df1 = pd.DataFrame(np.random.randn(10,3),columns=['col1','col2','col3'])>>> df2 = pd.DataFrame(np.random.randn(7,3),columns=['col1','col2','col3'])>>> df1.reindex_like(df2)       col1      col2      col30 -0.756714 -0.091658 -0.5049861  1.893803  1.590151  0.9665582  0.697154 -0.443745  0.6090323 -0.595683 -1.611600 -2.4223224  1.793706  0.295709  0.2428925  1.114546 -0.300147  1.2640976 -0.263837  0.049844  0.752559>>> df1       col1      col2      col30 -0.756714 -0.091658 -0.5049861  1.893803  1.590151  0.9665582  0.697154 -0.443745  0.6090323 -0.595683 -1.611600 -2.4223224  1.793706  0.295709  0.2428925  1.114546 -0.300147  1.2640976 -0.263837  0.049844  0.7525597 -1.915601  0.018745  0.3607218 -1.115134 -0.931634 -0.6781499  0.189500 -0.029492 -0.251232>>> df2       col1      col2      col30 -0.482238 -0.608154 -0.5151181  1.165773 -1.895489  0.2928122  1.544655  2.309652 -0.1482523 -0.525149 -0.130873 -0.6805884 -0.598798  0.226034 -0.1813565 -0.162839  0.421589  1.0061416 -1.700640 -1.080568 -0.124622>>> df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])>>> df2 = pd.DataFrame(np.random.randn(2,3),columns=['col1','col2','col3'])>>> df1       col1      col2      col30  2.264839 -1.782313  0.1692591  0.063059  0.609468 -0.1648462  0.990665  0.626967 -1.0492203  0.520214 -1.924748  0.6158854 -2.077392  0.460057  0.2875205  0.104610 -1.183429 -0.056515>>> df2       col1      col2      col30  1.922189  1.049908 -2.0522551 -0.396077  1.304607  0.042303>>> df2.reindex_like(df1)       col1      col2      col30  1.922189  1.049908 -2.0522551 -0.396077  1.304607  0.0423032       NaN       NaN       NaN3       NaN       NaN       NaN4       NaN       NaN       NaN5       NaN       NaN       NaN>>> df2.reindex_like(df1, method='ffill')       col1      col2      col30  1.922189  1.049908 -2.0522551 -0.396077  1.304607  0.0423032 -0.396077  1.304607  0.0423033 -0.396077  1.304607  0.0423034 -0.396077  1.304607  0.0423035 -0.396077  1.304607  0.042303>>> df2.reindex_like(df1, method='ffill', limit=1)       col1      col2      col30  1.922189  1.049908 -2.0522551 -0.396077  1.304607  0.0423032 -0.396077  1.304607  0.0423033       NaN       NaN       NaN4       NaN       NaN       NaN5       NaN       NaN       NaN>>> df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])>>> df1       col1      col2      col30 -0.604454 -1.397064  0.0178981 -1.378410 -0.565153 -0.1154772  0.826487 -0.271986 -0.9384853  1.048116  0.729759 -0.7292454  0.301114 -0.246754  1.3336595 -1.245321  0.525994  1.421357>>> df1.rename(index={0:'apple', 1:'banana', 2:'durian'}, columns={'col1':'c1', 'col2':'c2'})              c1        c2      col3apple  -0.604454 -1.397064  0.017898banana -1.378410 -0.565153 -0.115477durian  0.826487 -0.271986 -0.9384853       1.048116  0.729759 -0.7292454       0.301114 -0.246754  1.3336595      -1.245321  0.525994  1.421357

Iteration

>>> N = 20    # N is not defined in the original post; any positive integer works here
>>> df = pd.DataFrame({
...     'A': pd.date_range(start='2016-01-01', periods=N, freq='D'),
...     'x': np.linspace(0, stop=N-1, num=N),
...     'y': np.random.rand(N),
...     'C': np.random.choice(['Low', 'Medium', 'High'], N).tolist(),
...     'D': np.random.normal(100, 10, size=(N)).tolist()
... })
>>> for col in df:
...     print(col)
...
A
C
D
x
y
>>> df = pd.DataFrame(np.random.randn(4, 3), columns=['col1', 'col2', 'col3'])
>>> df
       col1      col2      col3
0 -2.271038 -1.602723  1.584343
1  0.273968 -1.342534  0.823925
2  0.263824  0.338190 -0.140383
3 -0.576569 -0.851937  0.307322
>>> for key, value in df.iteritems():
...     print(key, value)
...
col1 0   -2.271038
1    0.273968
2    0.263824
3   -0.576569
Name: col1, dtype: float64
col2 0   -1.602723
1   -1.342534
2    0.338190
3   -0.851937
Name: col2, dtype: float64
col3 0    1.584343
1    0.823925
2   -0.140383
3    0.307322
Name: col3, dtype: float64
>>> for key, value in df.iterrows():
...     print(key, value)
...
0 col1   -2.271038
col2   -1.602723
col3    1.584343
Name: 0, dtype: float64
1 col1    0.273968
col2   -1.342534
col3    0.823925
Name: 1, dtype: float64
2 col1    0.263824
col2    0.338190
col3   -0.140383
Name: 2, dtype: float64
3 col1   -0.576569
col2   -0.851937
col3    0.307322
Name: 3, dtype: float64
>>> for row in df.itertuples():
...     print(row)
...
Pandas(Index=0, col1=-2.271038005486614, col2=-1.6027226893844446, col3=1.584342563261442)
Pandas(Index=1, col1=0.27396784975971405, col2=-1.3425340660159573, col3=0.8239254569906167)
Pandas(Index=2, col1=0.2638238925215782, col2=0.33819045967946854, col3=-0.14038253638843384)
Pandas(Index=3, col1=-0.5765685283132795, col2=-0.8519366611767535, col3=0.3073223920714522)

Sorting

>>> unsorted_df = pd.DataFrame(np.random.randn(10, 2), index=[1,4,6,2,3,5,9,8,0,7], columns=['col2', 'col1'])
>>> unsorted_df
       col2      col1
1 -0.581885 -0.273574
4 -1.154169  0.229695
6 -0.856151  0.190755
2 -0.866647 -0.154189
3  0.591896  0.697754
5 -0.185702  0.771035
9  0.268549  1.068972
8  0.437690  0.996680
0 -0.615079 -0.940361
7  1.036863  0.227676
>>> sorted_df = unsorted_df.sort_index()
>>> sorted_df
       col2      col1
0 -0.615079 -0.940361
1 -0.581885 -0.273574
2 -0.866647 -0.154189
3  0.591896  0.697754
4 -1.154169  0.229695
5 -0.185702  0.771035
6 -0.856151  0.190755
7  1.036863  0.227676
8  0.437690  0.996680
9  0.268549  1.068972
>>> unsorted_df.sort_index(ascending=False)
       col2      col1
9  0.268549  1.068972
8  0.437690  0.996680
7  1.036863  0.227676
6 -0.856151  0.190755
5 -0.185702  0.771035
4 -1.154169  0.229695
3  0.591896  0.697754
2 -0.866647 -0.154189
1 -0.581885 -0.273574
0 -0.615079 -0.940361
>>> unsorted_df.sort_index(axis=1)
       col1      col2
1 -0.273574 -0.581885
4  0.229695 -1.154169
6  0.190755 -0.856151
2 -0.154189 -0.866647
3  0.697754  0.591896
5  0.771035 -0.185702
9  1.068972  0.268549
8  0.996680  0.437690
0 -0.940361 -0.615079
7  0.227676  1.036863
# Sort by values
>>> unsorted_df = pd.DataFrame({'col1': [2, 1, 1, 1], 'col2': [1, 3, 2, 4]})
>>> unsorted_df
   col1  col2
0     2     1
1     1     3
2     1     2
3     1     4
>>> unsorted_df.sort_values(by='col1')
   col1  col2
1     1     3
2     1     2
3     1     4
0     2     1
>>> unsorted_df.sort_values(by='col2')
   col1  col2
0     2     1
2     1     2
1     1     3
3     1     4

String handling

>>> s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234', 'SteveMinsu'])>>> s0             Tom1    William Rick2            John3         Alber@t4             NaN5            12346      SteveMinsudtype: object>>> s.str.lower()0             tom1    william rick2            john3         alber@t4             NaN5            12346      steveminsudtype: object>>> s.str.upper()0             TOM1    WILLIAM RICK2            JOHN3         ALBER@T4             NaN5            12346      STEVEMINSUdtype: object>>> s.str.len()0     3.01    12.02     4.03     7.04     NaN5     4.06    10.0dtype: float64>>> s.str.strip()0             Tom1    William Rick2            John3         Alber@t4             NaN5            12346      SteveMinsudtype: object>>> s.str.split(' ')0              [Tom]1    [William, Rick]2             [John]3          [Alber@t]4                NaN5             [1234]6       [SteveMinsu]dtype: object>>> s.str.cat(sep=' <=> ')'Tom <=> William Rick <=> John <=> Alber@t <=> 1234 <=> SteveMinsu'>>> s.str.get_dummies() # 返回具有one-hot编码的数据帧   1234  Alber@t  John  SteveMinsu  Tom  William Rick0     0        0     0           0    1             01     0        0     0           0    0             12     0        0     1           0    0             03     0        1     0           0    0             04     0        0     0           0    0             05     1        0     0           0    0             06     0        0     0           1    0             0>>> s0             Tom1    William Rick2            John3         Alber@t4             NaN5            12346      SteveMinsudtype: object>>> s.str.contains(' ')0    False1     True2    False3    False4      NaN5    False6    Falsedtype: object>>> s.str.replace('@', '$')0             Tom1    William Rick2            John3         Alber$t4             NaN5            12346      SteveMinsudtype: object>>> s.str.repeat(2)0                      TomTom1    William RickWilliam Rick2                    JohnJohn3              Alber@tAlber@t4                         NaN5                    123412346        SteveMinsuSteveMinsudtype: object>>> s.str.count('m')0    1.01    1.02    0.03    0.04    NaN5    0.06    0.0dtype: float64>>> s.str.startswith('T')0     True1    False2    False3    False4      NaN5    False6    Falsedtype: object>>> s.str.endswith('t')0    False1    False2    False3     True4      NaN5    False6    Falsedtype: object>>> s.str.find('e')0   -1.01   -1.02   -1.03    3.04    NaN5   -1.06    2.0dtype: float64>>> s.str.findall('e')0        []1        []2        []3       [e]4       NaN5        []6    [e, e]dtype: object>>> s.str.swapcase()0             tOM1    wILLIAM rICK2            jOHN3         aLBER@T4             NaN5            12346      sTEVEmINSUdtype: object>>> s.str.islower()0    False1    False2    False3    False4      NaN5    False6    Falsedtype: object>>> s.str.isupper()0    False1    False2    False3    False4      NaN5    False6    Falsedtype: object>>> s.str.isnumeric()0    False1    False2    False3    False4      NaN5     True6    Falsedtype: object

Customizing options

>>> pd.get_option('display.max_rows')
60
>>> pd.get_option('display.max_columns')
20
>>> pd.set_option('display.max_rows', 80)
>>> pd.set_option('display.max_columns', 32)
>>> pd.get_option('display.max_rows')
80
>>> pd.get_option('display.max_columns')
32
>>> pd.reset_option('display.max_rows')
>>> pd.reset_option('display.max_columns')
>>> pd.get_option('display.max_rows')
60
>>> pd.get_option('display.max_columns')
20
>>> pd.describe_option('display.max_rows')
display.max_rows : int
    If max_rows is exceeded, switch to truncate view. Depending on
    `large_repr`, objects are either centrally truncated or printed as
    a summary view. 'None' value means unlimited.
    In case python/IPython is running in a terminal and `large_repr`
    equals 'truncate' this can be set to 0 and pandas will auto-detect
    the height of the terminal and print a truncated object which fits
    the screen height. The IPython notebook, IPython qtconsole, or
    IDLE do not run in a terminal and hence it is not possible to do
    correct auto-detection.
    [default: 60] [currently: 60]
>>> pd.get_option('display.expand_frame_repr')
True
>>> pd.get_option('display.max_colwidth')
50
>>> pd.get_option('display.precision')
6

Indexing and selecting data

>>> df = pd.DataFrame(np.random.randn(8, 4), index=['a','b','c','d','e','f','g','h'], columns=['A', 'B', 'C', 'D'])
>>> df
          A         B         C         D
a  0.455194  0.916801 -0.759962 -0.294460
b -0.822786  1.264703 -0.647666  0.165351
c -2.177918 -0.618071  0.589963  0.507757
d -0.962701  0.036545 -0.723706  0.475886
e  0.311827 -1.633546 -1.521727 -0.894687
f  0.853150  1.930345 -0.404521 -0.048161
g  0.070669  0.694768 -1.240335 -0.129544
h  0.943131  0.209278 -0.841272 -0.475150
>>> df.loc[:, 'A']
a    0.455194
b   -0.822786
c   -2.177918
d   -0.962701
e    0.311827
f    0.853150
g    0.070669
h    0.943131
Name: A, dtype: float64
>>> df.loc[:, 'B']
a    0.916801
b    1.264703
c   -0.618071
d    0.036545
e   -1.633546
f    1.930345
g    0.694768
h    0.209278
Name: B, dtype: float64
>>> df.loc[:, ['B', 'C']]
          B         C
a  0.916801 -0.759962
b  1.264703 -0.647666
c -0.618071  0.589963
d  0.036545 -0.723706
e -1.633546 -1.521727
f  1.930345 -0.404521
g  0.694768 -1.240335
h  0.209278 -0.841272
>>> df.loc[['a', 'b', 'f', 'h'], ['B', 'C']]
          B         C
a  0.916801 -0.759962
b  1.264703 -0.647666
f  1.930345 -0.404521
h  0.209278 -0.841272
>>> df.loc['a':'d']
          A         B         C         D
a  0.455194  0.916801 -0.759962 -0.294460
b -0.822786  1.264703 -0.647666  0.165351
c -2.177918 -0.618071  0.589963  0.507757
d -0.962701  0.036545 -0.723706  0.475886
>>> df.loc['a'] > 0
A     True
B     True
C    False
D    False
Name: a, dtype: bool
-----------------------------------------------------------------------------
>>> df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
>>> df
          A         B         C         D
0 -1.110185 -0.158177  1.246568 -0.436214
1  0.457405  0.661588  0.859702 -0.069158
2  1.327901  0.050303 -0.661015 -1.182793
3 -2.835100  0.115826 -0.100475 -1.428037
4  0.341655 -0.099397  0.897803  1.424128
5  0.416073 -1.304400  1.090493 -0.168708
6 -0.271063  0.152941  0.290858 -0.172220
7 -0.602104 -0.011596  1.252995  1.198668
>>> df.loc[:4]
          A         B         C         D
0 -1.110185 -0.158177  1.246568 -0.436214
1  0.457405  0.661588  0.859702 -0.069158
2  1.327901  0.050303 -0.661015 -1.182793
3 -2.835100  0.115826 -0.100475 -1.428037
4  0.341655 -0.099397  0.897803  1.424128
>>> df.iloc[:4]
          A         B         C         D
0 -1.110185 -0.158177  1.246568 -0.436214
1  0.457405  0.661588  0.859702 -0.069158
2  1.327901  0.050303 -0.661015 -1.182793
3 -2.835100  0.115826 -0.100475 -1.428037
>>> df.iloc[1:5, 2:4]
          C         D
1  0.859702 -0.069158
2 -0.661015 -1.182793
3 -0.100475 -1.428037
4  0.897803  1.424128
>>> df.iloc[[1, 3, 5], [1, 3]]
          B         D
1  0.661588 -0.069158
3  0.115826 -1.428037
5 -1.304400 -0.168708
>>> df.iloc[1:3, :]
          A         B         C         D
1  0.457405  0.661588  0.859702 -0.069158
2  1.327901  0.050303 -0.661015 -1.182793
>>> df.iloc[:, 1:3]
          B         C
0 -0.158177  1.246568
1  0.661588  0.859702
2  0.050303 -0.661015
3  0.115826 -0.100475
4 -0.099397  0.897803
5 -1.304400  1.090493
6  0.152941  0.290858
7 -0.011596  1.252995
-----------------------------------------------------------------------------
>>> df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
>>> df.ix[:4]
          A         B         C         D
0 -0.634628 -0.898134  1.614606  1.694535
1  0.128295  1.064419  0.989894 -0.667118
2 -1.291854  1.586323 -0.858613 -0.444279
3 -0.299627 -1.419140 -0.136024  0.982406
4  0.140499 -1.414128  1.247044  1.510659
>>> df
          A         B         C         D
0 -0.634628 -0.898134  1.614606  1.694535
1  0.128295  1.064419  0.989894 -0.667118
2 -1.291854  1.586323 -0.858613 -0.444279
3 -0.299627 -1.419140 -0.136024  0.982406
4  0.140499 -1.414128  1.247044  1.510659
5 -1.713574  0.812285  1.236715 -1.787382
6 -0.235543 -0.607812  0.667951  0.104684
7  0.171486 -0.213113  0.877173  0.259829
>>> df.ix[:, 'A']
0   -0.634628
1    0.128295
2   -1.291854
3   -0.299627
4    0.140499
5   -1.713574
6   -0.235543
7    0.171486
Name: A, dtype: float64
>>> df['A']
0   -0.634628
1    0.128295
2   -1.291854
3   -0.299627
4    0.140499
5   -1.713574
6   -0.235543
7    0.171486
Name: A, dtype: float64
>>> df[['A', 'B']]
          A         B
0 -0.634628 -0.898134
1  0.128295  1.064419
2 -1.291854  1.586323
3 -0.299627 -1.419140
4  0.140499 -1.414128
5 -1.713574  0.812285
6 -0.235543 -0.607812
7  0.171486 -0.213113
>>> df[2:2]
Empty DataFrame
Columns: [A, B, C, D]
Index: []
>>> df.A
0   -0.634628
1    0.128295
2   -1.291854
3   -0.299627
4    0.140499
5   -1.713574
6   -0.235543
7    0.171486
Name: A, dtype: float64

Statistical functions

>>> import pandas as pd>>> import numpy as np>>> s = pd.Series([1, 2, 3, 4, 5, 4])>>> df = pd.DataFrame(np.random.randn(5, 2))>>> s.pct_change()0         NaN1    1.0000002    0.5000003    0.3333334    0.2500005   -0.200000dtype: float64>>> df.pct_change()           0         10        NaN       NaN1  -0.951021  0.1086962 -13.608067 -3.0551533  -1.858201 -1.6795084   1.505558  0.888156>>> s1 = pd.Series(np.random.randn(10))>>> s2 = pd.Series(np.random.randn(10))>>> s10   -1.1835731   -0.0738262   -0.3778423   -0.5527754   -0.5496705   -0.4594346    0.0576067   -2.0242478   -1.2290489    0.560924dtype: float64>>> s20    0.1636771    0.0350142   -2.2010623    0.7724164   -0.1029945    0.0958636    0.6680347   -0.0277608    0.3044279    0.473143dtype: float64>>> s1.cov(s2)0.034697580870772814>>> s1.corr(s2)0.056473055220765137>>> s = pd.Series(np.random.randn(5), index=list('abcde'))>>> sa    1.864940b   -0.912708c    2.362840d   -0.886362e   -1.605373dtype: float64>>> s = pd.Series(np.random.randn(5), index=list('abcde'))>>> sa    0.106402b    0.206073c   -1.894801d   -0.648935e   -0.085949dtype: float64>>> s['d']-0.6489348524171517>>> s['d'] = s['b']>>> sa    0.106402b    0.206073c   -1.894801d    0.206073e   -0.085949dtype: float64>>> s.rank()a    3.0b    4.5c    1.0d    4.5e    2.0dtype: float64>>> s.rank(ascending=False)a    3.0b    1.5c    5.0d    1.5e    4.0dtype: float64

Window functions

>>> df = pd.DataFrame(np.random.randn(10, 4),index = pd.date_range('1/1/2020', periods=10),columns = ['A', 'B', 'C', 'D'])>>> df                   A         B         C         D2020-01-01  0.781472  0.095355  1.133631  2.1087412020-01-02  0.387977  2.193452  0.310114  0.4404752020-01-03 -0.790628 -1.005678 -0.606390 -1.9779002020-01-04 -1.314656  0.166715  0.520299 -0.4401952020-01-05  1.548197  1.037625 -1.340270 -0.8123762020-01-06 -0.484046 -0.134035  1.878507  0.7207182020-01-07  0.777485  0.439963  0.183439  0.0824672020-01-08 -0.071771  0.226365  1.043121 -1.0165712020-01-09  0.972678 -1.225537  0.524640 -0.9448672020-01-10 -1.187026 -0.919100 -0.552567 -0.428873>>> df.rolling(window=3).mean()                   A         B         C         D2020-01-01       NaN       NaN       NaN       NaN2020-01-02       NaN       NaN       NaN       NaN2020-01-03  0.126274  0.427710  0.279118  0.1904392020-01-04 -0.572436  0.451496  0.074674 -0.6592072020-01-05 -0.185696  0.066221 -0.475454 -1.0768242020-01-06 -0.083502  0.356768  0.352845 -0.1772842020-01-07  0.613879  0.447851  0.240559 -0.0030632020-01-08  0.073889  0.177431  1.035022 -0.0711292020-01-09  0.559464 -0.186403  0.583733 -0.6263242020-01-10 -0.095373 -0.639424  0.338398 -0.796770>>> df['A'][:3]2020-01-01    0.7814722020-01-02    0.3879772020-01-03   -0.790628Freq: D, Name: A, dtype: float64>>> df['A'][:3].mean()0.12627354579916547>>> df['A'][1:4].mean()-0.572435731528301>>> df.expanding(min_periods=3).mean()                   A         B         C         D2020-01-01       NaN       NaN       NaN       NaN2020-01-02       NaN       NaN       NaN       NaN2020-01-03  0.126274  0.427710  0.279118  0.1904392020-01-04 -0.233959  0.362461  0.339413  0.0327802020-01-05  0.122472  0.497494  0.003477 -0.1362512020-01-06  0.021386  0.392239  0.315982  0.0065772020-01-07  0.129400  0.399057  0.297047  0.0174192020-01-08  0.104254  0.377470  0.390306 -0.1118302020-01-09  0.200745  0.199358  0.405232 -0.2043902020-01-10  0.061968  0.087512  0.309452 -0.226838>>> df['A'][:4].mean()-0.23395888332354467>>> df.ewm(com=0.5).mean()                   A         B         C         D2020-01-01  0.781472  0.095355  1.133631  2.1087412020-01-02  0.486351  1.668928  0.515993  0.8575422020-01-03 -0.397712 -0.182722 -0.261042 -1.1054572020-01-04 -1.016649  0.053148  0.266363 -0.6564052020-01-05  0.700314  0.712178 -0.809152 -0.7608152020-01-06 -0.090344  0.147261  0.985082  0.2282302020-01-07  0.488473  0.342485  0.450409  0.1310112020-01-08  0.114920  0.265060  0.845611 -0.6341602020-01-09  0.686788 -0.728722  0.631619 -0.8413092020-01-10 -0.562442 -0.855643 -0.157851 -0.566347

Aggregation

>>> df = pd.DataFrame(np.random.randn(10, 4),      index = pd.date_range('1/1/2019', periods=10),      columns = ['A', 'B', 'C', 'D'])>>> df                   A         B         C         D2019-01-01  1.191096 -0.840763 -0.163492 -0.1264262019-01-02  0.059333  0.080026  0.446028 -1.0630842019-01-03 -1.630649 -0.193603  0.306099 -0.4827892019-01-04 -0.443204 -0.003095 -0.692355  0.3760972019-01-05 -1.156775 -0.055062  0.347634 -0.5467842019-01-06 -1.703101 -0.839752 -0.201902 -0.5175682019-01-07 -0.872392 -0.186374 -0.754593  0.5821402019-01-08 -1.552499 -0.158559  0.002236 -0.8123262019-01-09  1.205949  0.517547  1.005842  1.2475082019-01-10 -0.903890 -1.144626  0.952863  0.636728>>> df.rolling(window=3, min_periods=1)Rolling [window=3,min_periods=1,center=False,axis=0]>>> r = df.rolling(window=3, min_periods=1)>>> r.aggregate(np.sum)                   A         B         C         D2019-01-01  1.191096 -0.840763 -0.163492 -0.1264262019-01-02  1.250429 -0.760737  0.282536 -1.1895112019-01-03 -0.380220 -0.954340  0.588635 -1.6722992019-01-04 -2.014520 -0.116672  0.059772 -1.1697762019-01-05 -3.230628 -0.251760 -0.038621 -0.6534752019-01-06 -3.303081 -0.897908 -0.546623 -0.6882552019-01-07 -3.732268 -1.081187 -0.608861 -0.4822122019-01-08 -4.127992 -1.184684 -0.954260 -0.7477542019-01-09 -1.218942  0.172614  0.253485  1.0173222019-01-10 -1.250440 -0.785638  1.960941  1.071911>>> df['A'][1:4].sum()-2.014520142409461>>> df['A'][2:5].sum()-3.230628185736731>>> r['A'].aggregate(np.sum)2019-01-01    1.1910962019-01-02    1.2504292019-01-03   -0.3802202019-01-04   -2.0145202019-01-05   -3.2306282019-01-06   -3.3030812019-01-07   -3.7322682019-01-08   -4.1279922019-01-09   -1.2189422019-01-10   -1.250440Freq: D, Name: A, dtype: float64>>> r[['A', 'B']].aggregate(np.sum)                   A         B2019-01-01  1.191096 -0.8407632019-01-02  1.250429 -0.7607372019-01-03 -0.380220 -0.9543402019-01-04 -2.014520 -0.1166722019-01-05 -3.230628 -0.2517602019-01-06 -3.303081 -0.8979082019-01-07 -3.732268 -1.0811872019-01-08 -4.127992 -1.1846842019-01-09 -1.218942  0.1726142019-01-10 -1.250440 -0.785638>>> r['A'].aggregate([np.sum, np.mean])                 sum      mean2019-01-01  1.191096  1.1910962019-01-02  1.250429  0.6252142019-01-03 -0.380220 -0.1267402019-01-04 -2.014520 -0.6715072019-01-05 -3.230628 -1.0768762019-01-06 -3.303081 -1.1010272019-01-07 -3.732268 -1.2440892019-01-08 -4.127992 -1.3759972019-01-09 -1.218942 -0.4063142019-01-10 -1.250440 -0.416813>>> r[['A', 'B']].aggregate([np.sum, np.mean])                   A                   B                           sum      mean       sum      mean2019-01-01  1.191096  1.191096 -0.840763 -0.8407632019-01-02  1.250429  0.625214 -0.760737 -0.3803692019-01-03 -0.380220 -0.126740 -0.954340 -0.3181132019-01-04 -2.014520 -0.671507 -0.116672 -0.0388912019-01-05 -3.230628 -1.076876 -0.251760 -0.0839202019-01-06 -3.303081 -1.101027 -0.897908 -0.2993032019-01-07 -3.732268 -1.244089 -1.081187 -0.3603962019-01-08 -4.127992 -1.375997 -1.184684 -0.3948952019-01-09 -1.218942 -0.406314  0.172614  0.0575382019-01-10 -1.250440 -0.416813 -0.785638 -0.261879>>> r.aggregate({'A': np.sum, 'B': np.mean})                   A         B2019-01-01  1.191096 -0.8407632019-01-02  1.250429 -0.3803692019-01-03 -0.380220 -0.3181132019-01-04 -2.014520 -0.0388912019-01-05 -3.230628 -0.0839202019-01-06 -3.303081 -0.2993032019-01-07 -3.732268 -0.3603962019-01-08 -4.127992 -0.3948952019-01-09 -1.218942  0.0575382019-01-10 -1.250440 -0.261879

Handling missing data

>>> df        one       two     threea  0.117532  0.514862 -1.887277b       NaN       NaN       NaNc  1.570501 -0.430070  0.344063d       NaN       NaN       NaNe  0.271454 -0.062202 -0.881098f  0.638614  0.362068  0.574669g       NaN       NaN       NaNh  1.639276 -0.018913 -2.221013>>> df['one'].isnull()a    Falseb     Truec    Falsed     Truee    Falsef    Falseg     Trueh    FalseName: one, dtype: bool>>> df['one'].notnull()a     Trueb    Falsec     Trued    Falsee     Truef     Trueg    Falseh     TrueName: one, dtype: bool>>> df['one'].sum()4.237377230983332>>> df = pd.DataFrame(index=[0, 1, 2, 3, 4, 5], columns=['one', 'two'])>>> df   one  two0  NaN  NaN1  NaN  NaN2  NaN  NaN3  NaN  NaN4  NaN  NaN5  NaN  NaN>>> df['one'].sum()0>>> df.fillna(0)   one  two0    0    01    0    02    0    03    0    04    0    05    0    0>>> df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f','h'],columns=['one', 'two', 'three'])>>> df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])>>> df        one       two     threea  1.128926  0.236342  1.026271b       NaN       NaN       NaNc  0.460429  0.563981  0.550005d       NaN       NaN       NaNe  0.005599  0.369428 -0.445125f  0.489827  1.544342  2.316409g       NaN       NaN       NaNh  0.127825  0.664470  0.744669>>> df.fillna(method='pad')        one       two     threea  1.128926  0.236342  1.026271b  1.128926  0.236342  1.026271c  0.460429  0.563981  0.550005d  0.460429  0.563981  0.550005e  0.005599  0.369428 -0.445125f  0.489827  1.544342  2.316409g  0.489827  1.544342  2.316409h  0.127825  0.664470  0.744669>>> df.fillna(method='backfill')        one       two     threea  1.128926  0.236342  1.026271b  0.460429  0.563981  0.550005c  0.460429  0.563981  0.550005d  0.005599  0.369428 -0.445125e  0.005599  0.369428 -0.445125f  0.489827  1.544342  2.316409g  0.127825  0.664470  0.744669h  0.127825  0.664470  0.744669>>> df.dropna()        one       two     threea  1.128926  0.236342  1.026271c  0.460429  0.563981  0.550005e  0.005599  0.369428 -0.445125f  0.489827  1.544342  2.316409h  0.127825  0.664470  0.744669>>> df = pd.DataFrame({'one':[10,20,30,40,50,2000],'two':[1000,0,30,40,50,60]})>>> df    one   two0    10  10001    20     02    30    303    40    404    50    505  2000    60>>> df.replace({1000:10, 2000:60})   one  two0   10   101   20    02   30   303   40   404   50   505   60   60

Grouping

>>> ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
...          'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
...          'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
...          'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
...          'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
>>> df = pd.DataFrame(ipl_data)
>>> df
    Points  Rank    Team  Year
0      876     1  Riders  2014
1      789     2  Riders  2015
2      863     2  Devils  2014
3      673     3  Devils  2015
4      741     3   Kings  2014
5      812     4   kings  2015
6      756     1   Kings  2016
7      788     1   Kings  2017
8      694     2  Riders  2016
9      701     4  Royals  2014
10     804     1  Royals  2015
11     690     2  Riders  2017
>>> df.groupby('Team')
>>> df.groupby('Team').groups
{'Devils': Int64Index([2, 3], dtype='int64'), 'Kings': Int64Index([4, 6, 7], dtype='int64'), 'Riders': Int64Index([0, 1, 8, 11], dtype='int64'), 'Royals': Int64Index([9, 10], dtype='int64'), 'kings': Int64Index([5], dtype='int64')}
>>> df.groupby(['Team', 'Year']).groups
{('Devils', 2014): Int64Index([2], dtype='int64'), ('Devils', 2015): Int64Index([3], dtype='int64'), ('Kings', 2014): Int64Index([4], dtype='int64'), ('Kings', 2016): Int64Index([6], dtype='int64'), ('Kings', 2017): Int64Index([7], dtype='int64'), ('Riders', 2014): Int64Index([0], dtype='int64'), ('Riders', 2015): Int64Index([1], dtype='int64'), ('Riders', 2016): Int64Index([8], dtype='int64'), ('Riders', 2017): Int64Index([11], dtype='int64'), ('Royals', 2014): Int64Index([9], dtype='int64'), ('Royals', 2015): Int64Index([10], dtype='int64'), ('kings', 2015): Int64Index([5], dtype='int64')}
>>> df
    Points  Rank    Team  Year
0      876     1  Riders  2014
1      789     2  Riders  2015
2      863     2  Devils  2014
3      673     3  Devils  2015
4      741     3   Kings  2014
5      812     4   kings  2015
6      756     1   Kings  2016
7      788     1   Kings  2017
8      694     2  Riders  2016
9      701     4  Royals  2014
10     804     1  Royals  2015
11     690     2  Riders  2017
>>> grouped = df.groupby('Year')
>>> for name, group in grouped:
...     print(name)
...     print(group)
...
2014
   Points  Rank    Team  Year
0     876     1  Riders  2014
2     863     2  Devils  2014
4     741     3   Kings  2014
9     701     4  Royals  2014
2015
    Points  Rank    Team  Year
1      789     2  Riders  2015
3      673     3  Devils  2015
5      812     4   kings  2015
10     804     1  Royals  2015
2016
   Points  Rank    Team  Year
6     756     1   Kings  2016
8     694     2  Riders  2016
2017
    Points  Rank    Team  Year
7      788     1   Kings  2017
11     690     2  Riders  2017
>>> df.groupby('Year').get_group(2014)
   Points  Rank    Team  Year
0     876     1  Riders  2014
2     863     2  Devils  2014
4     741     3   Kings  2014
9     701     4  Royals  2014
>>> df.groupby('Year')['Points'].agg(np.mean)
Year
2014    795.25
2015    769.50
2016    725.00
2017    739.00
Name: Points, dtype: float64
>>> grouped = df.groupby('Team')
>>> for name, group in grouped:
...     print(name)
...     print(group)
...
Devils
   Points  Rank    Team  Year
2     863     2  Devils  2014
3     673     3  Devils  2015
Kings
   Points  Rank   Team  Year
4     741     3  Kings  2014
6     756     1  Kings  2016
7     788     1  Kings  2017
Riders
    Points  Rank    Team  Year
0      876     1  Riders  2014
1      789     2  Riders  2015
8      694     2  Riders  2016
11     690     2  Riders  2017
Royals
    Points  Rank    Team  Year
9      701     4  Royals  2014
10     804     1  Royals  2015
kings
   Points  Rank   Team  Year
5     812     4  kings  2015
>>> df.groupby('Team').agg(np.size)
        Points  Rank  Year
Team
Devils       2     2     2
Kings        3     3     3
Riders       4     4     4
Royals       2     2     2
kings        1     1     1
>>> df.groupby('Team')['Points'].agg([np.sum, np.mean, np.std])
         sum        mean         std
Team
Devils  1536  768.000000  134.350288
Kings   2285  761.666667   24.006943
Riders  3049  762.250000   88.567771
Royals  1505  752.500000   72.831998
kings    812  812.000000         NaN
>>> df.groupby('Team').transform(lambda x: (x - x.mean()) / x.std() * 10)
       Points       Rank       Year
0   12.843272 -15.000000 -11.618950
1    3.020286   5.000000  -3.872983
2    7.071068  -7.071068  -7.071068
3   -7.071068   7.071068   7.071068
4   -8.608621  11.547005 -10.910895
5         NaN        NaN        NaN
6   -2.360428  -5.773503   2.182179
7   10.969049  -5.773503   8.728716
8   -7.705963   5.000000   3.872983
9   -7.071068   7.071068  -7.071068
10   7.071068  -7.071068   7.071068
11  -8.157595   5.000000  11.618950
>>> df.groupby('Team').filter(lambda x: len(x) >= 3)   # teams with at least 3 appearances
    Points  Rank    Team  Year
0      876     1  Riders  2014
1      789     2  Riders  2015
4      741     3   Kings  2014
6      756     1   Kings  2016
7      788     1   Kings  2017
8      694     2  Riders  2016
11     690     2  Riders  2017

Merging and joining DataFrames

>>> left = pd.DataFrame({         'id':[1,2,3,4,5],         'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],         'subject_id':['sub1','sub2','sub4','sub6','sub5']})>>> left     Name  id subject_id0    Alex   1       sub11     Amy   2       sub22   Allen   3       sub43   Alice   4       sub64  Ayoung   5       sub5>>> right = pd.DataFrame(         {'id':[1,2,3,4,5],         'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],         'subject_id':['sub2','sub4','sub3','sub6','sub5']})>>> right    Name  id subject_id0  Billy   1       sub21  Brian   2       sub42   Bran   3       sub33  Bryce   4       sub64  Betty   5       sub5>>> pd.merge(left, right, on='id')   Name_x  id subject_id_x Name_y subject_id_y0    Alex   1         sub1  Billy         sub21     Amy   2         sub2  Brian         sub42   Allen   3         sub4   Bran         sub33   Alice   4         sub6  Bryce         sub64  Ayoung   5         sub5  Betty         sub5>>> pd.merge(left, right, on=['id', 'subject_id'])   Name_x  id subject_id Name_y0   Alice   4       sub6  Bryce1  Ayoung   5       sub5  Betty>>> pd.merge(left, right, on='subject_id', how='left')   Name_x  id_x subject_id Name_y  id_y0    Alex     1       sub1    NaN   NaN1     Amy     2       sub2  Billy   1.02   Allen     3       sub4  Brian   2.03   Alice     4       sub6  Bryce   4.04  Ayoung     5       sub5  Betty   5.0>>> pd.merge(left, right, on='subject_id', how='right')   Name_x  id_x subject_id Name_y  id_y0     Amy   2.0       sub2  Billy     11   Allen   3.0       sub4  Brian     22   Alice   4.0       sub6  Bryce     43  Ayoung   5.0       sub5  Betty     54     NaN   NaN       sub3   Bran     3>>> pd.merge(left, right, on='subject_id', how='outer')   Name_x  id_x subject_id Name_y  id_y0    Alex   1.0       sub1    NaN   NaN1     Amy   2.0       sub2  Billy   1.02   Allen   3.0       sub4  Brian   2.03   Alice   4.0       sub6  Bryce   4.04  Ayoung   5.0       sub5  Betty   5.05     NaN   NaN       sub3   Bran   3.0>>> pd.merge(left, right, on='subject_id', how='inner')   Name_x  id_x subject_id Name_y  id_y0     Amy     2       sub2  Billy     11   Allen     3       sub4  Brian     22   Alice     4       sub6  Bryce     43  Ayoung     5       sub5  Betty     5

Concatenation

>>> import numpy as np
>>> import pandas as pd
>>> one = pd.DataFrame({
...          'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
...          'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5'],
...          'Marks_scored': [98, 90, 87, 69, 78]},
...          index=[1, 2, 3, 4, 5])
>>> two = pd.DataFrame({
...          'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
...          'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5'],
...          'Marks_scored': [89, 80, 79, 97, 88]},
...          index=[1, 2, 3, 4, 5])
>>> one
   Marks_scored    Name subject_id
1            98    Alex       sub1
2            90     Amy       sub2
3            87   Allen       sub4
4            69   Alice       sub6
5            78  Ayoung       sub5
>>> two
   Marks_scored   Name subject_id
1            89  Billy       sub2
2            80  Brian       sub4
3            79   Bran       sub3
4            97  Bryce       sub6
5            88  Betty       sub5
>>> rs = pd.concat([one, two])
>>> rs
   Marks_scored    Name subject_id
1            98    Alex       sub1
2            90     Amy       sub2
3            87   Allen       sub4
4            69   Alice       sub6
5            78  Ayoung       sub5
1            89   Billy       sub2
2            80   Brian       sub4
3            79    Bran       sub3
4            97   Bryce       sub6
5            88   Betty       sub5
>>> rs = pd.concat([one, two], keys=['x', 'y'])
>>> rs
     Marks_scored    Name subject_id
x 1            98    Alex       sub1
  2            90     Amy       sub2
  3            87   Allen       sub4
  4            69   Alice       sub6
  5            78  Ayoung       sub5
y 1            89   Billy       sub2
  2            80   Brian       sub4
  3            79    Bran       sub3
  4            97   Bryce       sub6
  5            88   Betty       sub5
>>> rs = pd.concat([one, two], keys=['x', 'y'], ignore_index=True)
>>> rs
   Marks_scored    Name subject_id
0            98    Alex       sub1
1            90     Amy       sub2
2            87   Allen       sub4
3            69   Alice       sub6
4            78  Ayoung       sub5
5            89   Billy       sub2
6            80   Brian       sub4
7            79    Bran       sub3
8            97   Bryce       sub6
9            88   Betty       sub5
>>> pd.concat([one, two], axis=1)
   Marks_scored    Name subject_id  Marks_scored   Name subject_id
1            98    Alex       sub1            89  Billy       sub2
2            90     Amy       sub2            80  Brian       sub4
3            87   Allen       sub4            79   Bran       sub3
4            69   Alice       sub6            97  Bryce       sub6
5            78  Ayoung       sub5            88  Betty       sub5
>>> pd.merge(one, two)
Empty DataFrame
Columns: [Marks_scored, Name, subject_id]
Index: []
>>> pd.merge(one, two, on='subject_id')
   Marks_scored_x  Name_x subject_id  Marks_scored_y Name_y
0              90     Amy       sub2              89  Billy
1              87   Allen       sub4              80  Brian
2              69   Alice       sub6              97  Bryce
3              78  Ayoung       sub5              88  Betty
>>> pd.concat([one, two], axis=1, join='outer')
   Marks_scored    Name subject_id  Marks_scored   Name subject_id
1            98    Alex       sub1            89  Billy       sub2
2            90     Amy       sub2            80  Brian       sub4
3            87   Allen       sub4            79   Bran       sub3
4            69   Alice       sub6            97  Bryce       sub6
5            78  Ayoung       sub5            88  Betty       sub5
>>> one.append(two)
   Marks_scored    Name subject_id
1            98    Alex       sub1
2            90     Amy       sub2
3            87   Allen       sub4
4            69   Alice       sub6
5            78  Ayoung       sub5
1            89   Billy       sub2
2            80   Brian       sub4
3            79    Bran       sub3
4            97   Bryce       sub6
5            88   Betty       sub5
>>> one
   Marks_scored    Name subject_id
1            98    Alex       sub1
2            90     Amy       sub2
3            87   Allen       sub4
4            69   Alice       sub6
5            78  Ayoung       sub5
>>> one.append([two, one, two])
   Marks_scored    Name subject_id
1            98    Alex       sub1
2            90     Amy       sub2
3            87   Allen       sub4
4            69   Alice       sub6
5            78  Ayoung       sub5
1            89   Billy       sub2
2            80   Brian       sub4
3            79    Bran       sub3
4            97   Bryce       sub6
5            88   Betty       sub5
1            98    Alex       sub1
2            90     Amy       sub2
3            87   Allen       sub4
4            69   Alice       sub6
5            78  Ayoung       sub5
1            89   Billy       sub2
2            80   Brian       sub4
3            79    Bran       sub3
4            97   Bryce       sub6
5            88   Betty       sub5

Time series

>>> pd.date_range("12:00", "23:59", freq="30min").time
array([datetime.time(12, 0), datetime.time(12, 30), datetime.time(13, 0),
       datetime.time(13, 30), datetime.time(14, 0), datetime.time(14, 30),
       datetime.time(15, 0), datetime.time(15, 30), datetime.time(16, 0),
       datetime.time(16, 30), datetime.time(17, 0), datetime.time(17, 30),
       datetime.time(18, 0), datetime.time(18, 30), datetime.time(19, 0),
       datetime.time(19, 30), datetime.time(20, 0), datetime.time(20, 30),
       datetime.time(21, 0), datetime.time(21, 30), datetime.time(22, 0),
       datetime.time(22, 30), datetime.time(23, 0), datetime.time(23, 30)],
      dtype=object)
>>> pd.date_range("12:00", "23:59", freq="H").time
array([datetime.time(12, 0), datetime.time(13, 0), datetime.time(14, 0),
       datetime.time(15, 0), datetime.time(16, 0), datetime.time(17, 0),
       datetime.time(18, 0), datetime.time(19, 0), datetime.time(20, 0),
       datetime.time(21, 0), datetime.time(22, 0), datetime.time(23, 0)],
      dtype=object)
>>> pd.to_datetime(pd.Series(['Jul 31, 2009', '2019-10-10', None]))
0   2009-07-31
1   2019-10-10
2          NaT
dtype: datetime64[ns]
>>> pd.date_range("12:00", "23:59", freq="2H").time
array([datetime.time(12, 0), datetime.time(14, 0), datetime.time(16, 0),
       datetime.time(18, 0), datetime.time(20, 0), datetime.time(22, 0)],
      dtype=object)
>>> pd.date_range("12:00", "23:59", freq="20min").time
array([datetime.time(12, 0), datetime.time(12, 20), datetime.time(12, 40),
       datetime.time(13, 0), datetime.time(13, 20), datetime.time(13, 40),
       datetime.time(14, 0), datetime.time(14, 20), datetime.time(14, 40),
       datetime.time(15, 0), datetime.time(15, 20), datetime.time(15, 40),
       datetime.time(16, 0), datetime.time(16, 20), datetime.time(16, 40),
       datetime.time(17, 0), datetime.time(17, 20), datetime.time(17, 40),
       datetime.time(18, 0), datetime.time(18, 20), datetime.time(18, 40),
       datetime.time(19, 0), datetime.time(19, 20), datetime.time(19, 40),
       datetime.time(20, 0), datetime.time(20, 20), datetime.time(20, 40),
       datetime.time(21, 0), datetime.time(21, 20), datetime.time(21, 40),
       datetime.time(22, 0), datetime.time(22, 20), datetime.time(22, 40),
       datetime.time(23, 0), datetime.time(23, 20), datetime.time(23, 40)],
      dtype=object)
>>> pd.date_range('2011/11/11', periods=5)
DatetimeIndex(['2011-11-11', '2011-11-12', '2011-11-13', '2011-11-14',
               '2011-11-15'],
              dtype='datetime64[ns]', freq='D')
>>> pd.date_range('2011/11/11', periods=5, freq='M')
DatetimeIndex(['2011-11-30', '2011-12-31', '2012-01-31', '2012-02-29',
               '2012-03-31'],
              dtype='datetime64[ns]', freq='M')
>>> pd.bdate_range('2011/11/11', periods=5)
DatetimeIndex(['2011-11-11', '2011-11-14', '2011-11-15', '2011-11-16',
               '2011-11-17'],
              dtype='datetime64[ns]', freq='B')
>>> pd.bdate_range('2020/7/7', periods=5)
DatetimeIndex(['2020-07-07', '2020-07-08', '2020-07-09', '2020-07-10',
               '2020-07-13'],
              dtype='datetime64[ns]', freq='B')
>>> pd.Timedelta('2 days 2 hours 15 minutes 30 seconds')
Timedelta('2 days 02:15:30')
>>> pd.Timedelta(6, unit='h')
Timedelta('0 days 06:00:00')
>>> pd.Timedelta(days=2)
Timedelta('2 days 00:00:00')
>>> s = pd.Series(pd.date_range('2012-1-1', periods=3, freq='D'))
>>> td = pd.Series([pd.Timedelta(days=i) for i in range(3)])
>>> s
0   2012-01-01
1   2012-01-02
2   2012-01-03
dtype: datetime64[ns]
>>> td
0   0 days
1   1 days
2   2 days
dtype: timedelta64[ns]
>>> df = pd.DataFrame(dict(A=s, B=td))
>>> df
           A      B
0 2012-01-01 0 days
1 2012-01-02 1 days
2 2012-01-03 2 days
>>> df['C'] = df['A'] + df['B']
>>> df
           A      B          C
0 2012-01-01 0 days 2012-01-01
1 2012-01-02 1 days 2012-01-03
2 2012-01-03 2 days 2012-01-05
>>> df['D'] = df['C'] + df['B']
>>> df
           A      B          C          D
0 2012-01-01 0 days 2012-01-01 2012-01-01
1 2012-01-02 1 days 2012-01-03 2012-01-04
2 2012-01-03 2 days 2012-01-05 2012-01-07
>>> df['D'] = df['C'] - df['B']
>>> df
           A      B          C          D
0 2012-01-01 0 days 2012-01-01 2012-01-01
1 2012-01-02 1 days 2012-01-03 2012-01-02
2 2012-01-03 2 days 2012-01-05 2012-01-03

Categorical data

>>> s = pd.Series(['a', 'b', 'c', 'a'])>>> s0    a1    b2    c3    adtype: object>>> s = pd.Series(['a', 'b', 'c', 'a'], dtype='category')>>> >>> s0    a1    b2    c3    adtype: categoryCategories (3, object): [a, b, c]>>> pd.Categorical(['a', 'b', 'c', 'a'])[a, b, c, a]Categories (3, object): [a, b, c]>>> pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c', 'd'], ['c', 'b', 'a'], ordered=True)[a, b, c, a, b, c, NaN]Categories (3, object): [c < b < a]>>> pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c', 'd'], ordered=True)[a, b, c, a, b, c, d]Categories (4, object): [a < b < c < d]>>> cat = pd.Categorical(['a', 'c', 'c', np.nan], categories=['b', 'a', 'c'])>>> cat[a, c, c, NaN]Categories (3, object): [b, a, c]>>> df = pd.DataFrame({'cat':cat, 's':['a', 'c', 'c', np.nan]})>>> df   cat    s0    a    a1    c    c2    c    c3  NaN  NaN>>> df.describe()       cat  scount    3  3unique   2  2top      c  cfreq     2  2>>> df['cat'].describe()count     3unique    2top       cfreq      2Name: cat, dtype: object>>> cat.categoriesIndex(['b', 'a', 'c'], dtype='object')>>> cat.orderedFalse>>> s = pd.Series(['a', 'b', 'c', 'a'], dtype='category')>>> s.cat.categories = ["Group %s" % g for g in s.cat.categories]>>> s.cat.categoriesIndex(['Group a', 'Group b', 'Group c'], dtype='object')>>> s0    Group a1    Group b2    Group c3    Group adtype: categoryCategories (3, object): [Group a, Group b, Group c]>>> s = s.cat.add_categories([4])>>> s.cat.categoriesIndex(['Group a', 'Group b', 'Group c', 4], dtype='object')>>> s = pd.Series(['a', 'b', 'c', 'a'], dtype='category')>>> s.cat.remove_categories('a')0    NaN1      b2      c3    NaNdtype: categoryCategories (2, object): [b, c]>>> cat = pd.Series([1,2,3]).astype("category", categories=[1,2,3], ordered=True)>>> cat1 = pd.Series([2,2,2]).astype("category", categories=[1,2,3], ordered=True)>>> cat > cat10    False1    False2     Truedtype: bool

Plotting (plotting calls matplotlib's plot() function)

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.randn(10, 4), index=pd.date_range('2018/12/18',
...    periods=10), columns=list('ABCD'))
>>> df
                   A         B         C         D
2018-12-18  0.584660 -1.210248 -0.003870 -0.877461
2018-12-19  0.688778  0.589648  1.785282  1.086173
2018-12-20  0.414437  1.162100  0.604292 -0.262146
2018-12-21  0.783176  0.663370  0.101690  0.671157
2018-12-22 -0.428488  0.209358  0.413947  0.953112
2018-12-23 -1.260652 -0.491451 -2.068729 -0.451798
2018-12-24  0.758227 -0.081586  2.525143 -1.299484
2018-12-25 -1.137259 -1.564864  0.369936  0.164803
2018-12-26  0.254126  1.003725 -0.603132 -0.078201
2018-12-27 -1.891129  0.381141  2.001805 -1.193164
>>> df.plot()
>>> import matplotlib.pyplot as plt
>>> plt.show()

>>> df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
>>> df
          a         b         c         d
0  0.223895  0.934126  0.776715  0.272412
1  0.054357  0.044082  0.890138  0.692204
2  0.603401  0.522171  0.483764  0.277524
3  0.851305  0.649713  0.559654  0.691692
4  0.453478  0.247265  0.464155  0.733652
5  0.257328  0.348949  0.598477  0.938713
6  0.073403  0.776095  0.235249  0.177406
7  0.255504  0.414014  0.275384  0.243649
8  0.704180  0.320717  0.143664  0.097838
9  0.770821  0.903387  0.507908  0.510510
>>> df.plot.bar()
>>> df.plot.bar(stacked=True)
>>> df.plot.barh(stacked=True)
>>> plt.show()

>>> df = pd.DataFrame(np.random.rand(10, 5), columns=['A', 'B', 'C', 'D', 'E'])
>>> df
          A         B         C         D         E
0  0.128757  0.113134  0.960562  0.232801  0.015381
1  0.832828  0.826641  0.668275  0.411818  0.100598
2  0.896566  0.870508  0.649730  0.272994  0.193057
3  0.993245  0.795654  0.401693  0.062322  0.181763
4  0.907362  0.512175  0.226137  0.362590  0.919171
5  0.189321  0.117297  0.863777  0.957350  0.680298
6  0.377515  0.003974  0.769770  0.483371  0.477716
7  0.857534  0.307702  0.466231  0.141965  0.804760
8  0.449878  0.676380  0.666671  0.960456  0.082041
9  0.841580  0.872631  0.127302  0.386110  0.163839
>>> df.plot.box()
>>> df.plot.area()
>>> df.plot.scatter(x='A', y='B')
>>> plt.show()

>>> df = pd.DataFrame(3 * np.random.rand(4), index=['a', 'b', 'c', 'd'], columns=['x'])
>>> df
          x
a  0.842394
b  0.773727
c  1.644047
d  2.104003
>>> df.plot.pie(subplots=True)
array([<matplotlib.axes._subplots.AxesSubplot object at 0x...>], dtype=object)
>>> plt.show()

IO tools

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.read_csv('temp.csv')
>>> df
   S.No    Name  Age       City  Salary
0     1     Tom   28    Toronto   20000
1     2     Lee   32   HongKong    3000
2     3  Steven   43   Bay area    8300
3     4     Ram   38  Hyderabad    3900
>>> df = pd.read_csv('temp.csv', index_col=['S.No'])
>>> df
        Name  Age       City  Salary
S.No
1        Tom   28    Toronto   20000
2        Lee   32   HongKong    3000
3     Steven   43   Bay area    8300
4        Ram   38  Hyderabad    3900
>>> pd.read_csv('temp.csv', dtype={'Salary': np.float64})
   S.No    Name  Age       City   Salary
0     1     Tom   28    Toronto  20000.0
1     2     Lee   32   HongKong   3000.0
2     3  Steven   43   Bay area   8300.0
3     4     Ram   38  Hyderabad   3900.0
>>> pd.read_csv('temp.csv', names=['a', 'b', 'c', 'd', 'e'])
      a       b    c          d       e
0  S.No    Name  Age       City  Salary
1     1     Tom   28    Toronto   20000
2     2     Lee   32   HongKong    3000
3     3  Steven   43   Bay area    8300
4     4     Ram   38  Hyderabad    3900
>>> pd.read_csv('temp.csv', names=['a', 'b', 'c', 'd', 'e'], header=0)
   a       b   c          d      e
0  1     Tom  28    Toronto  20000
1  2     Lee  32   HongKong   3000
2  3  Steven  43   Bay area   8300
3  4     Ram  38  Hyderabad   3900
>>> pd.read_csv('temp.csv', skiprows=2)
   2     Lee  32   HongKong  3000
0  3  Steven  43   Bay area  8300
1  4     Ram  38  Hyderabad  3900

Sparse data

>>> ts = pd.Series(np.random.randn(10))
>>> ts[2:-2] = np.nan
>>> sts = ts.to_sparse()
>>> sts
0   -0.585138
1    0.893830
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8    0.135530
9   -1.513085
dtype: float64
BlockIndex
Block locations: array([0, 8])
Block lengths: array([2, 2])
>>> ts
0   -0.585138
1    0.893830
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8    0.135530
9   -1.513085
dtype: float64
>>> sts.to_dense()
0   -0.585138
1    0.893830
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8    0.135530
9   -1.513085
dtype: float64
>>> sts
0   -0.585138
1    0.893830
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8    0.135530
9   -1.513085
dtype: float64
BlockIndex
Block locations: array([0, 8])
Block lengths: array([2, 2])
>>> sts.density
0.4

Caveats

>>> if pd.Series([False, True, False]):
...     print('hello')
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib64/python3.6/site-packages/pandas/core/generic.py", line 1330, in __nonzero__
    f"The truth value of a {type(self).__name__} is ambiguous. "
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> if pd.Series([False, True, False]).any():
...     print('hello')
...
hello
>>> print(pd.Series([False]).bool())
False
>>> print(pd.Series([True]).bool())
True
>>> s = pd.Series(range(5))
>>> s
0    0
1    1
2    2
3    3
4    4
dtype: int64
>>> s == 3
0    False
1    False
2    False
3     True
4    False
dtype: bool
>>> s = pd.Series(['a', 'b', 'c'])
>>> s
0    a
1    b
2    c
dtype: object
>>> s.isin(['a', 'c', 'e'])
0     True
1    False
2     True
dtype: bool

SQL-like operations

# tips.csv
total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
>>> import pandas as pd
>>> url = 'tips.csv'
>>> tips = pd.read_csv(url)
>>> tips.head()
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
>>> tips[['total_bill', 'tip', 'smoker', 'time']].head(3)
   total_bill   tip smoker    time
0       16.99  1.01     No  Dinner
1       10.34  1.66     No  Dinner
2       21.01  3.50     No  Dinner
>>> tips[tips['time'] == 'Dinner'].head(3)
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
>>> tips
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
>>> tips.groupby('sex')
>>> tips.groupby('sex').size()
sex
Female    2
Male      3
dtype: int64