Pandas is an open-source, BSD-licensed Python library that provides high-performance, easy-to-use data structures and data-analysis tools for the Python programming language. It is widely used in academic and commercial fields such as finance, economics, and statistics. Its data structures convert easily to and from common file formats such as csv, excel, json, and html.
# create a Series from a dict
>>> import pandas as pd
>>> data = {'a': 0, 'b': 1, 'c': 2}
>>> data
{'a': 0, 'b': 1, 'c': 2}
>>> pd.Series(data)
a    0
b    1
c    2
dtype: int64
# create a Series from a list
>>> data = [18, 30, 25, 40]
>>> user_age = pd.Series(data)
>>> user_age
0    18
1    30
2    25
3    40
dtype: int64
>>> user_age.index = ["Tom", "Bob", "Mary", "James"]
>>> user_age
Tom      18
Bob      30
Mary     25
James    40
dtype: int64
>>> user_age.name = 'user_age'
>>> user_age
Tom      18
Bob      30
Mary     25
James    40
Name: user_age, dtype: int64
>>> user_age.index.name = 'name'
>>> user_age
name
Tom      18
Bob      30
Mary     25
James    40
Name: user_age, dtype: int64
>>> user_age.ndim
1
>>> user_age.size
4
>>> user_age.index
Index(['Tom', 'Bob', 'Mary', 'James'], dtype='object')
>>> user_age.values
array([18, 30, 25, 40])
>>> user_age.head(2)
name
Tom    18
Bob    30
Name: user_age, dtype: int64
>>> user_age.tail(2)
name
Mary     25
James    40
Name: user_age, dtype: int64
>>> user_age[-2:]
name
Mary     25
James    40
Name: user_age, dtype: int64
>>> user_age.axes
[Index(['Tom', 'Bob', 'Mary', 'James'], dtype='object', name='name')]
>>> user_age.empty
False
>>> user_age + 1
Tom      19
Bob      31
Mary     26
James    41
dtype: int64
>>> user_age.append(pd.Series({'Looking': 26}))
Tom        18
Bob        30
Mary       25
James      40
Looking    26
dtype: int64
>>> user_age
Tom      18
Bob      30
Mary     25
James    40
dtype: int64
>>> user_age = user_age.append(pd.Series({'Looking': 26}))
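Note that `Series.append` (used above) was deprecated and removed in pandas 2.0. A minimal sketch of the modern equivalent, `pd.concat`, on the same toy data:

```python
import pandas as pd

user_age = pd.Series({"Tom": 18, "Bob": 30, "Mary": 25, "James": 40})
# pd.concat replaces the removed Series.append in pandas >= 2.0
user_age = pd.concat([user_age, pd.Series({"Looking": 26})])
print(user_age)
```

Like `append`, `pd.concat` returns a new Series rather than modifying in place, so the result must be reassigned.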
>>> user_age.at['Bob'] = 31
>>> user_age
Tom        18
Bob        31
Mary       25
James      40
Looking    26
dtype: int64
>>> user_age.drop('Looking')
Tom      18
Bob      31
Mary     25
James    40
dtype: int64
>>> user_age = user_age.drop('Looking')
>>> user_age
Tom      18
Bob      31
Mary     25
James    40
dtype: int64
>>> user_age.sort_values()
Tom      18
Mary     25
Bob      31
James    40
dtype: int64
# create a DataFrame from a dict
>>> import pandas as pd
>>> index = pd.Index(data=["Tom", "Bob", "Mary", "James"], name="name")
>>> data = {
...     "age": [18, 30, 25, 40],
...     "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"]
... }
>>> user_info = pd.DataFrame(data=data, index=index)
>>> user_info
       age       city
name
Tom     18    BeiJing
Bob     30   ShangHai
Mary    25  GuangZhou
James   40   ShenZhen
# create a DataFrame from a list of lists
>>> data = [[18, "BeiJing"],
...         [30, "ShangHai"],
...         [25, "GuangZhou"],
...         [40, "ShenZhen"]]
>>> columns = ["age", "city"]
>>> index = pd.Index(data=["Tom", "Bob", "Mary", "James"], name="name")
>>> user_info = pd.DataFrame(data=data, index=index, columns=columns)
>>> user_info
       age       city
name
Tom     18    BeiJing
Bob     30   ShangHai
Mary    25  GuangZhou
James   40   ShenZhen
>>> user_info
       age       city
name
Tom     18    BeiJing
Bob     30   ShangHai
Mary    25  GuangZhou
James   40   ShenZhen
>>> user_info.info()
Index: 4 entries, Tom to James
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   age     4 non-null      int64
 1   city    4 non-null      object
dtypes: int64(1), object(1)
memory usage: 96.0+ bytes
>>> user_info.shape
(4, 2)
>>> user_info.ndim
2
>>> user_info.size
8
>>> user_info.empty
False
>>> user_info.index
Index(['Tom', 'Bob', 'Mary', 'James'], dtype='object', name='name')
>>> user_info.columns
Index(['age', 'city'], dtype='object')
>>> user_info
       age       city
name
Tom     18    BeiJing
Bob     30   ShangHai
Mary    25  GuangZhou
James   40   ShenZhen
# get a row
>>> user_info.loc['Bob']
age           30
city    ShangHai
Name: Bob, dtype: object
# get multiple rows
>>> user_info[1:3]
      age       city
name
Bob    30   ShangHai
Mary   25  GuangZhou
# get a column
>>> user_info['city']
name
Tom        BeiJing
Bob       ShangHai
Mary     GuangZhou
James     ShenZhen
Name: city, dtype: object
# get multiple columns
>>> user_info[['city', 'age']]
            city  age
name
Tom      BeiJing   18
Bob     ShangHai   30
Mary   GuangZhou   25
James   ShenZhen   40
>>> user_info.head(2)
      age      city
name
Tom    18   BeiJing
Bob    30  ShangHai
>>> user_info.tail(2)
       age       city
name
Mary    25  GuangZhou
James   40   ShenZhen
>>> user_info.values
array([[18, 'BeiJing'],
       [30, 'ShangHai'],
       [25, 'GuangZhou'],
       [40, 'ShenZhen']], dtype=object)
>>> user_info.T
name      Tom       Bob       Mary     James
age        18        30         25        40
city  BeiJing  ShangHai  GuangZhou  ShenZhen
# add a column (DataFrame operations are column-oriented by default)
>>> user_info.loc[:, 'sex'] = ["male", "male", "female", "male"]
# or: user_info['sex'] = ["male", "male", "female", "male"]
>>> user_info
       age       city     sex
name
Tom     18    BeiJing    male
Bob     30   ShangHai    male
Mary    25  GuangZhou  female
James   40   ShenZhen    male
# add a row
>>> user_info.loc['Lookng', :] = [26, 'Xian', 'male']
>>> user_info
         age       city     sex
name
Tom     18.0    BeiJing    male
Bob     30.0   ShangHai    male
Mary    25.0  GuangZhou  female
James   40.0   ShenZhen    male
Lookng  26.0       Xian    male
>>> user_info['age'] = user_info['age'].astype("int")
>>> user_info
        age       city     sex
name
Tom      18    BeiJing    male
Bob      30   ShangHai    male
Mary     25  GuangZhou  female
James    40   ShenZhen    male
Lookng   26       Xian    male
The difference between drop and pop: pop returns the removed part, while drop returns what remains after the removal. In addition, pop modifies the DataFrame itself in place.
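That difference can be checked directly; a minimal sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({"age": [18, 30], "city": ["BeiJing", "ShangHai"]})

remaining = df.drop("city", axis=1)   # returns the remainder; df itself is untouched
assert "city" in df.columns

removed = df.pop("city")              # returns the removed column; df is modified in place
assert "city" not in df.columns
assert list(removed) == ["BeiJing", "ShangHai"]
```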
>>> user_info
       age       city
name
Tom     18    BeiJing
Bob     30   ShangHai
Mary    25  GuangZhou
James   40   ShenZhen
>>> user_info.drop('city', axis=1)
       age
name
Tom     18
Bob     30
Mary    25
James   40
>>> user_info.drop('Tom', axis=0)
       age       city
name
Bob     30   ShangHai
Mary    25  GuangZhou
James   40   ShenZhen
>>> user_info.pop('city')
name
Tom        BeiJing
Bob       ShangHai
Mary     GuangZhou
James     ShenZhen
Name: city, dtype: object
>>> user_info
       age
name
Tom     18
Bob     30
Mary    25
James   40
Of course, you can read and write csv files directly with the standard csv module; we won't cover that here. This section looks at how to read and write csv with pandas. Excel works much the same way, via .to_excel() and pd.read_excel().
>>> import pandas as pd
>>> data = {
...     "age": [18, 30, 25, 40],
...     "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"]
... }
>>> index = pd.Index(data=["Tom", "Bob", "Mary", "James"], name="name")
>>> user_info = pd.DataFrame(data=data, index=index)
>>> user_info
       age       city
name
Tom     18    BeiJing
Bob     30   ShangHai
Mary    25  GuangZhou
James   40   ShenZhen
>>> user_info.to_csv('test.csv')
>>> pd.read_csv('test.csv')
    name  age       city
0    Tom   18    BeiJing
1    Bob   30   ShangHai
2   Mary   25  GuangZhou
3  James   40   ShenZhen
# with index=False, the index is no longer written to the csv
>>> user_info.to_csv('test.csv', index=False)
>>> pd.read_csv('test.csv')
   age       city
0   18    BeiJing
1   30   ShangHai
2   25  GuangZhou
3   40   ShenZhen
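If you do keep the index in the file, `read_csv`'s `index_col` parameter restores it as the index instead of leaving it as an ordinary column. A small sketch (file name assumed):

```python
import pandas as pd

index = pd.Index(["Tom", "Bob"], name="name")
user_info = pd.DataFrame({"age": [18, 30]}, index=index)
user_info.to_csv("roundtrip.csv")

# index_col picks the named column back up as the index on read
restored = pd.read_csv("roundtrip.csv", index_col="name")
assert restored.index.name == "name"
assert restored.loc["Bob", "age"] == 30
```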
>>> import pandas as pd
>>> import numpy as np
>>> d = {'Name': pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack']),
...      'Age': pd.Series([25,26,25,23,30,29,23]),
...      'Rating': pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
>>> df = pd.DataFrame(d)
>>> df
   Age   Name  Rating
0   25    Tom    4.23
1   26  James    3.24
2   25  Ricky    3.98
3   23    Vin    2.56
4   30  Steve    3.20
5   29  Minsu    4.60
6   23   Jack    3.80
>>> df.sum()    # equivalent to df.sum(0) and df.sum(axis=0)
Age                                   181
Name      TomJamesRickyVinSteveMinsuJack
Rating                              25.61
dtype: object
>>> df.sum(1)   # same as df.sum(axis=1): sum across each row
0    29.23
1    29.24
2    28.98
3    25.56
4    33.20
5    33.60
6    26.80
dtype: float64
>>> df.mean()
Age       25.857143
Rating     3.658571
dtype: float64
>>> df.mean(1)
0    14.615
1    14.620
2    14.490
3    12.780
4    16.600
5    16.800
6    13.400
dtype: float64
>>> df.std()
Age       2.734262
Rating    0.698628
dtype: float64
>>> df.std(1)
0    14.686608
1    16.093750
2    14.863385
3    14.453263
4    18.950462
5    17.253405
6    13.576450
dtype: float64

Data summary

>>> d = {'Name': pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack',
...                         'Lee','David','Gasper','Betina','Andres']),
...      'Age': pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
...      'Rating': pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}
>>> df = pd.DataFrame(d)
>>> df
    Age    Name  Rating
0    25     Tom    4.23
1    26   James    3.24
2    25   Ricky    3.98
3    23     Vin    2.56
4    30   Steve    3.20
5    29   Minsu    4.60
6    23    Jack    3.80
7    34     Lee    3.78
8    40   David    2.98
9    30  Gasper    4.80
10   51  Betina    4.10
11   46  Andres    3.65
>>> df.describe()
             Age     Rating
count  12.000000  12.000000
mean   31.833333   3.743333
std     9.232682   0.661628
min    23.000000   2.560000
25%    25.000000   3.230000
50%    29.500000   3.790000
75%    35.500000   4.132500
max    51.000000   4.800000
>>> df.describe(include=['object'])
          Name
count       12
unique      12
top     Andres
freq         1
>>> df.describe(include='all')
              Age    Name     Rating
count   12.000000      12  12.000000
unique        NaN      12        NaN
top           NaN  Andres        NaN
freq          NaN       1        NaN
mean    31.833333     NaN   3.743333
std      9.232682     NaN   0.661628
min     23.000000     NaN   2.560000
25%     25.000000     NaN   3.230000
50%     29.500000     NaN   3.790000
75%     35.500000     NaN   4.132500
max     51.000000     NaN   4.800000
>>> df.count()
Age       12
Name      12
Rating    12
dtype: int64
>>> df.median()
Age       29.50
Rating     3.79
dtype: float64
>>> df.min()
Age         23
Name    Andres
Rating    2.56
dtype: object
>>> df.max()
Age      51
Name    Vin
Rating  4.8
dtype: object
>>> df
       col1      col2      col3
0 -1.330380 -0.033002 -1.088904
1  0.505186  0.561735  0.493753
2 -0.370248 -0.064979 -0.899265
3 -0.292870  0.070235 -0.439714
4  0.842472 -1.402209 -1.085047
>>> df + 2
       col1      col2      col3
0  0.669620  1.966998  0.911096
1  2.505186  2.561735  2.493753
2  1.629752  1.935021  1.100735
3  1.707130  2.070235  1.560286
4  2.842472  0.597791  0.914953
>>> df.apply(np.mean)
col1   -0.129168
col2   -0.173644
col3   -0.603835
dtype: float64
>>> df.mean()
col1   -0.129168
col2   -0.173644
col3   -0.603835
dtype: float64
>>> df.apply(np.sum)
col1   -0.645839
col2   -0.868221
col3   -3.019177
dtype: float64
>>> df.sum()
col1   -0.645839
col2   -0.868221
col3   -3.019177
dtype: float64
>>> N = 20   # sample size; not defined in the original snippet
>>> df = pd.DataFrame({
...     'A': pd.date_range(start='2016-01-01', periods=N, freq='D'),
...     'x': np.linspace(0, stop=N-1, num=N),
...     'y': np.random.rand(N),
...     'C': np.random.choice(['Low','Medium','High'], N).tolist(),
...     'D': np.random.normal(100, 10, size=(N)).tolist()
... })
>>> df_reindexed = df.reindex(index=[0, 2, 5], columns=['A', 'C', 'B'])
>>> df_reindexed
           A       C   B
0 2016-01-01  Medium NaN
2 2016-01-03    High NaN
5 2016-01-06  Medium NaN
>>> df1 = pd.DataFrame(np.random.randn(10, 3), columns=['col1','col2','col3'])
>>> df2 = pd.DataFrame(np.random.randn(7, 3), columns=['col1','col2','col3'])
>>> df1.reindex_like(df2)
       col1      col2      col3
0 -0.756714 -0.091658 -0.504986
1  1.893803  1.590151  0.966558
2  0.697154 -0.443745  0.609032
3 -0.595683 -1.611600 -2.422322
4  1.793706  0.295709  0.242892
5  1.114546 -0.300147  1.264097
6 -0.263837  0.049844  0.752559
>>> df1
       col1      col2      col3
0 -0.756714 -0.091658 -0.504986
1  1.893803  1.590151  0.966558
2  0.697154 -0.443745  0.609032
3 -0.595683 -1.611600 -2.422322
4  1.793706  0.295709  0.242892
5  1.114546 -0.300147  1.264097
6 -0.263837  0.049844  0.752559
7 -1.915601  0.018745  0.360721
8 -1.115134 -0.931634 -0.678149
9  0.189500 -0.029492 -0.251232
>>> df2
       col1      col2      col3
0 -0.482238 -0.608154 -0.515118
1  1.165773 -1.895489  0.292812
2  1.544655  2.309652 -0.148252
3 -0.525149 -0.130873 -0.680588
4 -0.598798  0.226034 -0.181356
5 -0.162839  0.421589  1.006141
6 -1.700640 -1.080568 -0.124622
>>> df1 = pd.DataFrame(np.random.randn(6, 3), columns=['col1','col2','col3'])
>>> df2 = pd.DataFrame(np.random.randn(2, 3), columns=['col1','col2','col3'])
>>> df1
       col1      col2      col3
0  2.264839 -1.782313  0.169259
1  0.063059  0.609468 -0.164846
2  0.990665  0.626967 -1.049220
3  0.520214 -1.924748  0.615885
4 -2.077392  0.460057  0.287520
5  0.104610 -1.183429 -0.056515
>>> df2
       col1      col2      col3
0  1.922189  1.049908 -2.052255
1 -0.396077  1.304607  0.042303
>>> df2.reindex_like(df1)
       col1      col2      col3
0  1.922189  1.049908 -2.052255
1 -0.396077  1.304607  0.042303
2       NaN       NaN       NaN
3       NaN       NaN       NaN
4       NaN       NaN       NaN
5       NaN       NaN       NaN
>>> df2.reindex_like(df1, method='ffill')
       col1      col2      col3
0  1.922189  1.049908 -2.052255
1 -0.396077  1.304607  0.042303
2 -0.396077  1.304607  0.042303
3 -0.396077  1.304607  0.042303
4 -0.396077  1.304607  0.042303
5 -0.396077  1.304607  0.042303
>>> df2.reindex_like(df1, method='ffill', limit=1)
       col1      col2      col3
0  1.922189  1.049908 -2.052255
1 -0.396077  1.304607  0.042303
2 -0.396077  1.304607  0.042303
3       NaN       NaN       NaN
4       NaN       NaN       NaN
5       NaN       NaN       NaN
>>> df1 = pd.DataFrame(np.random.randn(6, 3), columns=['col1','col2','col3'])
>>> df1
       col1      col2      col3
0 -0.604454 -1.397064  0.017898
1 -1.378410 -0.565153 -0.115477
2  0.826487 -0.271986 -0.938485
3  1.048116  0.729759 -0.729245
4  0.301114 -0.246754  1.333659
5 -1.245321  0.525994  1.421357
>>> df1.rename(index={0: 'apple', 1: 'banana', 2: 'durian'}, columns={'col1': 'c1', 'col2': 'c2'})
               c1        c2      col3
apple   -0.604454 -1.397064  0.017898
banana  -1.378410 -0.565153 -0.115477
durian   0.826487 -0.271986 -0.938485
3        1.048116  0.729759 -0.729245
4        0.301114 -0.246754  1.333659
5       -1.245321  0.525994  1.421357
>>> df = pd.DataFrame({
...     'A': pd.date_range(start='2016-01-01', periods=N, freq='D'),
...     'x': np.linspace(0, stop=N-1, num=N),
...     'y': np.random.rand(N),
...     'C': np.random.choice(['Low','Medium','High'], N).tolist(),
...     'D': np.random.normal(100, 10, size=(N)).tolist()
... })
>>> for col in df:
...     print(col)
A
C
D
x
y
>>> df = pd.DataFrame(np.random.randn(4, 3), columns=['col1', 'col2', 'col3'])
>>> df
       col1      col2      col3
0 -2.271038 -1.602723  1.584343
1  0.273968 -1.342534  0.823925
2  0.263824  0.338190 -0.140383
3 -0.576569 -0.851937  0.307322
>>> for key, value in df.iteritems():   # iteritems() was removed in pandas 2.0; use df.items()
...     print(key, value)
col1 0   -2.271038
1    0.273968
2    0.263824
3   -0.576569
Name: col1, dtype: float64
col2 0   -1.602723
1   -1.342534
2    0.338190
3   -0.851937
Name: col2, dtype: float64
col3 0    1.584343
1    0.823925
2   -0.140383
3    0.307322
Name: col3, dtype: float64
>>> for key, value in df.iterrows():
...     print(key, value)
0 col1   -2.271038
col2   -1.602723
col3    1.584343
Name: 0, dtype: float64
1 col1    0.273968
col2   -1.342534
col3    0.823925
Name: 1, dtype: float64
2 col1    0.263824
col2    0.338190
col3   -0.140383
Name: 2, dtype: float64
3 col1   -0.576569
col2   -0.851937
col3    0.307322
Name: 3, dtype: float64
>>> for row in df.itertuples():
...     print(row)
Pandas(Index=0, col1=-2.271038005486614, col2=-1.6027226893844446, col3=1.584342563261442)
Pandas(Index=1, col1=0.27396784975971405, col2=-1.3425340660159573, col3=0.8239254569906167)
Pandas(Index=2, col1=0.2638238925215782, col2=0.33819045967946854, col3=-0.14038253638843384)
Pandas(Index=3, col1=-0.5765685283132795, col2=-0.8519366611767535, col3=0.3073223920714522)
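The rows yielded by `itertuples` are namedtuples, so fields can be read by attribute, including the index via `.Index`. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2]}, index=["a", "b"])
# each row is a namedtuple: Pandas(Index='a', col1=1), ...
rows = [(row.Index, row.col1) for row in df.itertuples()]
print(rows)  # → [('a', 1), ('b', 2)]
```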
>>> unsorted_df = pd.DataFrame(np.random.randn(10, 2),
...                            index=[1, 4, 6, 2, 3, 5, 9, 8, 0, 7],
...                            columns=['col2', 'col1'])
>>> unsorted_df
       col2      col1
1 -0.581885 -0.273574
4 -1.154169  0.229695
6 -0.856151  0.190755
2 -0.866647 -0.154189
3  0.591896  0.697754
5 -0.185702  0.771035
9  0.268549  1.068972
8  0.437690  0.996680
0 -0.615079 -0.940361
7  1.036863  0.227676
>>> sorted_df = unsorted_df.sort_index()
>>> sorted_df
       col2      col1
0 -0.615079 -0.940361
1 -0.581885 -0.273574
2 -0.866647 -0.154189
3  0.591896  0.697754
4 -1.154169  0.229695
5 -0.185702  0.771035
6 -0.856151  0.190755
7  1.036863  0.227676
8  0.437690  0.996680
9  0.268549  1.068972
>>> unsorted_df.sort_index(ascending=False)
       col2      col1
9  0.268549  1.068972
8  0.437690  0.996680
7  1.036863  0.227676
6 -0.856151  0.190755
5 -0.185702  0.771035
4 -1.154169  0.229695
3  0.591896  0.697754
2 -0.866647 -0.154189
1 -0.581885 -0.273574
0 -0.615079 -0.940361
>>> unsorted_df.sort_index(axis=1)
       col1      col2
1 -0.273574 -0.581885
4  0.229695 -1.154169
6  0.190755 -0.856151
2 -0.154189 -0.866647
3  0.697754  0.591896
5  0.771035 -0.185702
9  1.068972  0.268549
8  0.996680  0.437690
0 -0.940361 -0.615079
7  0.227676  1.036863

Sort by values

>>> unsorted_df = pd.DataFrame({'col1': [2, 1, 1, 1], 'col2': [1, 3, 2, 4]})
>>> unsorted_df
   col1  col2
0     2     1
1     1     3
2     1     2
3     1     4
>>> unsorted_df.sort_values(by='col1')
   col1  col2
1     1     3
2     1     2
3     1     4
0     2     1
>>> unsorted_df.sort_values(by='col2')
   col1  col2
0     2     1
2     1     2
1     1     3
3     1     4
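`sort_values` also accepts a list of keys, with a per-key sort direction via `ascending`. A minimal sketch on the same toy data:

```python
import pandas as pd

df = pd.DataFrame({"col1": [2, 1, 1, 1], "col2": [1, 3, 2, 4]})
# sort by col1 ascending, break ties by col2 descending
out = df.sort_values(by=["col1", "col2"], ascending=[True, False])
print(out)
```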
>>> s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234', 'SteveMinsu'])
>>> s
0             Tom
1    William Rick
2            John
3         Alber@t
4             NaN
5            1234
6      SteveMinsu
dtype: object
>>> s.str.lower()
0             tom
1    william rick
2            john
3         alber@t
4             NaN
5            1234
6      steveminsu
dtype: object
>>> s.str.upper()
0             TOM
1    WILLIAM RICK
2            JOHN
3         ALBER@T
4             NaN
5            1234
6      STEVEMINSU
dtype: object
>>> s.str.len()
0     3.0
1    12.0
2     4.0
3     7.0
4     NaN
5     4.0
6    10.0
dtype: float64
>>> s.str.strip()
0             Tom
1    William Rick
2            John
3         Alber@t
4             NaN
5            1234
6      SteveMinsu
dtype: object
>>> s.str.split(' ')
0              [Tom]
1    [William, Rick]
2             [John]
3          [Alber@t]
4                NaN
5             [1234]
6       [SteveMinsu]
dtype: object
>>> s.str.cat(sep=' <=> ')
'Tom <=> William Rick <=> John <=> Alber@t <=> 1234 <=> SteveMinsu'
>>> s.str.get_dummies()   # returns a one-hot encoded DataFrame
   1234  Alber@t  John  SteveMinsu  Tom  William Rick
0     0        0     0           0    1             0
1     0        0     0           0    0             1
2     0        0     1           0    0             0
3     0        1     0           0    0             0
4     0        0     0           0    0             0
5     1        0     0           0    0             0
6     0        0     0           1    0             0
>>> s.str.contains(' ')
0    False
1     True
2    False
3    False
4      NaN
5    False
6    False
dtype: object
>>> s.str.replace('@', '$')
0             Tom
1    William Rick
2            John
3         Alber$t
4             NaN
5            1234
6      SteveMinsu
dtype: object
>>> s.str.repeat(2)
0                      TomTom
1    William RickWilliam Rick
2                    JohnJohn
3              Alber@tAlber@t
4                         NaN
5                    12341234
6        SteveMinsuSteveMinsu
dtype: object
>>> s.str.count('m')
0    1.0
1    1.0
2    0.0
3    0.0
4    NaN
5    0.0
6    0.0
dtype: float64
>>> s.str.startswith('T')
0     True
1    False
2    False
3    False
4      NaN
5    False
6    False
dtype: object
>>> s.str.endswith('t')
0    False
1    False
2    False
3     True
4      NaN
5    False
6    False
dtype: object
>>> s.str.find('e')
0   -1.0
1   -1.0
2   -1.0
3    3.0
4    NaN
5   -1.0
6    2.0
dtype: float64
>>> s.str.findall('e')
0        []
1        []
2        []
3       [e]
4       NaN
5        []
6    [e, e]
dtype: object
>>> s.str.swapcase()
0             tOM
1    wILLIAM rICK
2            jOHN
3         aLBER@T
4             NaN
5            1234
6      sTEVEmINSU
dtype: object
>>> s.str.islower()
0    False
1    False
2    False
3    False
4      NaN
5    False
6    False
dtype: object
>>> s.str.isupper()
0    False
1    False
2    False
3    False
4      NaN
5    False
6    False
dtype: object
>>> s.str.isnumeric()
0    False
1    False
2    False
3    False
4      NaN
5     True
6    False
dtype: object
>>> pd.get_option('display.max_rows')
60
>>> pd.get_option('display.max_columns')
20
>>> pd.set_option('display.max_rows', 80)
>>> pd.set_option('display.max_columns', 32)
>>> pd.get_option('display.max_rows')
80
>>> pd.get_option('display.max_columns')
32
>>> pd.reset_option('display.max_rows')
>>> pd.reset_option('display.max_columns')
>>> pd.get_option('display.max_rows')
60
>>> pd.get_option('display.max_columns')
20
>>> pd.describe_option('display.max_rows')
display.max_rows : int
    If max_rows is exceeded, switch to truncate view. Depending on
    `large_repr`, objects are either centrally truncated or printed as
    a summary view. 'None' value means unlimited.
    In case python/IPython is running in a terminal and `large_repr`
    equals 'truncate' this can be set to 0 and pandas will auto-detect
    the height of the terminal and print a truncated object which fits
    the screen height. The IPython notebook, IPython qtconsole, or
    IDLE do not run in a terminal and hence it is not possible to do
    correct auto-detection.
    [default: 60] [currently: 60]
>>> pd.get_option('display.expand_frame_repr')
True
>>> pd.get_option('display.max_colwidth')
50
>>> pd.get_option('display.precision')
6
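To change options only temporarily, `pd.option_context` overrides them inside a `with` block and restores the previous values on exit. A small sketch:

```python
import pandas as pd

before = pd.get_option("display.max_rows")
with pd.option_context("display.max_rows", 5):
    # inside the block the option is overridden
    assert pd.get_option("display.max_rows") == 5
# outside, the previous value is restored automatically
assert pd.get_option("display.max_rows") == before
```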
>>> df = pd.DataFrame(np.random.randn(8, 4),
...                   index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
...                   columns=['A', 'B', 'C', 'D'])
>>> df
          A         B         C         D
a  0.455194  0.916801 -0.759962 -0.294460
b -0.822786  1.264703 -0.647666  0.165351
c -2.177918 -0.618071  0.589963  0.507757
d -0.962701  0.036545 -0.723706  0.475886
e  0.311827 -1.633546 -1.521727 -0.894687
f  0.853150  1.930345 -0.404521 -0.048161
g  0.070669  0.694768 -1.240335 -0.129544
h  0.943131  0.209278 -0.841272 -0.475150
>>> df.loc[:, 'A']
a    0.455194
b   -0.822786
c   -2.177918
d   -0.962701
e    0.311827
f    0.853150
g    0.070669
h    0.943131
Name: A, dtype: float64
>>> df.loc[:, 'B']
a    0.916801
b    1.264703
c   -0.618071
d    0.036545
e   -1.633546
f    1.930345
g    0.694768
h    0.209278
Name: B, dtype: float64
>>> df.loc[:, ['B', 'C']]
          B         C
a  0.916801 -0.759962
b  1.264703 -0.647666
c -0.618071  0.589963
d  0.036545 -0.723706
e -1.633546 -1.521727
f  1.930345 -0.404521
g  0.694768 -1.240335
h  0.209278 -0.841272
>>> df.loc[['a', 'b', 'f', 'h'], ['B', 'C']]
          B         C
a  0.916801 -0.759962
b  1.264703 -0.647666
f  1.930345 -0.404521
h  0.209278 -0.841272
>>> df.loc['a':'d']
          A         B         C         D
a  0.455194  0.916801 -0.759962 -0.294460
b -0.822786  1.264703 -0.647666  0.165351
c -2.177918 -0.618071  0.589963  0.507757
d -0.962701  0.036545 -0.723706  0.475886
>>> df.loc['a'] > 0
A     True
B     True
C    False
D    False
Name: a, dtype: bool
-----------------------------------------------------------------------------
>>> df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
>>> df
          A         B         C         D
0 -1.110185 -0.158177  1.246568 -0.436214
1  0.457405  0.661588  0.859702 -0.069158
2  1.327901  0.050303 -0.661015 -1.182793
3 -2.835100  0.115826 -0.100475 -1.428037
4  0.341655 -0.099397  0.897803  1.424128
5  0.416073 -1.304400  1.090493 -0.168708
6 -0.271063  0.152941  0.290858 -0.172220
7 -0.602104 -0.011596  1.252995  1.198668
>>> df.loc[:4]   # label-based: the stop label 4 is included
          A         B         C         D
0 -1.110185 -0.158177  1.246568 -0.436214
1  0.457405  0.661588  0.859702 -0.069158
2  1.327901  0.050303 -0.661015 -1.182793
3 -2.835100  0.115826 -0.100475 -1.428037
4  0.341655 -0.099397  0.897803  1.424128
>>> df.iloc[:4]  # position-based: the stop position 4 is excluded
          A         B         C         D
0 -1.110185 -0.158177  1.246568 -0.436214
1  0.457405  0.661588  0.859702 -0.069158
2  1.327901  0.050303 -0.661015 -1.182793
3 -2.835100  0.115826 -0.100475 -1.428037
>>> df.iloc[1:5, 2:4]
          C         D
1  0.859702 -0.069158
2 -0.661015 -1.182793
3 -0.100475 -1.428037
4  0.897803  1.424128
>>> df.iloc[[1, 3, 5], [1, 3]]
          B         D
1  0.661588 -0.069158
3  0.115826 -1.428037
5 -1.304400 -0.168708
>>> df.iloc[1:3, :]
          A         B         C         D
1  0.457405  0.661588  0.859702 -0.069158
2  1.327901  0.050303 -0.661015 -1.182793
>>> df.iloc[:, 1:3]
          B         C
0 -0.158177  1.246568
1  0.661588  0.859702
2  0.050303 -0.661015
3  0.115826 -0.100475
4 -0.099397  0.897803
5 -1.304400  1.090493
6  0.152941  0.290858
7 -0.011596  1.252995
-----------------------------------------------------------------------------
>>> df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
>>> df.ix[:4]   # note: .ix was deprecated and removed in pandas 1.0; use .loc / .iloc
          A         B         C         D
0 -0.634628 -0.898134  1.614606  1.694535
1  0.128295  1.064419  0.989894 -0.667118
2 -1.291854  1.586323 -0.858613 -0.444279
3 -0.299627 -1.419140 -0.136024  0.982406
4  0.140499 -1.414128  1.247044  1.510659
>>> df
          A         B         C         D
0 -0.634628 -0.898134  1.614606  1.694535
1  0.128295  1.064419  0.989894 -0.667118
2 -1.291854  1.586323 -0.858613 -0.444279
3 -0.299627 -1.419140 -0.136024  0.982406
4  0.140499 -1.414128  1.247044  1.510659
5 -1.713574  0.812285  1.236715 -1.787382
6 -0.235543 -0.607812  0.667951  0.104684
7  0.171486 -0.213113  0.877173  0.259829
>>> df.ix[:, 'A']
0   -0.634628
1    0.128295
2   -1.291854
3   -0.299627
4    0.140499
5   -1.713574
6   -0.235543
7    0.171486
Name: A, dtype: float64
>>> df['A']
0   -0.634628
1    0.128295
2   -1.291854
3   -0.299627
4    0.140499
5   -1.713574
6   -0.235543
7    0.171486
Name: A, dtype: float64
>>> df[['A', 'B']]
          A         B
0 -0.634628 -0.898134
1  0.128295  1.064419
2 -1.291854  1.586323
3 -0.299627 -1.419140
4  0.140499 -1.414128
5 -1.713574  0.812285
6 -0.235543 -0.607812
7  0.171486 -0.213113
>>> df[2:2]
Empty DataFrame
Columns: [A, B, C, D]
Index: []
>>> df.A
0   -0.634628
1    0.128295
2   -1.291854
3   -0.299627
4    0.140499
5   -1.713574
6   -0.235543
7    0.171486
Name: A, dtype: float64
>>> import pandas as pd
>>> import numpy as np
>>> s = pd.Series([1, 2, 3, 4, 5, 4])
>>> df = pd.DataFrame(np.random.randn(5, 2))
>>> s.pct_change()
0         NaN
1    1.000000
2    0.500000
3    0.333333
4    0.250000
5   -0.200000
dtype: float64
>>> df.pct_change()
           0         1
0        NaN       NaN
1  -0.951021  0.108696
2 -13.608067 -3.055153
3  -1.858201 -1.679508
4   1.505558  0.888156
>>> s1 = pd.Series(np.random.randn(10))
>>> s2 = pd.Series(np.random.randn(10))
>>> s1
0   -1.183573
1   -0.073826
2   -0.377842
3   -0.552775
4   -0.549670
5   -0.459434
6    0.057606
7   -2.024247
8   -1.229048
9    0.560924
dtype: float64
>>> s2
0    0.163677
1    0.035014
2   -2.201062
3    0.772416
4   -0.102994
5    0.095863
6    0.668034
7   -0.027760
8    0.304427
9    0.473143
dtype: float64
>>> s1.cov(s2)
0.034697580870772814
>>> s1.corr(s2)
0.056473055220765137
>>> s = pd.Series(np.random.randn(5), index=list('abcde'))
>>> s
a    1.864940
b   -0.912708
c    2.362840
d   -0.886362
e   -1.605373
dtype: float64
>>> s = pd.Series(np.random.randn(5), index=list('abcde'))
>>> s
a    0.106402
b    0.206073
c   -1.894801
d   -0.648935
e   -0.085949
dtype: float64
>>> s['d']
-0.6489348524171517
>>> s['d'] = s['b']
>>> s
a    0.106402
b    0.206073
c   -1.894801
d    0.206073
e   -0.085949
dtype: float64
>>> s.rank()
a    3.0
b    4.5
c    1.0
d    4.5
e    2.0
dtype: float64
>>> s.rank(ascending=False)
a    3.0
b    1.5
c    5.0
d    1.5
e    4.0
dtype: float64
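The `.5` ranks above come from ties: by default `rank` uses `method='average'`, which splits tied ranks evenly. The `method` parameter offers alternatives such as `'min'`, `'dense'`, and `'first'`. A quick sketch:

```python
import pandas as pd

s = pd.Series([3, 1, 3, 2])
print(s.rank())              # ties share the average rank: 3.5, 1.0, 3.5, 2.0
print(s.rank(method="min"))  # ties share the lowest rank:  3.0, 1.0, 3.0, 2.0
```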
>>> df = pd.DataFrame(np.random.randn(10, 4),
...                   index=pd.date_range('1/1/2020', periods=10),
...                   columns=['A', 'B', 'C', 'D'])
>>> df
                   A         B         C         D
2020-01-01  0.781472  0.095355  1.133631  2.108741
2020-01-02  0.387977  2.193452  0.310114  0.440475
2020-01-03 -0.790628 -1.005678 -0.606390 -1.977900
2020-01-04 -1.314656  0.166715  0.520299 -0.440195
2020-01-05  1.548197  1.037625 -1.340270 -0.812376
2020-01-06 -0.484046 -0.134035  1.878507  0.720718
2020-01-07  0.777485  0.439963  0.183439  0.082467
2020-01-08 -0.071771  0.226365  1.043121 -1.016571
2020-01-09  0.972678 -1.225537  0.524640 -0.944867
2020-01-10 -1.187026 -0.919100 -0.552567 -0.428873
>>> df.rolling(window=3).mean()
                   A         B         C         D
2020-01-01       NaN       NaN       NaN       NaN
2020-01-02       NaN       NaN       NaN       NaN
2020-01-03  0.126274  0.427710  0.279118  0.190439
2020-01-04 -0.572436  0.451496  0.074674 -0.659207
2020-01-05 -0.185696  0.066221 -0.475454 -1.076824
2020-01-06 -0.083502  0.356768  0.352845 -0.177284
2020-01-07  0.613879  0.447851  0.240559 -0.003063
2020-01-08  0.073889  0.177431  1.035022 -0.071129
2020-01-09  0.559464 -0.186403  0.583733 -0.626324
2020-01-10 -0.095373 -0.639424  0.338398 -0.796770
>>> df['A'][:3]
2020-01-01    0.781472
2020-01-02    0.387977
2020-01-03   -0.790628
Freq: D, Name: A, dtype: float64
>>> df['A'][:3].mean()
0.12627354579916547
>>> df['A'][1:4].mean()
-0.572435731528301
>>> df.expanding(min_periods=3).mean()
                   A         B         C         D
2020-01-01       NaN       NaN       NaN       NaN
2020-01-02       NaN       NaN       NaN       NaN
2020-01-03  0.126274  0.427710  0.279118  0.190439
2020-01-04 -0.233959  0.362461  0.339413  0.032780
2020-01-05  0.122472  0.497494  0.003477 -0.136251
2020-01-06  0.021386  0.392239  0.315982  0.006577
2020-01-07  0.129400  0.399057  0.297047  0.017419
2020-01-08  0.104254  0.377470  0.390306 -0.111830
2020-01-09  0.200745  0.199358  0.405232 -0.204390
2020-01-10  0.061968  0.087512  0.309452 -0.226838
>>> df['A'][:4].mean()
-0.23395888332354467
>>> df.ewm(com=0.5).mean()
                   A         B         C         D
2020-01-01  0.781472  0.095355  1.133631  2.108741
2020-01-02  0.486351  1.668928  0.515993  0.857542
2020-01-03 -0.397712 -0.182722 -0.261042 -1.105457
2020-01-04 -1.016649  0.053148  0.266363 -0.656405
2020-01-05  0.700314  0.712178 -0.809152 -0.760815
2020-01-06 -0.090344  0.147261  0.985082  0.228230
2020-01-07  0.488473  0.342485  0.450409  0.131011
2020-01-08  0.114920  0.265060  0.845611 -0.634160
2020-01-09  0.686788 -0.728722  0.631619 -0.841309
2020-01-10 -0.562442 -0.855643 -0.157851 -0.566347
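Besides the built-in statistics like mean, a rolling window can also apply an arbitrary function via `.apply`. A minimal sketch computing the range (max minus min) of each 3-value window:

```python
import pandas as pd

s = pd.Series([1.0, 4.0, 2.0, 8.0, 5.0])
# range over each rolling window of 3; the first two positions are NaN
out = s.rolling(window=3).apply(lambda x: x.max() - x.min())
print(out)
```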
>>> df = pd.DataFrame(np.random.randn(10, 4),
...                   index=pd.date_range('1/1/2019', periods=10),
...                   columns=['A', 'B', 'C', 'D'])
>>> df
                   A         B         C         D
2019-01-01  1.191096 -0.840763 -0.163492 -0.126426
2019-01-02  0.059333  0.080026  0.446028 -1.063084
2019-01-03 -1.630649 -0.193603  0.306099 -0.482789
2019-01-04 -0.443204 -0.003095 -0.692355  0.376097
2019-01-05 -1.156775 -0.055062  0.347634 -0.546784
2019-01-06 -1.703101 -0.839752 -0.201902 -0.517568
2019-01-07 -0.872392 -0.186374 -0.754593  0.582140
2019-01-08 -1.552499 -0.158559  0.002236 -0.812326
2019-01-09  1.205949  0.517547  1.005842  1.247508
2019-01-10 -0.903890 -1.144626  0.952863  0.636728
>>> df.rolling(window=3, min_periods=1)
Rolling [window=3,min_periods=1,center=False,axis=0]
>>> r = df.rolling(window=3, min_periods=1)
>>> r.aggregate(np.sum)
                   A         B         C         D
2019-01-01  1.191096 -0.840763 -0.163492 -0.126426
2019-01-02  1.250429 -0.760737  0.282536 -1.189511
2019-01-03 -0.380220 -0.954340  0.588635 -1.672299
2019-01-04 -2.014520 -0.116672  0.059772 -1.169776
2019-01-05 -3.230628 -0.251760 -0.038621 -0.653475
2019-01-06 -3.303081 -0.897908 -0.546623 -0.688255
2019-01-07 -3.732268 -1.081187 -0.608861 -0.482212
2019-01-08 -4.127992 -1.184684 -0.954260 -0.747754
2019-01-09 -1.218942  0.172614  0.253485  1.017322
2019-01-10 -1.250440 -0.785638  1.960941  1.071911
>>> df['A'][1:4].sum()
-2.014520142409461
>>> df['A'][2:5].sum()
-3.230628185736731
>>> r['A'].aggregate(np.sum)
2019-01-01    1.191096
2019-01-02    1.250429
2019-01-03   -0.380220
2019-01-04   -2.014520
2019-01-05   -3.230628
2019-01-06   -3.303081
2019-01-07   -3.732268
2019-01-08   -4.127992
2019-01-09   -1.218942
2019-01-10   -1.250440
Freq: D, Name: A, dtype: float64
>>> r[['A', 'B']].aggregate(np.sum)
                   A         B
2019-01-01  1.191096 -0.840763
2019-01-02  1.250429 -0.760737
2019-01-03 -0.380220 -0.954340
2019-01-04 -2.014520 -0.116672
2019-01-05 -3.230628 -0.251760
2019-01-06 -3.303081 -0.897908
2019-01-07 -3.732268 -1.081187
2019-01-08 -4.127992 -1.184684
2019-01-09 -1.218942  0.172614
2019-01-10 -1.250440 -0.785638
>>> r['A'].aggregate([np.sum, np.mean])
                 sum      mean
2019-01-01  1.191096  1.191096
2019-01-02  1.250429  0.625214
2019-01-03 -0.380220 -0.126740
2019-01-04 -2.014520 -0.671507
2019-01-05 -3.230628 -1.076876
2019-01-06 -3.303081 -1.101027
2019-01-07 -3.732268 -1.244089
2019-01-08 -4.127992 -1.375997
2019-01-09 -1.218942 -0.406314
2019-01-10 -1.250440 -0.416813
>>> r[['A', 'B']].aggregate([np.sum, np.mean])
                   A                   B
                 sum      mean       sum      mean
2019-01-01  1.191096  1.191096 -0.840763 -0.840763
2019-01-02  1.250429  0.625214 -0.760737 -0.380369
2019-01-03 -0.380220 -0.126740 -0.954340 -0.318113
2019-01-04 -2.014520 -0.671507 -0.116672 -0.038891
2019-01-05 -3.230628 -1.076876 -0.251760 -0.083920
2019-01-06 -3.303081 -1.101027 -0.897908 -0.299303
2019-01-07 -3.732268 -1.244089 -1.081187 -0.360396
2019-01-08 -4.127992 -1.375997 -1.184684 -0.394895
2019-01-09 -1.218942 -0.406314  0.172614  0.057538
2019-01-10 -1.250440 -0.416813 -0.785638 -0.261879
>>> r.aggregate({'A': np.sum, 'B': np.mean})
                   A         B
2019-01-01  1.191096 -0.840763
2019-01-02  1.250429 -0.380369
2019-01-03 -0.380220 -0.318113
2019-01-04 -2.014520 -0.038891
2019-01-05 -3.230628 -0.083920
2019-01-06 -3.303081 -0.299303
2019-01-07 -3.732268 -0.360396
2019-01-08 -4.127992 -0.394895
2019-01-09 -1.218942  0.057538
2019-01-10 -1.250440 -0.261879
>>> df
        one       two     three
a  0.117532  0.514862 -1.887277
b       NaN       NaN       NaN
c  1.570501 -0.430070  0.344063
d       NaN       NaN       NaN
e  0.271454 -0.062202 -0.881098
f  0.638614  0.362068  0.574669
g       NaN       NaN       NaN
h  1.639276 -0.018913 -2.221013
>>> df['one'].isnull()
a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool
>>> df['one'].notnull()
a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: one, dtype: bool
>>> df['one'].sum()   # NaN values are skipped
4.237377230983332
>>> df = pd.DataFrame(index=[0, 1, 2, 3, 4, 5], columns=['one', 'two'])
>>> df
   one  two
0  NaN  NaN
1  NaN  NaN
2  NaN  NaN
3  NaN  NaN
4  NaN  NaN
5  NaN  NaN
>>> df['one'].sum()
0
>>> df.fillna(0)
   one  two
0    0    0
1    0    0
2    0    0
3    0    0
4    0    0
5    0    0
>>> df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
...                   columns=['one', 'two', 'three'])
>>> df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
>>> df
        one       two     three
a  1.128926  0.236342  1.026271
b       NaN       NaN       NaN
c  0.460429  0.563981  0.550005
d       NaN       NaN       NaN
e  0.005599  0.369428 -0.445125
f  0.489827  1.544342  2.316409
g       NaN       NaN       NaN
h  0.127825  0.664470  0.744669
>>> df.fillna(method='pad')       # forward fill
        one       two     three
a  1.128926  0.236342  1.026271
b  1.128926  0.236342  1.026271
c  0.460429  0.563981  0.550005
d  0.460429  0.563981  0.550005
e  0.005599  0.369428 -0.445125
f  0.489827  1.544342  2.316409
g  0.489827  1.544342  2.316409
h  0.127825  0.664470  0.744669
>>> df.fillna(method='backfill')  # backward fill
        one       two     three
a  1.128926  0.236342  1.026271
b  0.460429  0.563981  0.550005
c  0.460429  0.563981  0.550005
d  0.005599  0.369428 -0.445125
e  0.005599  0.369428 -0.445125
f  0.489827  1.544342  2.316409
g  0.127825  0.664470  0.744669
h  0.127825  0.664470  0.744669
>>> df.dropna()
        one       two     three
a  1.128926  0.236342  1.026271
c  0.460429  0.563981  0.550005
e  0.005599  0.369428 -0.445125
f  0.489827  1.544342  2.316409
h  0.127825  0.664470  0.744669
>>> df = pd.DataFrame({'one': [10, 20, 30, 40, 50, 2000],
...                    'two': [1000, 0, 30, 40, 50, 60]})
>>> df
    one   two
0    10  1000
1    20     0
2    30    30
3    40    40
4    50    50
5  2000    60
>>> df.replace({1000: 10, 2000: 60})
   one  two
0   10   10
1   20    0
2   30   30
3   40   40
4   50   50
5   60   60
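Beyond constants and directional fills, a common variant is to fill each column's gaps with that column's own statistic, e.g. its mean. `fillna` accepts a Series keyed by column name, which is exactly what `df.mean()` returns; a minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"one": [1.0, np.nan, 3.0], "two": [np.nan, 2.0, 4.0]})
# fillna with a Series: each column is filled with its own mean
filled = df.fillna(df.mean())
print(filled)
```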
>>> ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
...             'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
...             'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
...             'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
>>> df = pd.DataFrame(ipl_data)
>>> df
    Points  Rank    Team  Year
0      876     1  Riders  2014
1      789     2  Riders  2015
2      863     2  Devils  2014
3      673     3  Devils  2015
4      741     3   Kings  2014
5      812     4   kings  2015
6      756     1   Kings  2016
7      788     1   Kings  2017
8      694     2  Riders  2016
9      701     4  Royals  2014
10     804     1  Royals  2015
11     690     2  Riders  2017
>>> df.groupby('Team')
>>> df.groupby('Team').groups
{'Devils': Int64Index([2, 3], dtype='int64'),
 'Kings': Int64Index([4, 6, 7], dtype='int64'),
 'Riders': Int64Index([0, 1, 8, 11], dtype='int64'),
 'Royals': Int64Index([9, 10], dtype='int64'),
 'kings': Int64Index([5], dtype='int64')}
>>> df.groupby(['Team', 'Year']).groups
{('Devils', 2014): Int64Index([2], dtype='int64'),
 ('Devils', 2015): Int64Index([3], dtype='int64'),
 ('Kings', 2014): Int64Index([4], dtype='int64'),
 ('Kings', 2016): Int64Index([6], dtype='int64'),
 ('Kings', 2017): Int64Index([7], dtype='int64'),
 ('Riders', 2014): Int64Index([0], dtype='int64'),
 ('Riders', 2015): Int64Index([1], dtype='int64'),
 ('Riders', 2016): Int64Index([8], dtype='int64'),
 ('Riders', 2017): Int64Index([11], dtype='int64'),
 ('Royals', 2014): Int64Index([9], dtype='int64'),
 ('Royals', 2015): Int64Index([10], dtype='int64'),
 ('kings', 2015): Int64Index([5], dtype='int64')}
>>> grouped = df.groupby('Year')
>>> for name, group in grouped:
...     print(name)
...     print(group)
...
2014
   Points  Rank    Team  Year
0     876     1  Riders  2014
2     863     2  Devils  2014
4     741     3   Kings  2014
9     701     4  Royals  2014
2015
    Points  Rank    Team  Year
1      789     2  Riders  2015
3      673     3  Devils  2015
5      812     4   kings  2015
10     804     1  Royals  2015
2016
   Points  Rank    Team  Year
6     756     1   Kings  2016
8     694     2  Riders  2016
2017
    Points  Rank    Team  Year
7      788     1   Kings  2017
11     690     2  Riders  2017
>>> df.groupby('Year').get_group(2014)
   Points  Rank    Team  Year
0     876     1  Riders  2014
2     863     2  Devils  2014
4     741     3   Kings  2014
9     701     4  Royals  2014
>>> df.groupby('Year')['Points'].agg(np.mean)
Year
2014    795.25
2015    769.50
2016    725.00
2017    739.00
Name: Points, dtype: float64
>>> grouped = df.groupby('Team')
>>> for name, group in grouped:
...     print(name)
...     print(group)
...
Devils
   Points  Rank    Team  Year
2     863     2  Devils  2014
3     673     3  Devils  2015
Kings
   Points  Rank   Team  Year
4     741     3  Kings  2014
6     756     1  Kings  2016
7     788     1  Kings  2017
Riders
    Points  Rank    Team  Year
0      876     1  Riders  2014
1      789     2  Riders  2015
8      694     2  Riders  2016
11     690     2  Riders  2017
Royals
    Points  Rank    Team  Year
9      701     4  Royals  2014
10     804     1  Royals  2015
kings
   Points  Rank   Team  Year
5     812     4  kings  2015
>>> df.groupby('Team').agg(np.size)
        Points  Rank  Year
Team
Devils       2     2     2
Kings        3     3     3
Riders       4     4     4
Royals       2     2     2
kings        1     1     1
>>> df.groupby('Team')['Points'].agg([np.sum, np.mean, np.std])
         sum        mean         std
Team
Devils  1536  768.000000  134.350288
Kings   2285  761.666667   24.006943
Riders  3049  762.250000   88.567771
Royals  1505  752.500000   72.831998
kings    812  812.000000         NaN
>>> df.groupby('Team').transform(lambda x: (x - x.mean()) / x.std() * 10)
       Points       Rank       Year
0   12.843272 -15.000000 -11.618950
1    3.020286   5.000000  -3.872983
2    7.071068  -7.071068  -7.071068
3   -7.071068   7.071068   7.071068
4   -8.608621  11.547005 -10.910895
5         NaN        NaN        NaN
6   -2.360428  -5.773503   2.182179
7   10.969049  -5.773503   8.728716
8   -7.705963   5.000000   3.872983
9   -7.071068   7.071068  -7.071068
10   7.071068  -7.071068   7.071068
11  -8.157595   5.000000  11.618950
>>> df.groupby('Team').filter(lambda x: len(x) >= 3)  # teams appearing 3 or more times
    Points  Rank    Team  Year
0      876     1  Riders  2014
1      789     2  Riders  2015
4      741     3   Kings  2014
6      756     1   Kings  2016
7      788     1   Kings  2017
8      694     2  Riders  2016
11     690     2  Riders  2017
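`agg` can also take a dict mapping column names to one or several aggregation functions, which yields a frame with one column (or a column group) per aggregation. A small sketch with made-up data:

```python
import pandas as pd

# Small stand-in for the ipl_data frame above (values are made up)
df = pd.DataFrame({'Team': ['A', 'A', 'B'], 'Points': [10, 20, 5]})

# A dict of column -> list of functions produces MultiIndex columns
res = df.groupby('Team').agg({'Points': ['sum', 'mean']})

# Individual results are addressed with (column, function) tuples
total_a = res[('Points', 'sum')]['A']   # 10 + 20
```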
>>> left = pd.DataFrame({
...     'id': [1, 2, 3, 4, 5],
...     'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
...     'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
>>> left
     Name  id subject_id
0    Alex   1       sub1
1     Amy   2       sub2
2   Allen   3       sub4
3   Alice   4       sub6
4  Ayoung   5       sub5
>>> right = pd.DataFrame({
...     'id': [1, 2, 3, 4, 5],
...     'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
...     'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})
>>> right
    Name  id subject_id
0  Billy   1       sub2
1  Brian   2       sub4
2   Bran   3       sub3
3  Bryce   4       sub6
4  Betty   5       sub5
>>> pd.merge(left, right, on='id')
   Name_x  id subject_id_x Name_y subject_id_y
0    Alex   1         sub1  Billy         sub2
1     Amy   2         sub2  Brian         sub4
2   Allen   3         sub4   Bran         sub3
3   Alice   4         sub6  Bryce         sub6
4  Ayoung   5         sub5  Betty         sub5
>>> pd.merge(left, right, on=['id', 'subject_id'])
   Name_x  id subject_id Name_y
0   Alice   4       sub6  Bryce
1  Ayoung   5       sub5  Betty
>>> pd.merge(left, right, on='subject_id', how='left')
   Name_x  id_x subject_id Name_y  id_y
0    Alex     1       sub1    NaN   NaN
1     Amy     2       sub2  Billy   1.0
2   Allen     3       sub4  Brian   2.0
3   Alice     4       sub6  Bryce   4.0
4  Ayoung     5       sub5  Betty   5.0
>>> pd.merge(left, right, on='subject_id', how='right')
   Name_x  id_x subject_id Name_y  id_y
0     Amy   2.0       sub2  Billy     1
1   Allen   3.0       sub4  Brian     2
2   Alice   4.0       sub6  Bryce     4
3  Ayoung   5.0       sub5  Betty     5
4     NaN   NaN       sub3   Bran     3
>>> pd.merge(left, right, on='subject_id', how='outer')
   Name_x  id_x subject_id Name_y  id_y
0    Alex   1.0       sub1    NaN   NaN
1     Amy   2.0       sub2  Billy   1.0
2   Allen   3.0       sub4  Brian   2.0
3   Alice   4.0       sub6  Bryce   4.0
4  Ayoung   5.0       sub5  Betty   5.0
5     NaN   NaN       sub3   Bran   3.0
>>> pd.merge(left, right, on='subject_id', how='inner')
   Name_x  id_x subject_id Name_y  id_y
0     Amy     2       sub2  Billy     1
1   Allen     3       sub4  Brian     2
2   Alice     4       sub6  Bryce     4
3  Ayoung     5       sub5  Betty     5
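When debugging a merge it is often useful to know which side each result row came from; `pd.merge` supports this with `indicator=True`, which adds a `_merge` column. A minimal sketch with made-up frames:

```python
import pandas as pd

# Two tiny frames that only partially overlap on key 'k' (illustrative data)
left = pd.DataFrame({'k': [1, 2], 'a': ['x', 'y']})
right = pd.DataFrame({'k': [2, 3], 'b': ['p', 'q']})

# indicator=True adds a _merge column with values
# 'left_only', 'right_only', or 'both'
m = pd.merge(left, right, on='k', how='outer', indicator=True)
origins = m['_merge'].astype(str).tolist()
```

For this data the `_merge` column reads `left_only`, `both`, `right_only` for keys 1, 2, 3 respectively.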
>>> import numpy as np
>>> import pandas as pd
>>> one = pd.DataFrame({
...     'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
...     'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5'],
...     'Marks_scored': [98, 90, 87, 69, 78]},
...     index=[1, 2, 3, 4, 5])
>>> two = pd.DataFrame({
...     'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
...     'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5'],
...     'Marks_scored': [89, 80, 79, 97, 88]},
...     index=[1, 2, 3, 4, 5])
>>> one
   Marks_scored    Name subject_id
1            98    Alex       sub1
2            90     Amy       sub2
3            87   Allen       sub4
4            69   Alice       sub6
5            78  Ayoung       sub5
>>> two
   Marks_scored   Name subject_id
1            89  Billy       sub2
2            80  Brian       sub4
3            79   Bran       sub3
4            97  Bryce       sub6
5            88  Betty       sub5
>>> rs = pd.concat([one, two])
>>> rs
   Marks_scored    Name subject_id
1            98    Alex       sub1
2            90     Amy       sub2
3            87   Allen       sub4
4            69   Alice       sub6
5            78  Ayoung       sub5
1            89   Billy       sub2
2            80   Brian       sub4
3            79    Bran       sub3
4            97   Bryce       sub6
5            88   Betty       sub5
>>> rs = pd.concat([one, two], keys=['x', 'y'])
>>> rs
     Marks_scored    Name subject_id
x 1            98    Alex       sub1
  2            90     Amy       sub2
  3            87   Allen       sub4
  4            69   Alice       sub6
  5            78  Ayoung       sub5
y 1            89   Billy       sub2
  2            80   Brian       sub4
  3            79    Bran       sub3
  4            97   Bryce       sub6
  5            88   Betty       sub5
>>> rs = pd.concat([one, two], keys=['x', 'y'], ignore_index=True)
>>> rs
   Marks_scored    Name subject_id
0            98    Alex       sub1
1            90     Amy       sub2
2            87   Allen       sub4
3            69   Alice       sub6
4            78  Ayoung       sub5
5            89   Billy       sub2
6            80   Brian       sub4
7            79    Bran       sub3
8            97   Bryce       sub6
9            88   Betty       sub5
>>> pd.concat([one, two], axis=1)
   Marks_scored    Name subject_id  Marks_scored   Name subject_id
1            98    Alex       sub1            89  Billy       sub2
2            90     Amy       sub2            80  Brian       sub4
3            87   Allen       sub4            79   Bran       sub3
4            69   Alice       sub6            97  Bryce       sub6
5            78  Ayoung       sub5            88  Betty       sub5
>>> pd.merge(one, two)
Empty DataFrame
Columns: [Marks_scored, Name, subject_id]
Index: []
>>> pd.merge(one, two, on='subject_id')
   Marks_scored_x  Name_x subject_id  Marks_scored_y Name_y
0              90     Amy       sub2              89  Billy
1              87   Allen       sub4              80  Brian
2              69   Alice       sub6              97  Bryce
3              78  Ayoung       sub5              88  Betty
>>> pd.concat([one, two], axis=1, join='outer')
   Marks_scored    Name subject_id  Marks_scored   Name subject_id
1            98    Alex       sub1            89  Billy       sub2
2            90     Amy       sub2            80  Brian       sub4
3            87   Allen       sub4            79   Bran       sub3
4            69   Alice       sub6            97  Bryce       sub6
5            78  Ayoung       sub5            88  Betty       sub5
>>> one.append(two)
   Marks_scored    Name subject_id
1            98    Alex       sub1
2            90     Amy       sub2
3            87   Allen       sub4
4            69   Alice       sub6
5            78  Ayoung       sub5
1            89   Billy       sub2
2            80   Brian       sub4
3            79    Bran       sub3
4            97   Bryce       sub6
5            88   Betty       sub5
>>> one
   Marks_scored    Name subject_id
1            98    Alex       sub1
2            90     Amy       sub2
3            87   Allen       sub4
4            69   Alice       sub6
5            78  Ayoung       sub5
>>> one.append([two, one, two])
   Marks_scored    Name subject_id
1            98    Alex       sub1
2            90     Amy       sub2
3            87   Allen       sub4
4            69   Alice       sub6
5            78  Ayoung       sub5
1            89   Billy       sub2
2            80   Brian       sub4
3            79    Bran       sub3
4            97   Bryce       sub6
5            88   Betty       sub5
1            98    Alex       sub1
2            90     Amy       sub2
3            87   Allen       sub4
4            69   Alice       sub6
5            78  Ayoung       sub5
1            89   Billy       sub2
2            80   Brian       sub4
3            79    Bran       sub3
4            97   Bryce       sub6
5            88   Betty       sub5
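Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0; `pd.concat` is the replacement. A minimal sketch of the equivalent call:

```python
import pandas as pd

# Two small illustrative frames
a = pd.DataFrame({'v': [1, 2]})
b = pd.DataFrame({'v': [3]})

# pd.concat replaces the deprecated a.append(b);
# ignore_index=True renumbers the result 0..n-1
combined = pd.concat([a, b], ignore_index=True)
```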
>>> pd.date_range("12:00", "23:59", freq="30min").time
array([datetime.time(12, 0), datetime.time(12, 30), datetime.time(13, 0),
       datetime.time(13, 30), datetime.time(14, 0), datetime.time(14, 30),
       datetime.time(15, 0), datetime.time(15, 30), datetime.time(16, 0),
       datetime.time(16, 30), datetime.time(17, 0), datetime.time(17, 30),
       datetime.time(18, 0), datetime.time(18, 30), datetime.time(19, 0),
       datetime.time(19, 30), datetime.time(20, 0), datetime.time(20, 30),
       datetime.time(21, 0), datetime.time(21, 30), datetime.time(22, 0),
       datetime.time(22, 30), datetime.time(23, 0), datetime.time(23, 30)],
      dtype=object)
>>> pd.date_range("12:00", "23:59", freq="H").time
array([datetime.time(12, 0), datetime.time(13, 0), datetime.time(14, 0),
       datetime.time(15, 0), datetime.time(16, 0), datetime.time(17, 0),
       datetime.time(18, 0), datetime.time(19, 0), datetime.time(20, 0),
       datetime.time(21, 0), datetime.time(22, 0), datetime.time(23, 0)],
      dtype=object)
>>> pd.to_datetime(pd.Series(['Jul 31, 2009', '2019-10-10', None]))
0   2009-07-31
1   2019-10-10
2          NaT
dtype: datetime64[ns]
>>> pd.date_range("12:00", "23:59", freq="2H").time
array([datetime.time(12, 0), datetime.time(14, 0), datetime.time(16, 0),
       datetime.time(18, 0), datetime.time(20, 0), datetime.time(22, 0)],
      dtype=object)
>>> pd.date_range("12:00", "23:59", freq="20min").time
array([datetime.time(12, 0), datetime.time(12, 20), datetime.time(12, 40),
       datetime.time(13, 0), datetime.time(13, 20), datetime.time(13, 40),
       datetime.time(14, 0), datetime.time(14, 20), datetime.time(14, 40),
       datetime.time(15, 0), datetime.time(15, 20), datetime.time(15, 40),
       datetime.time(16, 0), datetime.time(16, 20), datetime.time(16, 40),
       datetime.time(17, 0), datetime.time(17, 20), datetime.time(17, 40),
       datetime.time(18, 0), datetime.time(18, 20), datetime.time(18, 40),
       datetime.time(19, 0), datetime.time(19, 20), datetime.time(19, 40),
       datetime.time(20, 0), datetime.time(20, 20), datetime.time(20, 40),
       datetime.time(21, 0), datetime.time(21, 20), datetime.time(21, 40),
       datetime.time(22, 0), datetime.time(22, 20), datetime.time(22, 40),
       datetime.time(23, 0), datetime.time(23, 20), datetime.time(23, 40)],
      dtype=object)
>>> pd.date_range('2011/11/11', periods=5)
DatetimeIndex(['2011-11-11', '2011-11-12', '2011-11-13', '2011-11-14',
               '2011-11-15'],
              dtype='datetime64[ns]', freq='D')
>>> pd.date_range('2011/11/11', periods=5, freq='M')
DatetimeIndex(['2011-11-30', '2011-12-31', '2012-01-31', '2012-02-29',
               '2012-03-31'],
              dtype='datetime64[ns]', freq='M')
>>> pd.bdate_range('2011/11/11', periods=5)
DatetimeIndex(['2011-11-11', '2011-11-14', '2011-11-15', '2011-11-16',
               '2011-11-17'],
              dtype='datetime64[ns]', freq='B')
>>> pd.bdate_range('2020/7/7', periods=5)
DatetimeIndex(['2020-07-07', '2020-07-08', '2020-07-09', '2020-07-10',
               '2020-07-13'],
              dtype='datetime64[ns]', freq='B')
>>> pd.Timedelta('2 days 2 hours 15 minutes 30 seconds')
Timedelta('2 days 02:15:30')
>>> pd.Timedelta(6, unit='h')
Timedelta('0 days 06:00:00')
>>> pd.Timedelta(days=2)
Timedelta('2 days 00:00:00')
>>> s = pd.Series(pd.date_range('2012-1-1', periods=3, freq='D'))
>>> td = pd.Series([pd.Timedelta(days=i) for i in range(3)])
>>> s
0   2012-01-01
1   2012-01-02
2   2012-01-03
dtype: datetime64[ns]
>>> td
0   0 days
1   1 days
2   2 days
dtype: timedelta64[ns]
>>> df = pd.DataFrame(dict(A=s, B=td))
>>> df
           A      B
0 2012-01-01 0 days
1 2012-01-02 1 days
2 2012-01-03 2 days
>>> df['C'] = df['A'] + df['B']
>>> df
           A      B          C
0 2012-01-01 0 days 2012-01-01
1 2012-01-02 1 days 2012-01-03
2 2012-01-03 2 days 2012-01-05
>>> df['D'] = df['C'] + df['B']
>>> df
           A      B          C          D
0 2012-01-01 0 days 2012-01-01 2012-01-01
1 2012-01-02 1 days 2012-01-03 2012-01-04
2 2012-01-03 2 days 2012-01-05 2012-01-07
>>> df['D'] = df['C'] - df['B']
>>> df
           A      B          C          D
0 2012-01-01 0 days 2012-01-01 2012-01-01
1 2012-01-02 1 days 2012-01-03 2012-01-02
2 2012-01-03 2 days 2012-01-05 2012-01-03
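Timedelta arithmetic also works in the other direction: subtracting two `Timestamp`s yields a `Timedelta`, and dividing one `Timedelta` by another yields a plain float, which is a convenient way to express a duration in arbitrary units. A short sketch:

```python
import pandas as pd

# Timestamps subtract to a Timedelta
delta = pd.Timestamp('2012-01-03') - pd.Timestamp('2012-01-01')

# Dividing Timedeltas converts the duration to a unit of your choice
hours = delta / pd.Timedelta(hours=1)   # 2 days expressed in hours
```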
>>> s = pd.Series(['a', 'b', 'c', 'a'])
>>> s
0    a
1    b
2    c
3    a
dtype: object
>>> s = pd.Series(['a', 'b', 'c', 'a'], dtype='category')
>>> s
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a, b, c]
>>> pd.Categorical(['a', 'b', 'c', 'a'])
[a, b, c, a]
Categories (3, object): [a, b, c]
>>> pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c', 'd'], ['c', 'b', 'a'], ordered=True)
[a, b, c, a, b, c, NaN]
Categories (3, object): [c < b < a]
>>> pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c', 'd'], ordered=True)
[a, b, c, a, b, c, d]
Categories (4, object): [a < b < c < d]
>>> cat = pd.Categorical(['a', 'c', 'c', np.nan], categories=['b', 'a', 'c'])
>>> cat
[a, c, c, NaN]
Categories (3, object): [b, a, c]
>>> df = pd.DataFrame({'cat': cat, 's': ['a', 'c', 'c', np.nan]})
>>> df
   cat    s
0    a    a
1    c    c
2    c    c
3  NaN  NaN
>>> df.describe()
       cat  s
count    3  3
unique   2  2
top      c  c
freq     2  2
>>> df['cat'].describe()
count     3
unique    2
top       c
freq      2
Name: cat, dtype: object
>>> cat.categories
Index(['b', 'a', 'c'], dtype='object')
>>> cat.ordered
False
>>> s = pd.Series(['a', 'b', 'c', 'a'], dtype='category')
>>> s.cat.categories = ["Group %s" % g for g in s.cat.categories]
>>> s.cat.categories
Index(['Group a', 'Group b', 'Group c'], dtype='object')
>>> s
0    Group a
1    Group b
2    Group c
3    Group a
dtype: category
Categories (3, object): [Group a, Group b, Group c]
>>> s = s.cat.add_categories([4])
>>> s.cat.categories
Index(['Group a', 'Group b', 'Group c', 4], dtype='object')
>>> s = pd.Series(['a', 'b', 'c', 'a'], dtype='category')
>>> s.cat.remove_categories('a')
0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (2, object): [b, c]
>>> cat = pd.Series([1, 2, 3]).astype("category", categories=[1, 2, 3], ordered=True)
>>> cat1 = pd.Series([2, 2, 2]).astype("category", categories=[1, 2, 3], ordered=True)
>>> cat > cat1
0    False
1    False
2     True
dtype: bool
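Two notes on the transcript above. First, each categorical value is stored internally as a small integer code into the categories array, which is where the memory savings come from. Second, the `astype("category", categories=..., ordered=True)` form was removed in later pandas versions; its replacement is `pd.CategoricalDtype`. A sketch of both points:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

s = pd.Series(['a', 'b', 'a', 'c'], dtype='category')

# Each value is a small integer code into the categories array
codes = s.cat.codes.tolist()          # e.g. 'a' -> 0, 'b' -> 1, 'c' -> 2
cats = list(s.cat.categories)

# Modern replacement for astype("category", categories=..., ordered=True)
t = CategoricalDtype(categories=[1, 2, 3], ordered=True)
ordered_cat = pd.Series([1, 2, 3]).astype(t)
```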
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.randn(10, 4), index=pd.date_range('2018/12/18', periods=10), columns=list('ABCD'))
>>> df
                   A         B         C         D
2018-12-18  0.584660 -1.210248 -0.003870 -0.877461
2018-12-19  0.688778  0.589648  1.785282  1.086173
2018-12-20  0.414437  1.162100  0.604292 -0.262146
2018-12-21  0.783176  0.663370  0.101690  0.671157
2018-12-22 -0.428488  0.209358  0.413947  0.953112
2018-12-23 -1.260652 -0.491451 -2.068729 -0.451798
2018-12-24  0.758227 -0.081586  2.525143 -1.299484
2018-12-25 -1.137259 -1.564864  0.369936  0.164803
2018-12-26  0.254126  1.003725 -0.603132 -0.078201
2018-12-27 -1.891129  0.381141  2.001805 -1.193164
>>> df.plot()
>>> import matplotlib.pyplot as plt
>>> plt.show()
>>> df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
>>> df
          a         b         c         d
0  0.223895  0.934126  0.776715  0.272412
1  0.054357  0.044082  0.890138  0.692204
2  0.603401  0.522171  0.483764  0.277524
3  0.851305  0.649713  0.559654  0.691692
4  0.453478  0.247265  0.464155  0.733652
5  0.257328  0.348949  0.598477  0.938713
6  0.073403  0.776095  0.235249  0.177406
7  0.255504  0.414014  0.275384  0.243649
8  0.704180  0.320717  0.143664  0.097838
9  0.770821  0.903387  0.507908  0.510510
>>> df.plot.bar()
>>> df.plot.bar(stacked=True)
>>> df.plot.barh(stacked=True)
>>> plt.show()
>>> df = pd.DataFrame(np.random.rand(10, 5), columns=['A', 'B', 'C', 'D', 'E'])
>>> df
          A         B         C         D         E
0  0.128757  0.113134  0.960562  0.232801  0.015381
1  0.832828  0.826641  0.668275  0.411818  0.100598
2  0.896566  0.870508  0.649730  0.272994  0.193057
3  0.993245  0.795654  0.401693  0.062322  0.181763
4  0.907362  0.512175  0.226137  0.362590  0.919171
5  0.189321  0.117297  0.863777  0.957350  0.680298
6  0.377515  0.003974  0.769770  0.483371  0.477716
7  0.857534  0.307702  0.466231  0.141965  0.804760
8  0.449878  0.676380  0.666671  0.960456  0.082041
9  0.841580  0.872631  0.127302  0.386110  0.163839
>>> df.plot.box()
>>> df.plot.area()
>>> df.plot.scatter(x='A', y='B')
>>> plt.show()
>>> df = pd.DataFrame(3 * np.random.rand(4), index=['a', 'b', 'c', 'd'], columns=['x'])
>>> df
          x
a  0.842394
b  0.773727
c  1.644047
d  2.104003
>>> df.plot.pie(subplots=True)
>>> plt.show()
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.read_csv('temp.csv')
>>> df
   S.No    Name  Age       City  Salary
0     1     Tom   28    Toronto   20000
1     2     Lee   32   HongKong    3000
2     3  Steven   43   Bay area    8300
3     4     Ram   38  Hyderabad    3900
>>> df = pd.read_csv('temp.csv', index_col=['S.No'])
>>> df
        Name  Age       City  Salary
S.No
1        Tom   28    Toronto   20000
2        Lee   32   HongKong    3000
3     Steven   43   Bay area    8300
4        Ram   38  Hyderabad    3900
>>> pd.read_csv('temp.csv', dtype={'Salary': np.float64})
   S.No    Name  Age       City   Salary
0     1     Tom   28    Toronto  20000.0
1     2     Lee   32   HongKong   3000.0
2     3  Steven   43   Bay area   8300.0
3     4     Ram   38  Hyderabad   3900.0
>>> pd.read_csv('temp.csv', names=['a', 'b', 'c', 'd', 'e'])
      a       b    c          d       e
0  S.No    Name  Age       City  Salary
1     1     Tom   28    Toronto   20000
2     2     Lee   32   HongKong    3000
3     3  Steven   43   Bay area    8300
4     4     Ram   38  Hyderabad    3900
>>> pd.read_csv('temp.csv', names=['a', 'b', 'c', 'd', 'e'], header=0)
   a       b   c          d      e
0  1     Tom  28    Toronto  20000
1  2     Lee  32   HongKong   3000
2  3  Steven  43   Bay area   8300
3  4     Ram  38  Hyderabad   3900
>>> pd.read_csv('temp.csv', skiprows=2)
   2     Lee  32   HongKong  3000
0  3  Steven  43   Bay area  8300
1  4     Ram  38  Hyderabad  3900
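`read_csv` has two more options that are handy on large files: `usecols` to load only a subset of columns, and `nrows` to load only the first few rows. It also accepts any file-like object, so the sketch below uses an in-memory buffer (with made-up data) instead of a file on disk:

```python
import io
import pandas as pd

# read_csv accepts any file-like object, e.g. an in-memory buffer
csv_text = "S.No,Name,Age\n1,Tom,28\n2,Lee,32\n"

# usecols keeps only the named columns; nrows limits how many rows are read
df = pd.read_csv(io.StringIO(csv_text), usecols=['Name', 'Age'], nrows=1)
```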
>>> ts = pd.Series(np.random.randn(10))
>>> ts[2:-2] = np.nan
>>> sts = ts.to_sparse()
>>> sts
0   -0.585138
1    0.893830
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8    0.135530
9   -1.513085
dtype: float64
BlockIndex
Block locations: array([0, 8])
Block lengths: array([2, 2])
>>> ts
0   -0.585138
1    0.893830
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8    0.135530
9   -1.513085
dtype: float64
>>> sts.to_dense()
0   -0.585138
1    0.893830
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8    0.135530
9   -1.513085
dtype: float64
>>> sts.density
0.4
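The `to_sparse()` / `to_dense()` methods shown above belong to older pandas; in current versions the sparse functionality lives in the `Sparse` extension dtype and `pd.arrays.SparseArray`. A minimal sketch of the modern equivalent:

```python
import numpy as np
import pandas as pd

# Modern pandas stores sparse data via an extension array rather than
# a separate SparseSeries class
arr = pd.arrays.SparseArray([1.0, np.nan, np.nan, 2.0])

# density is the fraction of actually-stored (non-fill) values
d = arr.density   # 2 stored values out of 4 -> 0.5
```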
>>> if pd.Series([False, True, False]):
...     print('hello')
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib64/python3.6/site-packages/pandas/core/generic.py", line 1330, in __nonzero__
    f"The truth value of a {type(self).__name__} is ambiguous. "
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> if pd.Series([False, True, False]).any():
...     print('hello')
...
hello
>>> print(pd.Series([False]).bool())
False
>>> print(pd.Series([True]).bool())
True
>>> s = pd.Series(range(5))
>>> s
0    0
1    1
2    2
3    3
4    4
dtype: int64
>>> s == 3
0    False
1    False
2    False
3     True
4    False
dtype: bool
>>> s = pd.Series(['a', 'b', 'c'])
>>> s
0    a
1    b
2    c
dtype: object
>>> s.isin(['a', 'c', 'e'])
0     True
1    False
2     True
dtype: bool
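Boolean masks like the ones produced by `==` and `isin` can be inverted with the `~` operator, which gives "not equal" / "not in" semantics when filtering. A short sketch:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c'])

# ~ inverts a boolean mask, turning isin into "not in"
mask = ~s.isin(['a', 'c'])
outside = s[mask].tolist()   # only the values not in the list remain
```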
# tips.csv
total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
>>> import pandas as pd
>>> url = 'tips.csv'
>>> tips = pd.read_csv(url)
>>> tips.head()
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
>>> tips[['total_bill', 'tip', 'smoker', 'time']].head(3)
   total_bill   tip smoker    time
0       16.99  1.01     No  Dinner
1       10.34  1.66     No  Dinner
2       21.01  3.50     No  Dinner
>>> tips[tips['time'] == 'Dinner'].head(3)
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
>>> tips
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
>>> tips.groupby('sex')
>>> tips.groupby('sex').size()
sex
Female    2
Male      3
dtype: int64
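`groupby(...).size()` corresponds to SQL's `SELECT sex, COUNT(*) ... GROUP BY sex`; combining `groupby` with `agg` covers the other common SQL aggregates as well. A sketch using a tiny stand-in for the tips data above:

```python
import pandas as pd

# Small stand-in for the tips frame (same columns, fewer rows)
tips = pd.DataFrame({'sex': ['Female', 'Male', 'Male'],
                     'tip': [1.01, 1.66, 3.50]})

# Equivalent of: SELECT sex, AVG(tip), COUNT(*) FROM tips GROUP BY sex
res = tips.groupby('sex')['tip'].agg(['mean', 'count'])
```

`res` is indexed by `sex`, with one column per aggregate, so `res.loc['Male', 'count']` gives the male row count.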