Pandas-笔记3
Table of Contents
- 1. group by, part-VII
- 1.1. read in the data from gapminder.tsv by setup para
delimiter - 1.2.
Series/DataFrame.aggregate() - 1.3. gropy by + mathematical operation
- 1.3.1. using built-in mathematical function for Series
- 1.3.2. groupby + mean()
- 1.3.3. loc_filter + mean()
- 1.3.4. groupby + describe()
- 1.3.5. using user-define function by
agg(user_defined_fn), other than the built-in - 1.3.6. using function of numpy
- 1.3.7. using ONE user_defined_fn
- 1.3.8. using MULTIPLE user_defined_fn
- 1.4. apply() vs. aggregate() vs. transform
- 1.1. read in the data from gapminder.tsv by setup para
Table of Contents
- 1. group by, part-VII
- 1.1. read in the data from gapminder.tsv by setup para
delimiter - 1.2.
Series/DataFrame.aggregate() - 1.3. gropy by + mathematical operation
- 1.3.1. using built-in mathematical function for Series
- 1.3.2. groupby + mean()
- 1.3.3. loc_filter + mean()
- 1.3.4. groupby + describe()
- 1.3.5. using user-define function by
agg(user_defined_fn), other than the built-in - 1.3.6. using function of numpy
- 1.3.7. using ONE user_defined_fn
- 1.3.8. using MULTIPLE user_defined_fn
- 1.4. apply() vs. aggregate() vs. transform
- 1.1. read in the data from gapminder.tsv by setup para
1 group by, part-VII
1.1 read in the data from gapminder.tsv by setup para delimiter
import pandas as pd file_path = '/home/yiddi/worklap/PythonML/Shan-HungWu DL/Pandas/scipy-2017-tutorial-pandas/data/' gapminder = pd.read_csv(file_path+'gapminder.tsv', delimiter='\t') print ( gapminder )
country continent year lifeExp pop gdpPercap 0 Afghanistan Asia 1952 28.801 8425333 779.445314 1 Afghanistan Asia 1957 30.332 9240934 820.853030 2 Afghanistan Asia 1962 31.997 10267083 853.100710 3 Afghanistan Asia 1967 34.020 11537966 836.197138 4 Afghanistan Asia 1972 36.088 13079460 739.981106 5 Afghanistan Asia 1977 38.438 14880372 786.113360 6 Afghanistan Asia 1982 39.854 12881816 978.011439 7 Afghanistan Asia 1987 40.822 13867957 852.395945 8 Afghanistan Asia 1992 41.674 16317921 649.341395 9 Afghanistan Asia 1997 41.763 22227415 635.341351 10 Afghanistan Asia 2002 42.129 25268405 726.734055 11 Afghanistan Asia 2007 43.828 31889923 974.580338 12 Albania Europe 1952 55.230 1282697 1601.056136 13 Albania Europe 1957 59.280 1476505 1942.284244 14 Albania Europe 1962 64.820 1728137 2312.888958 15 Albania Europe 1967 66.220 1984060 2760.196931 16 Albania Europe 1972 67.690 2263554 3313.422188 17 Albania Europe 1977 68.930 2509048 3533.003910 18 Albania Europe 1982 70.420 2780097 3630.880722 19 Albania Europe 1987 72.000 3075321 3738.932735 20 Albania Europe 1992 71.581 3326498 2497.437901 21 Albania Europe 1997 72.950 3428038 3193.054604 22 Albania Europe 2002 75.651 3508512 4604.211737 23 Albania Europe 2007 76.423 3600523 5937.029526 24 Algeria Africa 1952 43.077 9279525 2449.008185 25 Algeria Africa 1957 45.685 10270856 3013.976023 26 Algeria Africa 1962 48.303 11000948 2550.816880 27 Algeria Africa 1967 51.407 12760499 3246.991771 28 Algeria Africa 1972 54.518 14760787 4182.663766 29 Algeria Africa 1977 58.014 17152804 4910.416756 ... ... ... ... ... ... 1674 Yemen, Rep. Asia 1982 49.113 9657618 1977.557010 1675 Yemen, Rep. Asia 1987 52.922 11219340 1971.741538 1676 Yemen, Rep. Asia 1992 55.599 13367997 1879.496673 1677 Yemen, Rep. Asia 1997 58.020 15826497 2117.484526 1678 Yemen, Rep. Asia 2002 60.308 18701257 2234.820827 1679 Yemen, Rep. Asia 2007 62.698 22211743 2280.769906 1680 Zambia Africa 1952 42.038 2672000 1147.388831 1681 Zambia Africa 1957 44.077 3016000 1311.956766 1682 Zambia Africa 1962 46.023 3421000 1452.725766 1683 Zambia Africa 1967 47.768 3900000 1777.077318 1684 Zambia Africa 1972 50.107 4506497 1773.498265 1685 Zambia Africa 1977 51.386 5216550 1588.688299 1686 Zambia Africa 1982 51.821 6100407 1408.678565 1687 Zambia Africa 1987 50.821 7272406 1213.315116 1688 Zambia Africa 1992 46.100 8381163 1210.884633 1689 Zambia Africa 1997 40.238 9417789 1071.353818 1690 Zambia Africa 2002 39.193 10595811 1071.613938 1691 Zambia Africa 2007 42.384 11746035 1271.211593 1692 Zimbabwe Africa 1952 48.451 3080907 406.884115 1693 Zimbabwe Africa 1957 50.469 3646340 518.764268 1694 Zimbabwe Africa 1962 52.358 4277736 527.272182 1695 Zimbabwe Africa 1967 53.995 4995432 569.795071 1696 Zimbabwe Africa 1972 55.635 5861135 799.362176 1697 Zimbabwe Africa 1977 57.674 6642107 685.587682 1698 Zimbabwe Africa 1982 60.363 7636524 788.855041 1699 Zimbabwe Africa 1987 62.351 9216418 706.157306 1700 Zimbabwe Africa 1992 60.377 10704340 693.420786 1701 Zimbabwe Africa 1997 46.809 11404948 792.449960 1702 Zimbabwe Africa 2002 39.989 11926563 672.038623 1703 Zimbabwe Africa 2007 43.487 12311143 469.709298 [1704 rows x 6 columns]
1.2 Series/DataFrame.aggregate()
DataFrame.aggregate(func, axis=0, *args, **kwargs) # func: # function # list of functions # dict of column names -> functions
1.2.1 apply on Series
Functions you want to apply, forms a list/dict as para funcs in .aggregate(funcs=...) .
this function will return a DataFrame, each function create a new column in result DataFrame.
if aggregate takse a dict of (key='colName': value=fn), it will apply certain funcion on certain column
if not 'colName' not exist in DataFrame who call aggregate, it will create a new one with column name given by dict.key
The code here, is Series.aggregate({dict}) so all funcions all applied element-wisely on this column, and create 3 column.
multi_agg = gapminder.groupby('continent')['lifeExp'].aggregate({ 'ncount' : np.count_nonzero, 'mean' : np.mean, 'std' : np.std }).rename(columns = {'ncount': 'count', 'mean' : 'avg', 'std' : 'std_dev' }).reset_index() print ( multi_agg )
1.2.2 apply on DataFrame
DataFrame.aggregate({dict}) can apply different function on different row.
1.3 gropy by + mathematical operation
- count
- siez
- mean
- std
- min
- quantile( q=0.25 )
- max
- sum
- var
- sem
- describe <<< : important! This method will give you the [count, mean, std, min, 25%, 50%, 75%, max] of the column value for each row
- first
- last
- nth
1.3.1 using built-in mathematical function for Series
1.3.2 groupby + mean()
print ( type(gapminder.groupby('year')) ) gapminder.groupby('year')['lifeExp'].mean()
1.3.3 loc_filter + mean()
y1952 = gapminder.loc[gapminder['year'] == 1952, :] print ( y1952['lifeExp'] ) print ( y1952['lifeExp'].mean() )
0 28.801
12 55.230
24 43.077
36 30.015
48 62.485
60 69.120
72 66.800
84 50.939
96 37.484
108 68.000
120 38.223
132 40.414
144 53.820
156 47.622
168 50.917
180 59.600
192 31.975
204 39.031
216 39.417
228 38.523
240 68.750
252 35.463
264 38.092
276 54.745
288 44.000
300 50.643
312 40.715
324 39.143
336 42.111
348 57.206
...
1344 30.331
1356 60.396
1368 64.360
1380 65.570
1392 32.978
1404 45.009
1416 64.940
1428 57.593
1440 38.635
1452 41.407
1464 71.860
1476 69.620
1488 45.883
1500 58.500
1512 41.215
1524 50.848
1536 38.596
1548 59.100
1560 44.600
1572 43.585
1584 39.978
1596 69.180
1608 68.440
1620 66.071
1632 55.088
1644 40.412
1656 43.160
1668 32.548
1680 42.038
1692 48.451
Name: lifeExp, Length: 142, dtype: float64
49.057619718309866
1.3.4 groupby + describe()
print ( gapminder.groupby('year')['lifeExp'].describe() )
count mean std min 25% 50% 75% max year 1952 142.0 49.057620 12.225956 28.801 39.05900 45.1355 59.76500 72.670 1957 142.0 51.507401 12.231286 30.332 41.24750 48.3605 63.03675 73.470 1962 142.0 53.609249 12.097245 31.997 43.46850 50.8810 65.23450 73.680 1967 142.0 55.678290 11.718858 34.020 46.03375 53.8250 67.41950 74.160 1972 142.0 57.647386 11.381953 35.400 48.50025 56.5300 69.24750 74.720 1977 142.0 59.570157 11.227229 31.220 50.47550 59.6720 70.38250 76.110 1982 142.0 61.533197 10.770618 38.445 52.94000 62.4415 70.92125 77.110 1987 142.0 63.212613 10.556285 39.906 54.94075 65.8340 71.87725 78.670 1992 142.0 64.160338 11.227380 23.599 56.12175 67.7030 72.58250 79.360 1997 142.0 65.014676 11.559439 36.087 55.63375 69.3940 74.16975 80.690 2002 142.0 65.694923 12.279823 39.193 55.52225 70.8255 75.45925 82.000 2007 142.0 67.007423 12.073021 39.613 57.16025 71.9355 76.41325 82.603
1.3.5 using user-define function by agg(user_defined_fn), other than the built-in
DataFrame.aggregate(func, axis=0, *args, **kwargs) # func: # function # list of functions # dict of column names -> functions
1.3.6 using function of numpy
import numpy as np agg_mean = gapminder.groupby('continent')['lifeExp'].agg(np.mean) agg_std = gapminder.groupby('continent')['lifeExp'].agg(np.mean) print ( agg_mean ) print ( agg_std )
continent Africa 48.865330 Americas 64.658737 Asia 60.064903 Europe 71.903686 Oceania 74.326208 Name: lifeExp, dtype: float64 continent Africa 48.865330 Americas 64.658737 Asia 60.064903 Europe 71.903686 Oceania 74.326208 Name: lifeExp, dtype: float64
1.3.7 using ONE user_defined_fn
def my_mean(values): n = len( values ) s= np.sum(values) return s/n my_mean_life = gapminder.groupby('continent')['lifeExp'].agg(my_mean) print ( my_mean_life )
continent Africa 48.865330 Americas 64.658737 Asia 60.064903 Europe 71.903686 Oceania 74.326208 Name: lifeExp, dtype: float64
1.3.8 using MULTIPLE user_defined_fn
Functions you want to apply, forms a list/dict as para funcs in .aggregate(funcs=...) .
this function will return a DataFrame, each function create a new column in result DataFrame.
if aggregate takse a dict of (key='colName': value=fn), it will apply certain funcion on certain column
if not 'colName' not exist in DataFrame who call aggregate, it will create a new one with column name given by dict.key
The code here, is Series.aggregate({dict}) so all funcions all applied element-wisely on this column, and create 3 column.
DataFrame.aggregate({dict}) can apply different function on different row.
multi_agg = gapminder.groupby('continent')['lifeExp'].aggregate({ 'ncount' : np.count_nonzero, 'mean' : np.mean, 'std' : np.std }).rename(columns = {'ncount': 'count', 'mean' : 'avg', 'std' : 'std_dev' }).reset_index() print ( multi_agg ) print ( type(multi_agg ) )
continent count avg std_dev 0 Africa 624.0 48.865330 9.150210 1 Americas 300.0 64.658737 9.345088 2 Asia 396.0 60.064903 11.864532 3 Europe 360.0 71.903686 5.433178 4 Oceania 24.0 74.326208 3.795611 <class 'pandas.core.frame.DataFrame'>
1.4 apply() vs. aggregate() vs. transform
| apply(fn) | aggregate(fn) | transform(fn) | |
|---|---|---|---|
| on df | fn on each row/col | <"" | fn on each element |
| on series | fn on each element | <"" | fn on each element |
| on groupedDf | fn on each row/col of each groupe | <"" | fn on each element |
| on groupedSeries | fn on each element of row of group | <"" | fn on each element |
1.4.1 apply vs. aggregate
apply() and aggregate() are used on groupedDataFrame or groupedSeries
apply() vs. aggregate()
df -> groupby -> groupedDf -> ['colName'] -> groupedCol -> ~apply(fn)~ =>
Series(
fn(colGrp1),
fn(colGrp2),
fn(colGrp3),
...
)
df -> groupby -> groupedDf -> ['colName'] -> groupedCol -> ~aggregate({fn1, fn2})~ =>
DataFrame(
series( fn1(colGrp1), fn2(colGrp2), ...),
series( fn2(colGrp1), fn2(colGrp2), ...)
)
1.4.2 apply vs. transform
dfx = pd.DataFrame( np.random.randn(10, 3), columns=['A', 'B', 'C'], index = pd.date_range('1/1/2000', periods=10)) transformed_dfx = dfx['A'].transform(lambda x: (x-x.mean())/x.std()) print ( transformed_dfx)
2000-01-01 0.649323 2000-01-02 1.490414 2000-01-03 0.782475 2000-01-04 -0.204436 2000-01-05 -1.812279 2000-01-06 -1.231838 2000-01-07 0.240099 2000-01-08 -0.694474 2000-01-09 0.558441 2000-01-10 0.222276 Freq: D, Name: A, dtype: float64