UP | HOME

Pandas-笔记3

Table of Contents

Table of Contents

1 group by, part-VII

1.1 read in the data from gapminder.tsv by setup para delimiter

import pandas as pd
file_path = '/home/yiddi/worklap/PythonML/Shan-HungWu DL/Pandas/scipy-2017-tutorial-pandas/data/'
gapminder = pd.read_csv(file_path+'gapminder.tsv', delimiter='\t')
print ( gapminder )
country continent  year  lifeExp       pop    gdpPercap
0     Afghanistan      Asia  1952   28.801   8425333   779.445314
1     Afghanistan      Asia  1957   30.332   9240934   820.853030
2     Afghanistan      Asia  1962   31.997  10267083   853.100710
3     Afghanistan      Asia  1967   34.020  11537966   836.197138
4     Afghanistan      Asia  1972   36.088  13079460   739.981106
5     Afghanistan      Asia  1977   38.438  14880372   786.113360
6     Afghanistan      Asia  1982   39.854  12881816   978.011439
7     Afghanistan      Asia  1987   40.822  13867957   852.395945
8     Afghanistan      Asia  1992   41.674  16317921   649.341395
9     Afghanistan      Asia  1997   41.763  22227415   635.341351
10    Afghanistan      Asia  2002   42.129  25268405   726.734055
11    Afghanistan      Asia  2007   43.828  31889923   974.580338
12        Albania    Europe  1952   55.230   1282697  1601.056136
13        Albania    Europe  1957   59.280   1476505  1942.284244
14        Albania    Europe  1962   64.820   1728137  2312.888958
15        Albania    Europe  1967   66.220   1984060  2760.196931
16        Albania    Europe  1972   67.690   2263554  3313.422188
17        Albania    Europe  1977   68.930   2509048  3533.003910
18        Albania    Europe  1982   70.420   2780097  3630.880722
19        Albania    Europe  1987   72.000   3075321  3738.932735
20        Albania    Europe  1992   71.581   3326498  2497.437901
21        Albania    Europe  1997   72.950   3428038  3193.054604
22        Albania    Europe  2002   75.651   3508512  4604.211737
23        Albania    Europe  2007   76.423   3600523  5937.029526
24        Algeria    Africa  1952   43.077   9279525  2449.008185
25        Algeria    Africa  1957   45.685  10270856  3013.976023
26        Algeria    Africa  1962   48.303  11000948  2550.816880
27        Algeria    Africa  1967   51.407  12760499  3246.991771
28        Algeria    Africa  1972   54.518  14760787  4182.663766
29        Algeria    Africa  1977   58.014  17152804  4910.416756
...       ...   ...      ...       ...          ...
1674  Yemen, Rep.      Asia  1982   49.113   9657618  1977.557010
1675  Yemen, Rep.      Asia  1987   52.922  11219340  1971.741538
1676  Yemen, Rep.      Asia  1992   55.599  13367997  1879.496673
1677  Yemen, Rep.      Asia  1997   58.020  15826497  2117.484526
1678  Yemen, Rep.      Asia  2002   60.308  18701257  2234.820827
1679  Yemen, Rep.      Asia  2007   62.698  22211743  2280.769906
1680       Zambia    Africa  1952   42.038   2672000  1147.388831
1681       Zambia    Africa  1957   44.077   3016000  1311.956766
1682       Zambia    Africa  1962   46.023   3421000  1452.725766
1683       Zambia    Africa  1967   47.768   3900000  1777.077318
1684       Zambia    Africa  1972   50.107   4506497  1773.498265
1685       Zambia    Africa  1977   51.386   5216550  1588.688299
1686       Zambia    Africa  1982   51.821   6100407  1408.678565
1687       Zambia    Africa  1987   50.821   7272406  1213.315116
1688       Zambia    Africa  1992   46.100   8381163  1210.884633
1689       Zambia    Africa  1997   40.238   9417789  1071.353818
1690       Zambia    Africa  2002   39.193  10595811  1071.613938
1691       Zambia    Africa  2007   42.384  11746035  1271.211593
1692     Zimbabwe    Africa  1952   48.451   3080907   406.884115
1693     Zimbabwe    Africa  1957   50.469   3646340   518.764268
1694     Zimbabwe    Africa  1962   52.358   4277736   527.272182
1695     Zimbabwe    Africa  1967   53.995   4995432   569.795071
1696     Zimbabwe    Africa  1972   55.635   5861135   799.362176
1697     Zimbabwe    Africa  1977   57.674   6642107   685.587682
1698     Zimbabwe    Africa  1982   60.363   7636524   788.855041
1699     Zimbabwe    Africa  1987   62.351   9216418   706.157306
1700     Zimbabwe    Africa  1992   60.377  10704340   693.420786
1701     Zimbabwe    Africa  1997   46.809  11404948   792.449960
1702     Zimbabwe    Africa  2002   39.989  11926563   672.038623
1703     Zimbabwe    Africa  2007   43.487  12311143   469.709298

[1704 rows x 6 columns]

1.2 Series/DataFrame.aggregate()

DataFrame.aggregate(func, axis=0, *args, **kwargs)
# func:
#       function
#       list of functions
#       dict of column names -> functions

1.2.1 apply on Series

Functions you want to apply, forms a list/dict as para funcs in .aggregate(funcs=...) . this function will return a DataFrame, each function create a new column in result DataFrame.

if aggregate takse a dict of (key='colName': value=fn), it will apply certain funcion on certain column if not 'colName' not exist in DataFrame who call aggregate, it will create a new one with column name given by dict.key

The code here, is Series.aggregate({dict}) so all funcions all applied element-wisely on this column, and create 3 column.

multi_agg = gapminder.groupby('continent')['lifeExp'].aggregate({
    'ncount' : np.count_nonzero,
    'mean' : np.mean,
    'std' : np.std
}).rename(columns = {'ncount': 'count',
                     'mean' : 'avg',
                     'std' : 'std_dev'
}).reset_index()
print ( multi_agg )

1.2.2 apply on DataFrame

DataFrame.aggregate({dict}) can apply different function on different row.

1.3 gropy by + mathematical operation

  • count
  • siez
  • mean
  • std
  • min
  • quantile( q=0.25 )
  • max
  • sum
  • var
  • sem
  • describe <<< : important! This method will give you the [count, mean, std, min, 25%, 50%, 75%, max] of the column value for each row
  • first
  • last
  • nth

1.3.1 using built-in mathematical function for Series

1.3.2 groupby + mean()

print ( type(gapminder.groupby('year')) )
gapminder.groupby('year')['lifeExp'].mean()

1.3.3 loc_filter + mean()

y1952 = gapminder.loc[gapminder['year'] == 1952, :]
print ( y1952['lifeExp'] )
print ( y1952['lifeExp'].mean() )
0       28.801
12      55.230
24      43.077
36      30.015
48      62.485
60      69.120
72      66.800
84      50.939
96      37.484
108     68.000
120     38.223
132     40.414
144     53.820
156     47.622
168     50.917
180     59.600
192     31.975
204     39.031
216     39.417
228     38.523
240     68.750
252     35.463
264     38.092
276     54.745
288     44.000
300     50.643
312     40.715
324     39.143
336     42.111
348     57.206
         ...
1344    30.331
1356    60.396
1368    64.360
1380    65.570
1392    32.978
1404    45.009
1416    64.940
1428    57.593
1440    38.635
1452    41.407
1464    71.860
1476    69.620
1488    45.883
1500    58.500
1512    41.215
1524    50.848
1536    38.596
1548    59.100
1560    44.600
1572    43.585
1584    39.978
1596    69.180
1608    68.440
1620    66.071
1632    55.088
1644    40.412
1656    43.160
1668    32.548
1680    42.038
1692    48.451
Name: lifeExp, Length: 142, dtype: float64
49.057619718309866

1.3.4 groupby + describe()

print ( gapminder.groupby('year')['lifeExp'].describe() )
count       mean        std     min       25%      50%       75%     max
year
1952  142.0  49.057620  12.225956  28.801  39.05900  45.1355  59.76500  72.670
1957  142.0  51.507401  12.231286  30.332  41.24750  48.3605  63.03675  73.470
1962  142.0  53.609249  12.097245  31.997  43.46850  50.8810  65.23450  73.680
1967  142.0  55.678290  11.718858  34.020  46.03375  53.8250  67.41950  74.160
1972  142.0  57.647386  11.381953  35.400  48.50025  56.5300  69.24750  74.720
1977  142.0  59.570157  11.227229  31.220  50.47550  59.6720  70.38250  76.110
1982  142.0  61.533197  10.770618  38.445  52.94000  62.4415  70.92125  77.110
1987  142.0  63.212613  10.556285  39.906  54.94075  65.8340  71.87725  78.670
1992  142.0  64.160338  11.227380  23.599  56.12175  67.7030  72.58250  79.360
1997  142.0  65.014676  11.559439  36.087  55.63375  69.3940  74.16975  80.690
2002  142.0  65.694923  12.279823  39.193  55.52225  70.8255  75.45925  82.000
2007  142.0  67.007423  12.073021  39.613  57.16025  71.9355  76.41325  82.603

1.3.5 using user-define function by agg(user_defined_fn), other than the built-in

DataFrame.aggregate(func, axis=0, *args, **kwargs)
# func:
#       function
#       list of functions
#       dict of column names -> functions

1.3.6 using function of numpy

import numpy as np
agg_mean = gapminder.groupby('continent')['lifeExp'].agg(np.mean)
agg_std = gapminder.groupby('continent')['lifeExp'].agg(np.mean)
print ( agg_mean )
print ( agg_std )
continent
Africa      48.865330
Americas    64.658737
Asia        60.064903
Europe      71.903686
Oceania     74.326208
Name: lifeExp, dtype: float64
continent
Africa      48.865330
Americas    64.658737
Asia        60.064903
Europe      71.903686
Oceania     74.326208
Name: lifeExp, dtype: float64

1.3.7 using ONE user_defined_fn

def my_mean(values):
    n = len( values )
    s= np.sum(values)
    return s/n

my_mean_life = gapminder.groupby('continent')['lifeExp'].agg(my_mean)
print ( my_mean_life )
continent
Africa      48.865330
Americas    64.658737
Asia        60.064903
Europe      71.903686
Oceania     74.326208
Name: lifeExp, dtype: float64

1.3.8 using MULTIPLE user_defined_fn

Functions you want to apply, forms a list/dict as para funcs in .aggregate(funcs=...) . this function will return a DataFrame, each function create a new column in result DataFrame.

if aggregate takse a dict of (key='colName': value=fn), it will apply certain funcion on certain column if not 'colName' not exist in DataFrame who call aggregate, it will create a new one with column name given by dict.key

The code here, is Series.aggregate({dict}) so all funcions all applied element-wisely on this column, and create 3 column. DataFrame.aggregate({dict}) can apply different function on different row.

multi_agg = gapminder.groupby('continent')['lifeExp'].aggregate({
    'ncount' : np.count_nonzero,
    'mean' : np.mean,
    'std' : np.std
}).rename(columns = {'ncount': 'count',
                     'mean' : 'avg',
                     'std' : 'std_dev'
}).reset_index()
print ( multi_agg )
print ( type(multi_agg ) )
continent  count        avg    std_dev
0    Africa  624.0  48.865330   9.150210
1  Americas  300.0  64.658737   9.345088
2      Asia  396.0  60.064903  11.864532
3    Europe  360.0  71.903686   5.433178
4   Oceania   24.0  74.326208   3.795611
<class 'pandas.core.frame.DataFrame'>

1.4 apply() vs. aggregate() vs. transform

  apply(fn) aggregate(fn) transform(fn)
on df fn on each row/col <"" fn on each element
on series fn on each element <"" fn on each element
on groupedDf fn on each row/col of each groupe <"" fn on each element
on groupedSeries fn on each element of row of group <"" fn on each element

1.4.1 apply vs. aggregate

apply() and aggregate() are used on groupedDataFrame or groupedSeries


apply() vs. aggregate()

df -> groupby -> groupedDf -> ['colName'] -> groupedCol -> ~apply(fn)~ =>
Series(
fn(colGrp1),
fn(colGrp2),
fn(colGrp3),
...
)

df -> groupby -> groupedDf -> ['colName'] -> groupedCol -> ~aggregate({fn1, fn2})~ =>
DataFrame(
series( fn1(colGrp1), fn2(colGrp2), ...),
series( fn2(colGrp1), fn2(colGrp2), ...)
)

1.4.2 apply vs. transform

dfx = pd.DataFrame( np.random.randn(10, 3),
                    columns=['A', 'B', 'C'],
                    index = pd.date_range('1/1/2000', periods=10))
transformed_dfx = dfx['A'].transform(lambda x: (x-x.mean())/x.std())
print ( transformed_dfx)
2000-01-01    0.649323
2000-01-02    1.490414
2000-01-03    0.782475
2000-01-04   -0.204436
2000-01-05   -1.812279
2000-01-06   -1.231838
2000-01-07    0.240099
2000-01-08   -0.694474
2000-01-09    0.558441
2000-01-10    0.222276
Freq: D, Name: A, dtype: float64