Pandas Notes 1
Table of Contents
- 1. Intro to pandas, part-I
- 2. Assemble, part-II
- 3. handle missing value NaN, part-III
  - 3.1. missing value default representation: np.NaN
  - 3.2. missing value checking: pd.isnull(df)
  - 3.3. missing value default representation change by read_csv(na_values)
  - 3.4. count the non-missing values
  - 3.5. count the different values on each row/column
  - 3.6. missing value default representation change by fillna(val)
  - 3.7. missing value default representation change by fillna(method)
  - 3.8. handling the NaN before mathematical operation
  - 3.9. Illustration of fillna(val) and fillna(method=ffill) and fillna(method=bfill)
  - 3.10. replace A, B and C with NaN
  - 3.11. nunique() vs. unique() vs. value_counts() vs. info()
- 4. Tidy Data, part-IV
  - 4.1. tidy your data
    - 4.1.1. non-tidy, exp1: enumerate all the sub-range as column name
    - 4.1.2. make horizontal spreading vertical by pd.melt
    - 4.1.3. Illustration of pd.melt
    - 4.1.4. non-tidy, exp2: enumerate all the time-period as column name
    - 4.1.5. make horizontal spreading vertical by pd.melt
    - 4.1.6. non-tidy, exp3: enumerate all the age-range as column name
    - 4.1.7. tidy operation step 1: Series.str + melt = get new column
    - 4.1.8. tidy operation step 2: assign new column to original dataframe
    - 4.1.9. Series.str model
    - 4.1.10. a general method using series.str + df.melt
    - 4.1.11. combine multiple .str model expressions
    - 4.1.12. .str.split('_', expand=True)
    - 4.1.13. how to rename the column name of a dataframe
    - 4.1.14. non-tidy, exp4: enumerate all days in a month as column name
    - 4.1.15. 0. read in dataset to be a dataframe
    - 4.1.16. 1. melt all columns d1 to d31 as values of column day
    - 4.1.17. 2. tmax and tmin the column name
    - 4.1.18. what is pd.pivot_table
    - 4.1.19. pivot_table and melt
- 5. Data types, part-V
1 Intro to pandas, part-I
import pandas as pd
print (pd.__version__)

# read_csv separates columns by ',' by default; this can be changed with the 2nd (delimiter) parameter
gap_path = '/home/yiddi/worklap/PythonML/Shan-HungWu DL/Pandas/scipy-2017-tutorial-pandas/data/gapminder.tsv'
df = pd.read_csv(gap_path, delimiter='\t')
print (df)
print (type(df))
0.20.3 country continent year lifeExp pop gdpPercap 0 Afghanistan Asia 1952 28.801 8425333 779.445314 1 Afghanistan Asia 1957 30.332 9240934 820.853030 2 Afghanistan Asia 1962 31.997 10267083 853.100710 3 Afghanistan Asia 1967 34.020 11537966 836.197138 4 Afghanistan Asia 1972 36.088 13079460 739.981106 5 Afghanistan Asia 1977 38.438 14880372 786.113360 6 Afghanistan Asia 1982 39.854 12881816 978.011439 7 Afghanistan Asia 1987 40.822 13867957 852.395945 8 Afghanistan Asia 1992 41.674 16317921 649.341395 9 Afghanistan Asia 1997 41.763 22227415 635.341351 10 Afghanistan Asia 2002 42.129 25268405 726.734055 11 Afghanistan Asia 2007 43.828 31889923 974.580338 12 Albania Europe 1952 55.230 1282697 1601.056136 13 Albania Europe 1957 59.280 1476505 1942.284244 14 Albania Europe 1962 64.820 1728137 2312.888958 15 Albania Europe 1967 66.220 1984060 2760.196931 16 Albania Europe 1972 67.690 2263554 3313.422188 17 Albania Europe 1977 68.930 2509048 3533.003910 18 Albania Europe 1982 70.420 2780097 3630.880722 19 Albania Europe 1987 72.000 3075321 3738.932735 20 Albania Europe 1992 71.581 3326498 2497.437901 21 Albania Europe 1997 72.950 3428038 3193.054604 22 Albania Europe 2002 75.651 3508512 4604.211737 23 Albania Europe 2007 76.423 3600523 5937.029526 24 Algeria Africa 1952 43.077 9279525 2449.008185 25 Algeria Africa 1957 45.685 10270856 3013.976023 26 Algeria Africa 1962 48.303 11000948 2550.816880 27 Algeria Africa 1967 51.407 12760499 3246.991771 28 Algeria Africa 1972 54.518 14760787 4182.663766 29 Algeria Africa 1977 58.014 17152804 4910.416756 ... ... ... ... ... ... 1674 Yemen, Rep. Asia 1982 49.113 9657618 1977.557010 1675 Yemen, Rep. Asia 1987 52.922 11219340 1971.741538 1676 Yemen, Rep. Asia 1992 55.599 13367997 1879.496673 1677 Yemen, Rep. Asia 1997 58.020 15826497 2117.484526 1678 Yemen, Rep. Asia 2002 60.308 18701257 2234.820827 1679 Yemen, Rep. Asia 2007 62.698 22211743 2280.769906 1680 Zambia Africa 1952 42.038 2672000 1147.388831 1681 Zambia Africa 1957 44.077 3016000 1311.956766 1682 Zambia Africa 1962 46.023 3421000 1452.725766 1683 Zambia Africa 1967 47.768 3900000 1777.077318 1684 Zambia Africa 1972 50.107 4506497 1773.498265 1685 Zambia Africa 1977 51.386 5216550 1588.688299 1686 Zambia Africa 1982 51.821 6100407 1408.678565 1687 Zambia Africa 1987 50.821 7272406 1213.315116 1688 Zambia Africa 1992 46.100 8381163 1210.884633 1689 Zambia Africa 1997 40.238 9417789 1071.353818 1690 Zambia Africa 2002 39.193 10595811 1071.613938 1691 Zambia Africa 2007 42.384 11746035 1271.211593 1692 Zimbabwe Africa 1952 48.451 3080907 406.884115 1693 Zimbabwe Africa 1957 50.469 3646340 518.764268 1694 Zimbabwe Africa 1962 52.358 4277736 527.272182 1695 Zimbabwe Africa 1967 53.995 4995432 569.795071 1696 Zimbabwe Africa 1972 55.635 5861135 799.362176 1697 Zimbabwe Africa 1977 57.674 6642107 685.587682 1698 Zimbabwe Africa 1982 60.363 7636524 788.855041 1699 Zimbabwe Africa 1987 62.351 9216418 706.157306 1700 Zimbabwe Africa 1992 60.377 10704340 693.420786 1701 Zimbabwe Africa 1997 46.809 11404948 792.449960 1702 Zimbabwe Africa 2002 39.989 11926563 672.038623 1703 Zimbabwe Africa 2007 43.487 12311143 469.709298 [1704 rows x 6 columns] <class 'pandas.core.frame.DataFrame'>
# shape is an attribute, not a method
print (df.shape)
# info() gives you information about the dataframe
print (df.info())
# pull a row/column from the dataframe
df['country']
(1704, 6) <class 'pandas.core.frame.DataFrame'> RangeIndex: 1704 entries, 0 to 1703 Data columns (total 6 columns): country 1704 non-null object continent 1704 non-null object year 1704 non-null int64 lifeExp 1704 non-null float64 pop 1704 non-null int64 gdpPercap 1704 non-null float64 dtypes: float64(2), int64(2), object(2) memory usage: 80.0+ KB None
1.0.1 pull out columns
ONE row or column pulled out from a DataFrame is of type pandas.core.series.Series.
Multiple rows or columns pulled out from a DataFrame are of type pandas.core.frame.DataFrame.
# pull a row/column from the dataframe
# pull one column: you get a Series
# df['country'] and df.country are equivalent
country_df = df['country']
# print (country_df)
# print (df.country)
print (type(df['country']))
print (type(df.country))

# pull multiple columns: you get a DataFrame
# df.head() returns the first 5 rows of the DataFrame
multitest = df[['country', 'continent', 'year']].head()
print (multitest)
print (type(multitest))

# to see all the columns of a df
print (df.columns)
# method 1: del
del df['country']
# del df.country   # ERROR, does not work
print (df.columns)
# method 2: drop
# inplace=True makes the column disappear from the original df
# df.drop('continent', axis=1, inplace=True)
# with inplace=False (the default) the original df is not modified
df.drop('continent', axis=1)
print (df.columns)
drop( 'row/col Name', axis=0 ) ===> drop row
drop( 'row/col Name', axis=1 ) ===> drop column
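A quick sketch of both axes; the toy DataFrame below is made up just for illustration.

# a small made-up DataFrame to show drop on both axes
toy = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['r0', 'r1', 'r2'])
print ( toy.drop('r1', axis=0) )  # drop the row labeled 'r1'
print ( toy.drop('B', axis=1) )   # drop the column named 'B'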
1.0.2 pull out rows
ONE row or column pulled out from a DataFrame is of type pandas.core.series.Series; multiple rows or columns pulled out from a DataFrame are of type pandas.core.frame.DataFrame.
1.0.3 pull out ONE row
df = pd.read_csv(gap_path, delimiter='\t')
print (type(df.loc[0]))
print (df.loc[0])
print (df.loc[99])
# print (df.loc[-1])  # ERROR, '-1' is not a label
<class 'pandas.core.series.Series'> country Afghanistan continent Asia year 1952 lifeExp 28.801 pop 8425333 gdpPercap 779.445 Name: 0, dtype: object country Bangladesh continent Asia year 1967 lifeExp 43.453 pop 62821884 gdpPercap 721.186 Name: 99, dtype: object
loc will ONLY pull out rows by their label; by default the DataFrame labels each row from 0 up to the last row number. If we want to index rows like slicing a python list, we use iloc instead.

- loc[i]: i is the label of the row in the DataFrame
  - sorting will not change the label, so be careful when you use loc[i] after sorting (see the sketch after this list)
- iloc[i]: i is the location of the row in the DataFrame
  - sorting will change the location, so be careful when you use iloc[i] after sorting
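A minimal sketch of the sorting caveat above, assuming df is the gapminder DataFrame loaded earlier:

# sort by lifeExp: labels stick to their rows, locations are renumbered
sorted_df = df.sort_values('lifeExp')
print ( sorted_df.loc[0] )   # still the row originally labeled 0
print ( sorted_df.iloc[0] )  # the row that is now physically first (smallest lifeExp)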
print (type(df.iloc[0]))
print (df.iloc[0])
print (df.iloc[-1])  # RIGHT, '-1' is a location
print (df.ix[0])
country Afghanistan continent Asia year 1952 lifeExp 28.801 pop 8425333 gdpPercap 779.445 Name: 0, dtype: object
1.0.4 pull out MANY rows
print (type(df.ix[[0, 99, 999]]))
print (df.ix[[0, 99, 999]])
<class 'pandas.core.frame.DataFrame'> country continent year lifeExp pop gdpPercap 0 Afghanistan Asia 1952 28.801 8425333 779.445314 99 Bangladesh Asia 1967 43.453 62821884 721.186086 999 Mongolia Asia 1967 51.253 1149500 1226.041130 country lifeExp pop 0 Afghanistan 28.801 8425333 99 Bangladesh 43.453 62821884 999 Mongolia 51.253 1149500
1.0.5 pull out sub-dataframe
print (df.ix[[0, 99, 999], ['country', 'lifeExp', 'pop']])
print (df.loc[[0, 99, 999], ['country', 'lifeExp', 'pop']])
country lifeExp pop 0 Afghanistan 28.801 8425333 99 Bangladesh 43.453 62821884 999 Mongolia 51.253 1149500 country lifeExp pop 0 Afghanistan 28.801 8425333 99 Bangladesh 43.453 62821884 999 Mongolia 51.253 1149500
print (df.iloc[[0, 99, 999], [0, 3, 5]]) # RIGHT, location
country lifeExp gdpPercap 0 Afghanistan 28.801 779.445314 99 Bangladesh 43.453 721.186086 999 Mongolia 51.253 1226.041130
print (df.iloc[[0, 99, 999], ['country', 'lifeExp', 'pop']]) # ERROR, iloc takes locations, not labels
Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/yiddi/anaconda3/envs/tensorflow/lib/python3.6/site-packages/pandas/core/indexing.py", line 1325, in __getitem__ return self._getitem_tuple(key) File "/home/yiddi/anaconda3/envs/tensorflow/lib/python3.6/site-packages/pandas/core/indexing.py", line 1662, in _getitem_tuple self._has_valid_tuple(tup) File "/home/yiddi/anaconda3/envs/tensorflow/lib/python3.6/site-packages/pandas/core/indexing.py", line 189, in _has_valid_tuple if not self._has_valid_type(k, i): File "/home/yiddi/anaconda3/envs/tensorflow/lib/python3.6/site-packages/pandas/core/indexing.py", line 1599, in _has_valid_type return self._is_valid_list_like(key, axis) File "/home/yiddi/anaconda3/envs/tensorflow/lib/python3.6/site-packages/pandas/core/indexing.py", line 1655, in _is_valid_list_like (arr.max() >= l or arr.min() < -l)): File "/home/yiddi/anaconda3/envs/tensorflow/lib/python3.6/site-packages/numpy/core/_methods.py", line 26, in _amax return umr_maximum(a, axis, None, out, keepdims) TypeError: cannot perform reduce with flexible type
1.0.6 mathematical operation on row/column/subDF
a single row/column is of type Series
- <Series>.mean() returns a single value
multiple rows/columns/a subDF are of type DataFrame
- <DataFrame>.mean() returns a Series (one value per column)
le_mean = df['lifeExp'].mean()
print (le_mean)
59.474439366197174
1.0.7 slicing and boolean index as filter in loc
This does the same thing as a mask in Numpy, e.g. g[ g[3] % 3 == 0 ].

df.loc[ df['lifeExp'] > le_mean, : ] means: get the subDF
- whose rows have a lifeExp value larger than le_mean;
- whose columns have no constraint.
df.loc[ df['lifeExp'] > le_mean, : ]
country continent year lifeExp pop gdpPercap 14 Albania Europe 1962 64.820 1728137 2312.888958 15 Albania Europe 1967 66.220 1984060 2760.196931 16 Albania Europe 1972 67.690 2263554 3313.422188 17 Albania Europe 1977 68.930 2509048 3533.003910 18 Albania Europe 1982 70.420 2780097 3630.880722 19 Albania Europe 1987 72.000 3075321 3738.932735 20 Albania Europe 1992 71.581 3326498 2497.437901 21 Albania Europe 1997 72.950 3428038 3193.054604 22 Albania Europe 2002 75.651 3508512 4604.211737 23 Albania Europe 2007 76.423 3600523 5937.029526 30 Algeria Africa 1982 61.368 20033753 5745.160213 31 Algeria Africa 1987 65.799 23254956 5681.358539 32 Algeria Africa 1992 67.744 26298373 5023.216647 33 Algeria Africa 1997 69.152 29072015 4797.295051 34 Algeria Africa 2002 70.994 31287142 5288.040382 35 Algeria Africa 2007 72.301 33333216 6223.367465 48 Argentina Americas 1952 62.485 17876956 5911.315053 49 Argentina Americas 1957 64.399 19610538 6856.856212 50 Argentina Americas 1962 65.142 21283783 7133.166023 51 Argentina Americas 1967 65.634 22934225 8052.953021 52 Argentina Americas 1972 67.065 24779799 9443.038526 53 Argentina Americas 1977 68.481 26983828 10079.026740 54 Argentina Americas 1982 69.942 29341374 8997.897412 55 Argentina Americas 1987 70.774 31620918 9139.671389 56 Argentina Americas 1992 71.868 33958947 9308.418710 57 Argentina Americas 1997 73.275 36203463 10967.281950 58 Argentina Americas 2002 74.340 38331121 8797.640716 59 Argentina Americas 2007 75.320 40301927 12779.379640 60 Australia Oceania 1952 69.120 8691212 10039.595640 61 Australia Oceania 1957 70.330 9712569 10949.649590 ... ... ... ... ... ... 1629 Uruguay Americas 1997 74.223 3262838 9230.240708 1630 Uruguay Americas 2002 75.307 3363085 7727.002004 1631 Uruguay Americas 2007 76.384 3447496 10611.462990 1634 Venezuela Americas 1962 60.770 8143375 8422.974165 1635 Venezuela Americas 1967 63.479 9709552 9541.474188 1636 Venezuela Americas 1972 65.712 11515649 10505.259660 1637 Venezuela Americas 1977 67.456 13503563 13143.950950 1638 Venezuela Americas 1982 68.557 15620766 11152.410110 1639 Venezuela Americas 1987 70.190 17910182 9883.584648 1640 Venezuela Americas 1992 71.150 20265563 10733.926310 1641 Venezuela Americas 1997 72.146 22374398 10165.495180 1642 Venezuela Americas 2002 72.766 24287670 8605.047831 1643 Venezuela Americas 2007 73.747 26084662 11415.805690 1651 Vietnam Asia 1987 62.820 62826491 820.799445 1652 Vietnam Asia 1992 67.662 69940728 989.023149 1653 Vietnam Asia 1997 70.672 76048996 1385.896769 1654 Vietnam Asia 2002 73.017 80908147 1764.456677 1655 Vietnam Asia 2007 74.249 85262356 2441.576404 1661 West Bank and Gaza Asia 1977 60.765 1261091 3682.831494 1662 West Bank and Gaza Asia 1982 64.406 1425876 4336.032082 1663 West Bank and Gaza Asia 1987 67.046 1691210 5107.197384 1664 West Bank and Gaza Asia 1992 69.718 2104779 6017.654756 1665 West Bank and Gaza Asia 1997 71.096 2826046 7110.667619 1666 West Bank and Gaza Asia 2002 72.370 3389578 4515.487575 1667 West Bank and Gaza Asia 2007 73.422 4018332 3025.349798 1678 Yemen, Rep. Asia 2002 60.308 18701257 2234.820827 1679 Yemen, Rep. Asia 2007 62.698 22211743 2280.769906 1698 Zimbabwe Africa 1982 60.363 7636524 788.855041 1699 Zimbabwe Africa 1987 62.351 9216418 706.157306 1700 Zimbabwe Africa 1992 60.377 10704340 693.420786 [895 rows x 6 columns]
1.1 group and aggregate statistics
1.1.1 groupby + index + mean
- Split the data into different parts.
- groupby returns a DataFrameGroupBy object.
- Pulling out rows/columns from a DataFrameGroupBy object works the same way as on a DataFrame.
- Pulling out a row/column from a DataFrameGroupBy object returns a SeriesGroupBy.
- Series.mean/std/max returns a single value.
- SeriesGroupBy.mean/std/max returns a Series (somewhat like a list).
- DataFrameGroupBy is a group of DataFrames.
- SeriesGroupBy is a group of Series.
- groupby(<fn(element)>) => pandas.core.groupby.DataFrameGroupBy: collects all rows with the same fn(element) value.
  - Scala has a similar method groupBy(<fn(element)>), which gives back a map whose keys are the fn(element) values and whose values are all elements with the same fn(element) value.
# groupby 1 column, indexing 1 column
print ( type (df.groupby('year')['lifeExp']))
print ( type (df.groupby('year')['lifeExp'].mean()))
<class 'pandas.core.groupby.SeriesGroupBy'> <class 'pandas.core.series.Series'>
# groupby 1 column, indexing 2 columns
print ( type (df.groupby('year')['lifeExp', 'gdpPercap']))
print ( type (df.groupby('year')['lifeExp', 'gdpPercap'].mean()))
<class 'pandas.core.groupby.DataFrameGroupBy'> <class 'pandas.core.frame.DataFrame'>
# groupby 2 columns, indexing 2 columns
print ( type (df.groupby(['year', 'continent'])['lifeExp', 'gdpPercap']))
print ( type (df.groupby(['year', 'continent'])['lifeExp', 'gdpPercap'].mean()))
print ( df.groupby(['year', 'continent'])['lifeExp', 'gdpPercap'])
print ( df.groupby(['year', 'continent'])['lifeExp', 'gdpPercap'].mean().reset_index())
lifeExp gdpPercap year continent 1952 Africa 39.135500 1252.572466 Americas 53.279840 4079.062552 Asia 46.314394 5195.484004 Europe 64.408500 5661.057435 year continent lifeExp gdpPercap 0 1952 Africa 39.135500 1252.572466 1 1952 Americas 53.279840 4079.062552 2 1952 Asia 46.314394 5195.484004 3 1952 Europe 64.408500 5661.057435 4 1952 Oceania 69.255000 10298.085650 5 1957 Africa 41.266346 1385.236062 6 1957 Americas 55.960280 4616.043733 7 1957 Asia 49.318544 5787.732940 8 1957 Europe 66.703067 6963.012816 9 1957 Oceania 70.295000 11598.522455 10 1962 Africa 43.319442 1598.078825 11 1962 Americas 58.398760 4901.541870 12 1962 Asia 51.563223 5729.369625 13 1962 Europe 68.539233 8365.486814 14 1962 Oceania 71.085000 12696.452430 15 1967 Africa 45.334538 2050.363801 16 1967 Americas 60.410920 5668.253496 17 1967 Asia 54.663640 5971.173374 18 1967 Europe 69.737600 10143.823757 19 1967 Oceania 71.310000 14495.021790 20 1972 Africa 47.450942 2339.615674 21 1972 Americas 62.394920 6491.334139 22 1972 Asia 57.319269 8187.468699 23 1972 Europe 70.775033 12479.575246 24 1972 Oceania 71.910000 16417.333380 25 1977 Africa 49.580423 2585.938508 26 1977 Americas 64.391560 7352.007126 27 1977 Asia 59.610556 7791.314020 28 1977 Europe 71.937767 14283.979110 29 1977 Oceania 72.855000 17283.957605 30 1982 Africa 51.592865 2481.592960 31 1982 Americas 66.228840 7506.737088 32 1982 Asia 62.617939 7434.135157 33 1982 Europe 72.806400 15617.896551 34 1982 Oceania 74.290000 18554.709840 35 1987 Africa 53.344788 2282.668991 36 1987 Americas 68.090720 7793.400261 37 1987 Asia 64.851182 7608.226508 38 1987 Europe 73.642167 17214.310727 39 1987 Oceania 75.320000 20448.040160 40 1992 Africa 53.629577 2281.810333 41 1992 Americas 69.568360 8044.934406 42 1992 Asia 66.537212 8639.690248 43 1992 Europe 74.440100 17061.568084 44 1992 Oceania 76.945000 20894.045885 45 1997 Africa 53.598269 2378.759555 46 1997 Americas 71.150480 8889.300863 47 1997 Asia 68.020515 9834.093295 48 1997 Europe 75.505167 19076.781802 49 1997 Oceania 78.190000 24024.175170 50 2002 Africa 53.325231 2599.385159 51 2002 Americas 72.422040 9287.677107 52 2002 Asia 69.233879 10174.090397 53 2002 Europe 76.700600 21711.732422 54 2002 Oceania 79.740000 26938.778040 55 2007 Africa 54.806038 3089.032605 56 2007 Americas 73.608120 11003.031625 57 2007 Asia 70.728485 12473.026870 58 2007 Europe 77.648600 25054.481636 59 2007 Oceania 80.719500 29810.188275
1.1.2 Illustration of groupby + index + mean 1
. df.groupby('year') . ================= . Afghanistan Asia 1952 28.801 8425333 779.4453145--------->|Afghanistan Asia 1952 ... | . Afghanistan Asia 1957 30.332 9240934 820.8530296 |Albania Europe 1952 ... | . ... |Angola Africa 1952 ... | . Albania Europe 1952 55.23 1282697 1601.056136------------->|Australia Oceania 1952 ... | . Albania Europe 1957 59.28 1476505 1942.284244 ^ ^ . ... | | . Angola Africa 1952 30.015 4232095 3520.610273---------------------+ | . Angola Africa 1957 31.999 4561361 3827.940465 | . ... | . Australia Oceania 1952 69.12 8691212 10039.59564-------------------------------------+ . Australia Oceania 1957 70.33 9712569 10949.64959 . df.groupby('year') ['year'] . mean() . ================================================================================== . DataFrameGroupBy SeriesGroupBy Series . | | | . v v v . |-------------+---------+------+-----| . |group1 | . |-------------+---------+------+-----| +------+ . | Afghanistan | Asia | 1952 | ... | | 1952 | +------+ . | Albania | Europe | 1952 | ... | | 1952 | =====> | 1952 | . | Angola | Africa | 1952 | ... | | 1952 | | | . | Australia | Oceania | 1952 | ... | | 1952 | | | . |-------------+---------+------+-----| +------+ | | . |group2 | | | . |-------------+---------+------+-----| ===> +------+ | | . | Afghanistan | Asia | 1957 | ... | | 1957 | | | . | Albania | Europe | 1957 | ... | | 1957 | =====> | 1957 | . | Angola | Africa | 1957 | ... | | 1957 | | | . | Australia | Oceania | 1957 | ... | | 1957 | | | . |-------------+---------+------+-----| +------+ | | . |group3 | | | . |-------------+---------+------+-----| +------+ | | . | Afghanistan | Asia | 1959 | ... | | 1959 | | | . | Albania | Europe | 1959 | ... | | 1959 | =====> | 1959 | . | Angola | Africa | 1959 | ... | | 1959 | +------+ . | Australia | Oceania | 1959 | ... | | 1959 | . |-------------+---------+------+-----| +------+ . | ... | .... | ... | ... | | ... |
1.1.3 Illustration of groupby + index + mean 2
*DataFrame* --groupby--> *DataFrameGroupBy* --[indexRow/Column]--> *SeriesGroupBy* --mean--> *Series* ............................................--[indexMultiRows/Columns] --> *DataFrameGroupBy* --mean--> *DataFrame* . DataFrame DataFrameGroupBy SeriesGroupBy or DataFrameGroupBy Series or DataFrame . . +----------+ ++ +-----+ . | | || | | . | | || | | ++ ++ +-----+ +-----+- . +------------------+ | | || | | ++ ++ +-----+ +-----+ . | | +----------+ ++ +-----+ ++ +-----+ . | | +----------+ ++ +-----+ ++ +-----+ . | | groupby | | indexing || or | | mean ++ ++ +-----+ +-----+ . | +--------->| | --------> || | | --------> ++ +-----+ . | | +----------+ ++ +-----+ . | | +----------+ ++ +-----+ . | | | | || | | ++ +-----+ . +------------------+ | | || | | ++ +-----+ . | | || | | . +----------+ ++ +-----+ . +----------+ ++ +-----+ . | | || | | ++ +-----+ . | | || | | ++ +-----+ . | | || | | . +----------+ ++ +-----+ .
1.1.4 reset_index
.reset_index() is used to flatten the index; this is important to keep different expressions compatible.
The DataFrame returned by .reset_index() can be used as a source DataFrame for other expressions.
before using ~.reset_index~, this is a DataFrame with a hierarchical (Multi)Index
---------------------------
                 lifeExp     gdpPercap
year continent
1952 Africa    39.135500   1252.572466
     Americas  53.279840   4079.062552
     Asia      46.314394   5195.484004
     Europe    64.408500   5661.057435
     Oceania   69.255000  10298.085650

after using ~.reset_index~, this is a flat DataFrame
---------------------------
   year continent    lifeExp     gdpPercap
0  1952     Africa  39.135500   1252.572466
1  1952   Americas  53.279840   4079.062552
2  1952       Asia  46.314394   5195.484004
3  1952     Europe  64.408500   5661.057435
4  1952    Oceania  69.255000  10298.085650
5  1957     Africa  41.266346   1385.236062
print ("--------bad format---------") before_reset = df.groupby(['year', 'continent'])['lifeExp', 'gdpPercap'].mean()[:5] print ( before_reset ) print ( type(before_reset ) ) print ("\n") print ("--------good format---------") after_reset= df.groupby(['year', 'continent'])['lifeExp', 'gdpPercap'].mean().reset_index()[:5] print ( after_reset ) print ( type(after_reset ) )
--------bad format--------- lifeExp gdpPercap year continent 1952 Africa 39.135500 1252.572466 Americas 53.279840 4079.062552 Asia 46.314394 5195.484004 Europe 64.408500 5661.057435 Oceania 69.255000 10298.085650 <class 'pandas.core.frame.DataFrame'> --------good format--------- year continent lifeExp gdpPercap 0 1952 Africa 39.135500 1252.572466 1 1952 Americas 53.279840 4079.062552 2 1952 Asia 46.314394 5195.484004 3 1952 Europe 64.408500 5661.057435 4 1952 Oceania 69.255000 10298.085650 <class 'pandas.core.frame.DataFrame'>
1.1.5 groupby + index + nunique
nunique = number of unique values
It counts the number of non-duplicate items in each group after indexing.
In this example, you get how many countries there are in each continent in this file.
df.groupby('continent')['country'].nunique()
continent Africa 52 Americas 25 Asia 33 Europe 30 Oceania 2 Name: country, dtype: int64
1.1.6 groupby + indexing + mean + plot
gyle = df.groupby('year')['lifeExp'].mean()
print (gyle)

import matplotlib.pyplot as plt
gyle.plot()
plt.show()
year 1952 49.057620 1957 51.507401 1962 53.609249 1967 55.678290 1972 57.647386 1977 59.570157 1982 61.533197 1987 63.212613 1992 64.160338 1997 65.014676 2002 65.694923 2007 67.007423 Name: lifeExp, dtype: float64
1.1.7 groupby + indexing + mean + output_to_file
A DataFrame has the APIs to_csv, to_excel, to_sql, ..., and every time you want to save a DF to a file, you should call reset_index() to convert a badly-formatted Series into a well-formatted DataFrame.

Setting the .to_csv parameter index=False will not write the index numbers to the output file.
gyle = df.groupby('year')['lifeExp'].mean()   # Series
new_df = gyle.reset_index()                   # Series --reset_index--> DataFrame
new_df.to_csv('./Pandas/output/lifeExp_by_year.csv')
new_df.to_csv('./Pandas/output/lifeExp_by_year_noIndex.csv', index=False)
1.2 odo : shapeshifting for your data
With this python package, you can convert your data from/to almost any format.
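A minimal sketch, assuming the odo package is installed; the file names here are placeholders.

from odo import odo
import pandas as pd

# convert a csv file straight into a DataFrame ...
df_acc = odo('accounts.csv', pd.DataFrame)
# ... and a DataFrame back into a new csv file
odo(df_acc, 'accounts_out.csv')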
1.3 Pandas.DataFrame vs. Numpy.array
DataFrame has label and location | array ONLY has location |
---|---|
pulling out a row is different from pulling out a column | pulling out a row and pulling out a column work the same way |
DOESN'T support full slicing | supports slicing |
- column: <df_name>[['colName1', 'colName2']] | - column: <arr_name>[:, location-num] |
- row: <df_name>.loc[label-num] | - row: <arr_name>[location-num, :] |
- row: <df_name>.iloc[location-num] | - subarray: <arr_name>[1:-1:2, 1:14] |
- subDF: <df_name>.ix[[label-num], ['colName']] | - supports masks, somewhat like a filter |
filter: df[ df['year'] > 1990 ] | filter: arr[ arr[:,2] >= 30 ] |

For example, df.ix[[0, 99, 999], ['country', 'lifeExp', 'pop']] gives:

    country      lifeExp       pop
0   Afghanistan   28.801   8425333
99  Bangladesh    43.453  62821884
999 Mongolia      51.253   1149500

while arr[[0,1,2,3], [1,2,3,4]] gives [ 1 7 13 19] (element-wise fancy indexing), and arr[0:4, 1:5] gives the 4x4 block:

 1  2  3  4
 6  7  8  9
11 12 13 14
16 17 18 19
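A small runnable sketch of the filter row in the table above; the numpy array is made up for illustration, and df is the gapminder DataFrame.

import numpy as np

arr = np.arange(20).reshape(4, 5)        # made-up 4x5 array holding 0..19
print ( arr[ arr[:, 2] >= 10 ] )         # numpy mask: keep rows whose 3rd column >= 10
print ( df[ df['year'] > 1990 ].head() ) # the equivalent pandas filter on the gapminder df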
1.4 loc vs. iloc vs. ix vs. df[]
method | return type | interpretation |
---|---|---|
df[] | Series | ONLY for column; label-based |
df.loc[] | Series/DataFrame | for DataFrame; label-based |
df.ix[] | Series/DataFrame | for DataFrame; label-based |
df.iloc[] | Series/DataFrame | for DataFrame; location-based |
df[0] | ERROR | ERROR |
df['year'] | Series | df[] can ONLY be used to access columns; df[] is also label-based |
df.loc[0] | Series | get 1st row |
df.ix[0] | Series | get 1st row |
df.iloc[0] | Series | get 1st row |
df.loc[1:3] | DataFrame | get 1 ~ 3 rows |
df.ix[1:3] | DataFrame | get 1 ~ 3 rows |
df.iloc[1:3] | DataFrame | get 1 ~ 2 rows |
df.loc[1:3, 1:3] | ERROR | ERROR |
df.ix[1:3, 1:3] | ERROR | ERROR |
df.iloc[1:3, 1:3] | DataFrame | get row1~2,column1~2 |
df.loc[1:3, ['year']] | DataFrame | get row1~3,column'year' |
df.ix[1:3, ['year']] | DataFrame | get row1~3,column'year' |
df.iloc[1:3, ['year']] | ERROR | ERROR |
df.loc[1:3, 'year'] | Series | get row1~3,column'year' |
df.ix[1:3, 'year'] | Series | get row1~3,column'year' |
df.iloc[1:3, 1] | Series | get row1~2, column 'continent' (position 1) |
# print ( df.ix[0] )
# print ( df.loc[0] )
# print ( df.iloc[0] )
# print ( df.ix[1:3] )
# print ( df.loc[1:3] )
# print ( df.iloc[1:3] )
# print ( df.loc[1:3, ['year']] )
country Afghanistan continent Asia year 1957 lifeExp 30.332 pop 9240934 gdpPercap 820.853 Name: 1, dtype: object
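A quick sketch of the inclusive/exclusive slicing difference listed in the table:

# loc slices by label and includes the end label; iloc slices by position and excludes it
print ( len(df.loc[1:3]) )   # 3 rows: labels 1, 2, 3
print ( len(df.iloc[1:3]) )  # 2 rows: positions 1, 2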
2 Assemble, part-II
Datasets often live in different files, so you should assemble them together before doing data analysis. Sometimes you have to do analysis on many separate files, e.g. time series analysis or log file analysis.
2.1 concatenate separate dataframes as rows
concat(axis=0)
axis=0 is the default; if column names mismatch, NaN is filled in automatically.
import pandas as pd

file_path = '/home/yiddi/worklap/PythonML/Shan-HungWu DL/Pandas/scipy-2017-tutorial-pandas/data/'
df1 = pd.read_csv(file_path+'concat_1.csv')
df2 = pd.read_csv(file_path+'concat_2.csv')
df3 = pd.read_csv(file_path+'concat_3.csv')
print (df1)
print (df2)
print (df3)
A B C D 0 a0 b0 c0 d0 1 a1 b1 c1 d1 2 a2 b2 c2 d2 3 a3 b3 c3 d3 A B C D 0 a4 b4 c4 d4 1 a5 b5 c5 d5 2 a6 b6 c6 d6 3 a7 b7 c7 d7 A B C D 0 a8 b8 c8 d8 1 a9 b9 c9 d9 2 a10 b10 c10 d10 3 a11 b11 c11 d11
2.1.1 concat([df1, df2, …]) + loc
concat() doesn't require the files to match each other; they don't have to be in the same order.
concat() will keep the label each row was given by its original file, so you can use loc[labelNum] to index them.
When you want the 1st row of the concatenated DataFrame, you should use iloc[locationNum].
row_concat = pd.concat([df1, df2, df3])
print (row_concat)
print (type(row_concat))
print (row_concat.loc[0])   # pull out all rows labeled `0`
print (row_concat.iloc[0])  # pull out the 1st row
A B C D 0 a0 b0 c0 d0 1 a1 b1 c1 d1 2 a2 b2 c2 d2 3 a3 b3 c3 d3 0 a4 b4 c4 d4 1 a5 b5 c5 d5 2 a6 b6 c6 d6 3 a7 b7 c7 d7 0 a8 b8 c8 d8 1 a9 b9 c9 d9 2 a10 b10 c10 d10 3 a11 b11 c11 d11 <class 'pandas.core.frame.DataFrame'> A B C D 0 a0 b0 c0 d0 0 a4 b4 c4 d4 0 a8 b8 c8 d8 A a0 B b0 C c0 D d0 Name: 0, dtype: object
2.1.2 Illustration of concat + loc
. >>>>> pd.concat([df1,df2,df3]).loc[0] . . +-------------> labels given by original file . | | . | | . |*0*| a0 b0 c0 d0 | . | 1 | a1 b1 c1 d1 loc[0] . | 2 | a2 b2 c2 d2 | . | 3 | a3 b3 c3 d3 +----------> A B C D . |*0*| a4 b4 c4 d4 |0| a0 b0 c0 d0 . | 1 | a5 b5 c5 d5 |0| a4 b4 c4 d4 . | 2 | a6 b6 c6 d6 |0| a8 b8 c8 d8 . | 3 | a7 b7 c7 d7 . |*0*| a8 b8 c8 d8 . | 1 | a9 b9 c9 d9 . | 2 | a10 b10 c10 d10 . | 3 | a11 b11 c11 d11
2.1.3 concat + reset_index
print (row_concat.reset_index())
index A B C D 0 0 a0 b0 c0 d0 1 1 a1 b1 c1 d1 2 2 a2 b2 c2 d2 3 3 a3 b3 c3 d3 4 0 a4 b4 c4 d4 5 1 a5 b5 c5 d5 6 2 a6 b6 c6 d6 7 3 a7 b7 c7 d7 8 0 a8 b8 c8 d8 9 1 a9 b9 c9 d9 10 2 a10 b10 c10 d10 11 3 a11 b11 c11 d11
2.1.4 concat + Series + DataFrame = bad-formatted df
new_row = pd.Series(['n1', 'n2', 'n3', 'n4'])
print (new_row)
pd.concat([df1, new_row])
A B C D 0 0 a0 b0 c0 d0 NaN 1 a1 b1 c1 d1 NaN 2 a2 b2 c2 d2 NaN 3 a3 b3 c3 d3 NaN 0 NaN NaN NaN NaN n1 1 NaN NaN NaN NaN n2 2 NaN NaN NaN NaN n3 3 NaN NaN NaN NaN n4
2.1.5 concat + DataFrame + DataFrame = well-formatted df
When you want to concatenate a user-defined row to a dataframe, you must specify the same column labels.
When using pd.DataFrame(1st, 2nd) to create your own df, you should specify the 2nd parameter (columns).
new_row_2 = pd.DataFrame([['n1', 'n2', 'n3', 'n4']], columns = ['A', 'B', 'C', 'D'])
print (new_row_2)
A B C D 0 n1 n2 n3 n4
pd.concat( [df1, new_row_2] )
A B C D 0 a0 b0 c0 d0 1 a1 b1 c1 d1 2 a2 b2 c2 d2 3 a3 b3 c3 d3 0 n1 n2 n3 n4
2.1.6 concat + DataFrame + shape-mismatched DataFrame = automatic NaN
NaN will be filled in automatically for the mismatched locations.
new_row_3 = pd.DataFrame([['n1', 'n2', 'n4']], columns = ['A', 'B', 'D'])
print (pd.concat([df1, new_row_3]))
A B C D 0 a0 b0 c0 d0 1 a1 b1 c1 d1 2 a2 b2 c2 d2 3 a3 b3 c3 d3 0 n1 n2 NaN n4
2.2 concatenate separate dataframes as columns
2.2.1 concat(axis=1) + loc/iloc/ix/df[]
Almost the same as concatenating rows; the ONLY difference is concat(..., axis=1).
col_concat = pd.concat([df1, df2, df3], axis = 1)
print ( col_concat )
print ( col_concat['A'] )
print ( col_concat.loc[:,['A']] )
print ( col_concat.ix[:,'A'] )
print ( col_concat.iloc[1, 0:3] )
A B C D A B C D A B C D 0 a0 b0 c0 d0 a4 b4 c4 d4 a8 b8 c8 d8 1 a1 b1 c1 d1 a5 b5 c5 d5 a9 b9 c9 d9 2 a2 b2 c2 d2 a6 b6 c6 d6 a10 b10 c10 d10 3 a3 b3 c3 d3 a7 b7 c7 d7 a11 b11 c11 d11 A A A 0 a0 a4 a8 1 a1 a5 a9 2 a2 a6 a10 3 a3 a7 a11 A A A 0 a0 a4 a8 1 a1 a5 a9 2 a2 a6 a10 3 a3 a7 a11 A A A 0 a0 a4 a8 1 a1 a5 a9 2 a2 a6 a10 3 a3 a7 a11 A a1 B b1 C c1 Name: 1, dtype: object
2.2.2 reset the column-label by the df.columns attribute
df1.columns = ['A', 'B', 'C', 'D']
df2.columns = ['A', 'D', 'E', 'F']
print ( df1 )
print ( df2 )
print ( pd.concat([df1, df2]) )
A B C D 0 a0 b0 c0 d0 1 a1 b1 c1 d1 2 a2 b2 c2 d2 3 a3 b3 c3 d3 A D E F 0 a4 b4 c4 d4 1 a5 b5 c5 d5 2 a6 b6 c6 d6 3 a7 b7 c7 d7 A B C D E F 0 a0 b0 c0 d0 NaN NaN 1 a1 b1 c1 d1 NaN NaN 2 a2 b2 c2 d2 NaN NaN 3 a3 b3 c3 d3 NaN NaN 0 a4 NaN NaN b4 c4 d4 1 a5 NaN NaN b5 c5 d5 2 a6 NaN NaN b6 c6 d6 3 a7 NaN NaN b7 c7 d7
2.2.3 reset the row-label by the df.index attribute
df1.index = [0, 2, 5, 7]
print ( df1 )
print ( pd.concat([df1, df2], axis=1) )
A B C D 0 a0 b0 c0 d0 2 a1 b1 c1 d1 5 a2 b2 c2 d2 7 a3 b3 c3 d3 A B C D A D E F 0 a0 b0 c0 d0 a4 b4 c4 d4 1 NaN NaN NaN NaN a5 b5 c5 d5 2 a1 b1 c1 d1 a6 b6 c6 d6 3 NaN NaN NaN NaN a7 b7 c7 d7 5 a2 b2 c2 d2 NaN NaN NaN NaN 7 a3 b3 c3 d3 NaN NaN NaN NaN
2.2.4 join different files like SQL
Note that join in pandas is NOT the same as join in SQL; merge in pandas is what corresponds to join in SQL.
survey_path = '/home/yiddi/worklap/PythonML/Shan-HungWu DL/Pandas/scipy-2017-tutorial-pandas/data/'
person = pd.read_csv(survey_path + 'survey_person.csv')
site = pd.read_csv(survey_path + 'survey_site.csv')
survey = pd.read_csv(survey_path + 'survey_survey.csv')
visited = pd.read_csv(survey_path + 'survey_visited.csv')
visited_subset = visited.iloc[[0,2,6], :]
print ( site )
print ( visited_subset )
name lat long 0 DR-1 -49.85 -128.57 1 DR-3 -47.15 -126.72 2 MSK-4 -48.87 -123.40 ident site dated 0 619 DR-1 1927-02-08 2 734 DR-3 1939-01-07 6 837 MSK-4 1932-01-14
o2o = pd.merge(site, visited_subset, left_on='name', right_on='site')
print ( o2o )
name lat long ident site dated 0 DR-1 -49.85 -128.57 619 DR-1 1927-02-08 1 DR-3 -47.15 -126.72 734 DR-3 1939-01-07 2 MSK-4 -48.87 -123.40 837 MSK-4 1932-01-14
.Illustration of merge(join) . .pd.merge(site, visted_subset, left_on='name', right_on='site') . ------------- -------------- . . | | . v v . name lat long | ident site dated . 0 DR-1 -49.85 -128.57 | 0 619 DR-1 1927-02-080 . 1 DR-3 -47.15 -126.72 | 2 734 DR-3 1939-01-071 . 2 MSK-4 -48.87 -123.40 | 6 837 MSK-4 1932-01-142 . . =====> mergeto : *both keys(columns) of right and left, are kept in result df* . . *name* lat long ident *site* dated . 0 *DR-1* -49.85 -128.57 619 *DR-1* 1927-02-08 . 1 *DR-3* -47.15 -126.72 734 *DR-3* 1939-01-07 . 2 *MSK-4* -48.87 -123.40 837 *MSK-4* 1932-01-14 .
m2m = pd.merge(site, visited_subset, left_on=['name', 'long'], right_on=['site', 'dated'])
print ( m2m )  # empty: 'long' never equals 'dated', so no row satisfies both keys
Empty DataFrame Columns: [name, lat, long, ident, site, dated] Index: []
m = site.merge(visited, left_on='name', right_on='site')
print (m)
name lat long ident site dated 0 DR-1 -49.85 -128.57 619 DR-1 1927-02-08 1 DR-1 -49.85 -128.57 622 DR-1 1927-02-10 2 DR-1 -49.85 -128.57 844 DR-1 1932-03-22 3 DR-3 -47.15 -126.72 734 DR-3 1939-01-07 4 DR-3 -47.15 -126.72 735 DR-3 1930-01-12 5 DR-3 -47.15 -126.72 751 DR-3 1930-02-26 6 DR-3 -47.15 -126.72 752 DR-3 NaN 7 MSK-4 -48.87 -123.40 837 MSK-4 1932-01-14
. Illustration of merge(join) . . +-------+--------+---------+ +-------+-------+------------+ . | name | lat | long | | ident | site | dated | . | DR-1 | -49.85 | -128.57 | | 619 | DR-1 | 1927-02-08 | . | DR-3 | -47.15 | -126.72 | | 622 | DR-1 | 1927-02-10 | . | MSK-4 | -48.87 | -123.4 | | 734 | DR-3 | 1939-01-07 | . | | | | | 735 | DR-3 | 1930-01-12 | . | | | | | 751 | DR-3 | 1930-02-26 | . | | | | | 752 | DR-3 | | . | | | | | 837 | MSK-4 | 1932-01-14 | . | | | | | 844 | DR-1 | 1932-03-22 | . . . +-------------------------------------+ . v | . | DR-1 | -49.85 | -128.57 | 619 | DR-1 | 1927-02-08 | . | ^ 622 | DR-1 | 1927-02-10 | . | +-------------------------------------- . | . | *many right match left, left will copy many times itself* . | . v . 0 *DR-1* -49.85 -128.57 619 DR-1 1927-02-08 . 1 *DR-1* -49.85 -128.57 622 DR-1 1927-02-10 . Illustration of merge(join) . . +-------+--------+---------+ +-------+-------+------------+ . | name | lat | long | | ident | site | dated | . +-------+--------+---------+ +-------+-------+------------+ . | DR-1 | -49.85 | -128.57 | | 619 | DR-1 | 1927-02-08 | . | DR-3 | -47.15 | -126.72 | | 622 | DR-1 | 1927-02-10 | . | MSK-4 | -48.87 | -123.4 | | 734 | DR-3 | 1939-01-07 | . | | | | | 735 | DR-3 | 1930-01-12 | . | | | | | 751 | DR-3 | 1930-02-26 | . | | | | | 752 | DR-3 | ----+----+ . | | | | | 837 | MSK-4 | 1932-01-14 | | . | | | | | 844 | DR-1 | 1932-03-22 | | . | . =====> merge to: | *missing value will* . | *change to ~NaN~ automaticlly* . +-------------------------------------------------------+ | . | name | lat | long | ident | site | dated | | . +-------------------------------------------------------+ | . 0 | DR-1 | -49.85 | -128.57 | 619 | DR-1 | 1927-02-08 | | . 1 | DR-1 | -49.85 | -128.57 | 622 | DR-1 | 1927-02-10 | | . 2 | DR-1 | -49.85 | -128.57 | 844 | DR-1 | 1932-03-22 | | . 3 | DR-3 | -47.15 | -126.72 | 734 | DR-3 | 1939-01-07 | | . 4 | DR-3 | -47.15 | -126.72 | 735 | DR-3 | 1930-01-12 | | . 5 | DR-3 | -47.15 | -126.72 | 751 | DR-3 | 1930-02-26 | | . 6 | DR-3 | -47.15 | -126.72 | 752 | DR-3 | *NaN* |<-------+
2.2.5 build a DataFrame from a python dict
Note that every item of the dict represents a column of the DataFrame.
df1 = pd.DataFrame({
    'a' : [1,1,1,2,2],            # column_name and column_values
    'b' : [10,20,30,40,50]
})
df2 = pd.DataFrame({
    'a1' : [1,1,2,2,3],           # column_name and column_values
    'b1' : [100,200,300,400,500]
})
df1_merge_df2 = df1.merge(df2, left_on='a', right_on='a1')
print (df1)
print (df2)
print ( df1_merge_df2 )
a b 0 1 10 1 1 20 2 1 30 3 2 40 4 2 50 a1 b1 0 1 100 1 1 200 2 2 300 3 2 400 4 3 500 a b a1 b1 0 1 10 1 100 1 1 10 1 200 2 1 20 1 100 3 1 20 1 200 4 1 30 1 100 5 1 30 1 200 6 2 40 2 300 7 2 40 2 400 8 2 50 2 300 9 2 50 2 400
2.2.6 check the duplicates after concatenation
Use DataFrame.duplicated('columnName') to check whether there are duplicates.
print ( df1.duplicated('a') )
print ( df1_merge_df2.duplicated('a') )
0 False 1 True 2 True 3 False 4 True dtype: bool 0 False 1 True 2 True 3 True 4 True 5 True 6 False 7 True 8 True 9 True dtype: bool
3 handle missing value NaN, part-III
3.1 missing value default representation: np.NaN
from numpy import NaN, NAN, nan
print ( nan == True )   # False
print ( nan == False )  # False
print ( nan == nan )    # False: NaN never compares equal to anything, not even itself
visited = pd.read_csv(survey_path + 'survey_visited.csv')
print ( visited )
ident site dated 0 619 DR-1 1927-02-08 1 622 DR-1 1927-02-10 2 734 DR-3 1939-01-07 3 735 DR-3 1930-01-12 4 751 DR-3 1930-02-26 5 752 DR-3 NaN 6 837 MSK-4 1932-01-14 7 844 DR-1 1932-03-22
3.2 missing value checking: pd.isnull(df)
print ( pd.isnull(nan) )
print ( pd.isnull(42) )
print ( pd.isnull(survey) )  # returns a boolean df with the same shape:
                             # 'True' if missing, 'False' if not
True False taken person quant reading 0 False False False False 1 False False False False 2 False False False False 3 False False False False 4 False False False False 5 False False False False 6 False False False False 7 False False False False 8 False True False False 9 False True False False 10 False False False False 11 False False False False 12 False False False False 13 False False False False 14 False False False False 15 False False False False 16 False False False False 17 False False False False 18 False False False False 19 False False False False 20 False False False False
3.3 missing value default representation change by read_csv(na_values)
In many AI algorithms, you should give missing values some predefined value instead of NaN.
You can control which values are recognized as missing with the na_values parameter of the pd.read_csv() method.
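A minimal sketch of the parameter; the '-999' sentinel below is hypothetical, just to show the idea.

# na_values adds extra strings that should be read as NaN
# ('' is already treated as missing by default)
visited2 = pd.read_csv(survey_path + 'survey_visited.csv', na_values=['-999'])
print ( visited2.info() )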
visited = pd.read_csv(survey_path + 'survey_visited.csv')
survey = pd.read_csv(survey_path + 'survey_survey.csv')
vs = visited.merge(survey, left_on='ident', right_on='taken')
print ( vs )
ident site dated taken person quant reading 0 619 DR-1 1927-02-08 619 dyer rad 9.82 1 619 DR-1 1927-02-08 619 dyer sal 0.13 2 622 DR-1 1927-02-10 622 dyer rad 7.80 3 622 DR-1 1927-02-10 622 dyer sal 0.09 4 734 DR-3 1939-01-07 734 pb rad 8.41 5 734 DR-3 1939-01-07 734 lake sal 0.05 6 734 DR-3 1939-01-07 734 pb temp -21.50 7 735 DR-3 1930-01-12 735 pb rad 7.22 8 735 DR-3 1930-01-12 735 NaN sal 0.06 9 735 DR-3 1930-01-12 735 NaN temp -26.00 10 751 DR-3 1930-02-26 751 pb rad 4.35 11 751 DR-3 1930-02-26 751 pb temp -18.50 12 751 DR-3 1930-02-26 751 lake sal 0.10 13 752 DR-3 NaN 752 lake rad 2.19 14 752 DR-3 NaN 752 lake sal 0.09 15 752 DR-3 NaN 752 lake temp -16.00 16 752 DR-3 NaN 752 roe sal 41.60 17 837 MSK-4 1932-01-14 837 lake rad 1.46 18 837 MSK-4 1932-01-14 837 lake sal 0.21 19 837 MSK-4 1932-01-14 837 roe sal 22.50 20 844 DR-1 1932-03-22 844 roe rad 11.25
3.4 count the non-missing values
df.info() can give you the number of non-null values in each column.
ebola = pd.read_csv( survey_path + 'ebola_country_timeseries.csv')
print ( ebola.info() )
<class 'pandas.core.frame.DataFrame'> RangeIndex: 122 entries, 0 to 121 Data columns (total 18 columns): Date 122 non-null object Day 122 non-null int64 Cases_Guinea 93 non-null float64 Cases_Liberia 83 non-null float64 Cases_SierraLeone 87 non-null float64 Cases_Nigeria 38 non-null float64 Cases_Senegal 25 non-null float64 Cases_UnitedStates 18 non-null float64 Cases_Spain 16 non-null float64 Cases_Mali 12 non-null float64 Deaths_Guinea 92 non-null float64 Deaths_Liberia 81 non-null float64 Deaths_SierraLeone 87 non-null float64 Deaths_Nigeria 38 non-null float64 Deaths_Senegal 22 non-null float64 Deaths_UnitedStates 18 non-null float64 Deaths_Spain 16 non-null float64 Deaths_Mali 12 non-null float64 dtypes: float64(16), int64(1), object(1) memory usage: 17.2+ KB None
3.5 count the different values on each row/column
series.value_counts() gives you the number of occurrences of each value in a column/row.
The result is a sorted Series with the most frequent value on top, in descending order.
You can choose to count the NaN values as well by passing the dropna=False parameter.
print ( ebola['Cases_Guinea'].value_counts().head() )
print ( ebola['Cases_Guinea'].value_counts(dropna=False).head() )  # take NaN into account
                                                                   # count NaN in this row/column
86.0 3 112.0 2 390.0 2 495.0 2 2597.0 1 Name: Cases_Guinea, dtype: int64 NaN 29 86.0 3 495.0 2 112.0 2 390.0 2 Name: Cases_Guinea, dtype: int64 88 [2776. 2775. 2769. nan 2730. 2706. 2695. 2630. 2597. 2571. 2416. 2292. 2164. 2134. 2047. 1971. 1919. 1878. 1760. 1731. 1667. 1906. 1553. 1540. 1519. 1472. 1350. 1298. 1199. 1157. 1074. 1022. 1008. 942. 936. 899. 861. 812. 771. 648. 607. 579. 543. 519. 510. 506. 495. 485. 472. 460. 427. 415. 410. 411. 406. 409. 408. 412. 413. 390. 398. 351. 344. 328. 291. 281. 258. 248. 233. 236. 235. 231. 226. 224. 218. 208. 203. 197. 168. 159. 158. 151. 143. 127. 122. 112. 103. 86. 49.] <class 'pandas.core.frame.DataFrame'> RangeIndex: 122 entries, 0 to 121 Data columns (total 18 columns): Date 122 non-null object Day 122 non-null int64 Cases_Guinea 93 non-null float64 Cases_Liberia 83 non-null float64 Cases_SierraLeone 87 non-null float64 Cases_Nigeria 38 non-null float64 Cases_Senegal 25 non-null float64 Cases_UnitedStates 18 non-null float64 Cases_Spain 16 non-null float64 Cases_Mali 12 non-null float64 Deaths_Guinea 92 non-null float64 Deaths_Liberia 81 non-null float64 Deaths_SierraLeone 87 non-null float64 Deaths_Nigeria 38 non-null float64 Deaths_Senegal 22 non-null float64 Deaths_UnitedStates 18 non-null float64 Deaths_Spain 16 non-null float64 Deaths_Mali 12 non-null float64 dtypes: float64(16), int64(1), object(1) memory usage: 17.2+ KB None
3.6 missing value default representation change by fillna(val)
fillna(val) will produce a new dataframe with every NaN changed to val.
print (ebola.head())
ebola2 = ebola.fillna(0)
print (ebola2.head())
3.7 missing value default representation change by fillna(method)
fillna(method) will produce a new dataframe with every NaN filled in according to a method:

- fillna( method='ffill' ): makes a NaN equal to the upper (forward) value in the same column
  - Note that ffill will leave a NaN on the first row as NaN (there is no upper value in the same column)
- fillna( method='bfill' ): makes a NaN equal to the lower (backward) value in the same column
  - Note that bfill will leave a NaN on the last row as NaN (there is no lower value in the same column)
print (ebola.iloc[1:5, 1:4])
ebola2 = ebola.fillna(0)
print (ebola2.iloc[1:5, 1:4])
ebola3 = ebola.fillna(method='ffill')
print (ebola3.iloc[1:5, 1:4])
ebola4 = ebola.fillna(method='bfill')
print (ebola4.iloc[1:5, 1:4])
Day Cases_Guinea Cases_Liberia 1 288 2775.0 NaN 2 287 2769.0 8166.0 3 286 NaN 8157.0 4 284 2730.0 8115.0 Day Cases_Guinea Cases_Liberia 1 288 2775.0 0.0 2 287 2769.0 8166.0 3 286 0.0 8157.0 4 284 2730.0 8115.0 Day Cases_Guinea Cases_Liberia 1 288 2775.0 NaN 2 287 2769.0 8166.0 3 286 2769.0 8157.0 4 284 2730.0 8115.0 Day Cases_Guinea Cases_Liberia 1 288 2775.0 8166.0 2 287 2769.0 8166.0 3 286 2730.0 8157.0 4 284 2730.0 8115.0
3.8 handling the NaN before mathematical operation
Because NaN +-*/ any value = NaN, NaN propagates like a virus.
So you should handle the NaN first; then you can do computations on the whole df or series.
The sum() method treats NaN as zero (skips it) by default; you can change this by setting the skipna parameter to False.
print (ebola['Cases_Guinea'].sum( skipna=False))
print (ebola2['Cases_Guinea'].sum(skipna=False))
print (ebola3['Cases_Guinea'].sum(skipna=False))
print (ebola4['Cases_Guinea'].sum(skipna=False))
nan 84729.0 124895.0 122627.0
3.9 Illustration of fillna(val) and fillna(method=ffill) and fillna(method=bfill)
Day Cases_Guinea Cases_Liberia
1 288 2775.0
2 287 2769.0
3 286 NaN
4 284 2730.0
ebola2 = ebola.fillna(0)
Day Cases_Guinea
1 288 2775.0
2 287 2769.0
3 286 >>>>>>> NaN = 0.0 <<<<<<< *a fixed value by fillna(0)*
4 284 2730.0
ebola3 = ebola.fillna(method='ffill')
Day Cases_Guinea
1 288 2775.0
2 287 2769.0 <---
3 286 >>>>>>> NaN = 2769.0 <<<<<<< *a non-fixed value equal to the upper one*
4 284 2730.0
ebola4 = ebola.fillna(method='bfill')
Day Cases_Guinea
1 288 2775.0
2 287 2769.0
3 286 >>>>>>> NaN = 2730.0 <<<<<<< *a non-fixed value equal to the lower one*
4 284 2730.0 <---
3.10 replace A,B and C with NaN
df.replace(1st, 2nd) is a general method to do replacement:
- 1st is the list of all values in df that you want to replace.
- 2nd is the value you want to replace them with (here it is NaN).
before_replace_df = pd.DataFrame({
    'a' : [1.0, 2.0, 88, 99],
    'b' : [3.0, NaN, 999, 5]
})
print (before_replace_df)
after_replace_df = before_replace_df.replace(to_replace=[88, 99, 5], value=NaN)
print (after_replace_df)
a b 0 1.0 3.0 1 2.0 NaN 2 88.0 999.0 3 99.0 5.0 a b 0 1.0 3.0 1 2.0 NaN 2 NaN 999.0 3 NaN NaN
3.11 nunique() vs. unique() vs. value_counts() vs. info()
method: | return: |
---|---|
df/series.nunique | number of unique values |
series.unique | lists all unique values |
series.value_counts | lists the occurrence count of each unique value, in descending order |
df.info | number of non-null values |
print ( ebola['Cases_Guinea'].value_counts().head() )
print ( ebola['Cases_Guinea'].value_counts(dropna=False).head() )  # take NaN into account
print ( ebola['Cases_Guinea'].nunique() )  # number of unique values (each value counted once)
print ( ebola['Cases_Guinea'].unique() )
print ( ebola.info() )
4 Tidy Data, part-IV
Reference: the paper on what tidy data is.
- Each variable forms a column;
- Each observation forms a row;
- Each type of observation unit forms a table.
4.1 tidy your data
Tidying your data is very different from presenting your data; if your data is tidy, you can do transformations very quickly.
4.1.1 non-tidy, exp1: enumerate all the sub-range as column name
Here pew.csv is not a clean data set: the column names 'religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k $75-100k $100-150k >150k Don't know/refused' are themselves values of a variable (income), and "<$10k" is a badly-formatted variable name.
pew = pd.read_csv(file_path + 'pew.csv')
print ( pew )
religion <$10k $10-20k $20-30k $30-40k $40-50k \ 0 Agnostic 27 34 60 81 76 1 Atheist 12 27 37 52 35 2 Buddhist 27 21 30 34 33 3 Catholic 418 617 732 670 638 4 Don’t know/refused 15 14 15 11 10 5 Evangelical Prot 575 869 1064 982 881 6 Hindu 1 9 7 9 11 7 Historically Black Prot 228 244 236 238 197 8 Jehovah's Witness 20 27 24 24 21 9 Jewish 19 19 25 25 30 10 Mainline Prot 289 495 619 655 651 11 Mormon 29 40 48 51 56 12 Muslim 6 7 9 10 9 13 Orthodox 13 17 23 32 32 14 Other Christian 9 7 11 13 13 15 Other Faiths 20 33 40 46 49 16 Other World Religions 5 2 3 4 2 17 Unaffiliated 217 299 374 365 341 $50-75k $75-100k $100-150k >150k Don't know/refused 0 137 122 109 84 96 1 70 73 59 74 76 2 58 62 39 53 54 3 1116 949 792 633 1489 4 35 21 17 18 116 5 1486 949 723 414 1529 6 34 47 48 54 37 7 223 131 81 78 339 8 30 15 11 6 37 9 95 69 87 151 162 10 1107 939 753 634 1328 11 112 85 49 42 69 12 23 16 8 6 22 13 47 38 42 46 73 14 14 18 14 12 18 15 63 46 40 41 71 16 7 3 4 4 8 17 528 407 321 258 597
4.1.2 make horizontal spreading vertical by pd.melt
melt turns horizontal spreading into vertical
melted_data = pd.melt(frame=pew, id_vars='religion')
melted_data2 = pd.melt(frame=pew, id_vars='religion', var_name='income', value_name='count')
print ( pew )
print ( melted_data2 )
religion <$10k $10-20k $20-30k $30-40k $40-50k \ 0 Agnostic 27 34 60 81 76 1 Atheist 12 27 37 52 35 2 Buddhist 27 21 30 34 33 3 Catholic 418 617 732 670 638 4 Don’t know/refused 15 14 15 11 10 5 Evangelical Prot 575 869 1064 982 881 6 Hindu 1 9 7 9 11 7 Historically Black Prot 228 244 236 238 197 8 Jehovah's Witness 20 27 24 24 21 9 Jewish 19 19 25 25 30 10 Mainline Prot 289 495 619 655 651 11 Mormon 29 40 48 51 56 12 Muslim 6 7 9 10 9 13 Orthodox 13 17 23 32 32 14 Other Christian 9 7 11 13 13 15 Other Faiths 20 33 40 46 49 16 Other World Religions 5 2 3 4 2 17 Unaffiliated 217 299 374 365 341 $50-75k $75-100k $100-150k >150k Don't know/refused 0 137 122 109 84 96 1 70 73 59 74 76 2 58 62 39 53 54 3 1116 949 792 633 1489 4 35 21 17 18 116 5 1486 949 723 414 1529 6 34 47 48 54 37 7 223 131 81 78 339 8 30 15 11 6 37 9 95 69 87 151 162 10 1107 939 753 634 1328 11 112 85 49 42 69 12 23 16 8 6 22 13 47 38 42 46 73 14 14 18 14 12 18 15 63 46 40 41 71 16 7 3 4 4 8 17 528 407 321 258 597 religion income count 0 Agnostic <$10k 27 1 Atheist <$10k 12 2 Buddhist <$10k 27 3 Catholic <$10k 418 4 Don’t know/refused <$10k 15 5 Evangelical Prot <$10k 575 6 Hindu <$10k 1 7 Historically Black Prot <$10k 228 8 Jehovah's Witness <$10k 20 9 Jewish <$10k 19 10 Mainline Prot <$10k 289 11 Mormon <$10k 29 12 Muslim <$10k 6 13 Orthodox <$10k 13 14 Other Christian <$10k 9 15 Other Faiths <$10k 20 16 Other World Religions <$10k 5 17 Unaffiliated <$10k 217 18 Agnostic $10-20k 34 19 Atheist $10-20k 27 20 Buddhist $10-20k 21 21 Catholic $10-20k 617 22 Don’t know/refused $10-20k 14 23 Evangelical Prot $10-20k 869 24 Hindu $10-20k 9 25 Historically Black Prot $10-20k 244 26 Jehovah's Witness $10-20k 27 27 Jewish $10-20k 19 28 Mainline Prot $10-20k 495 29 Mormon $10-20k 40 .. ... ... ... 150 Hindu >150k 54 151 Historically Black Prot >150k 78 152 Jehovah's Witness >150k 6 153 Jewish >150k 151 154 Mainline Prot >150k 634 155 Mormon >150k 42 156 Muslim >150k 6 157 Orthodox >150k 46 158 Other Christian >150k 12 159 Other Faiths >150k 41 160 Other World Religions >150k 4 161 Unaffiliated >150k 258 162 Agnostic Don't know/refused 96 163 Atheist Don't know/refused 76 164 Buddhist Don't know/refused 54 165 Catholic Don't know/refused 1489 166 Don’t know/refused Don't know/refused 116 167 Evangelical Prot Don't know/refused 1529 168 Hindu Don't know/refused 37 169 Historically Black Prot Don't know/refused 339 170 Jehovah's Witness Don't know/refused 37 171 Jewish Don't know/refused 162 172 Mainline Prot Don't know/refused 1328 173 Mormon Don't know/refused 69 174 Muslim Don't know/refused 22 175 Orthodox Don't know/refused 73 176 Other Christian Don't know/refused 18 177 Other Faiths Don't know/refused 71 178 Other World Religions Don't know/refused 8 179 Unaffiliated Don't know/refused 597 [180 rows x 3 columns]
4.1.3 Illustration of pd.melt
melt turns horizontal spreading into vertical
<$10k | $10-20k | $20-30k | $30-40k | $40-50k | $50-75k | $75-100k | $100-150k | >150k |
---|---|---|---|---|---|---|---|---|
… | … | … | … | … | … | … | … | … |
pd.melt
variable |
---|
<$10k |
<$10k |
<$10k |
<$10k |
… |
$10-20k |
$10-20k |
$10-20k |
$10-20k |
… |
$20-30k |
$20-30k |
$20-30k |
$20-30k |
1. ~id_vars~ keeps those columns unchanged
2. a new column called ~variable~ is added
3. each column name is copied n times as the value of this new column
4. the original columns are collapsed into a column named ~value~
. ~melted_data = pd.melt(frame=pew, id_vars='religion', var_name='income', value_name='count')~ . ---------.-------- ---------.------ ---------.------- . | | | . +----------------------------+ 1. ~id_vars~ keep| the column unchan|ged . | | | . v | | . | | religion | <$10k | | | . |---+--------------------+-------| | | . | 0 | Agnostic | 27 | | | . | 1 | Atheist | 12 | | | . | 2 | Buddhist | 27 | | | . | 3 | Catholic | 418 | | | . | 4 | Don’t know/refused | 15 | | | . | 5 | Evangelical Prot | 575 | | | . | | . +------------ 2. add a new colu|mn called ~variabl|e~ . v 2.1 you can set/ its name by para ~v|ar_name~ . | | religion | variable | <$10k | | . |---+--------------------+-----------+-------| | . | 0 | Agnostic | | 27 | | . | 1 | Atheist | | 12 | | . | 2 | Buddhist | | 27 | | . | 3 | Catholic | | 418 | | . | 4 | Don’t know/refused | | 15 | | . | 5 | Evangelical Prot | | 575 | | . | 6 | Hindu | | 1 | | . | . +--------- 3. column_name copy n times as value| of this new column . | | . | | religion | variabl|e | <$10k | | . |---+--------------------+--------+--+-------| | . | 0 | Agnostic | <$10k <| | 27 | | . | 1 | Atheist | <$10k | 12 | | . | 2 | Buddhist | <$10k | 27 | | . | 3 | Catholic | <$10k | 418 | | . | 4 | Don’t know/refused | <$10k | 15 | | . | 5 | Evangelical Prot | <$10k | 575 | | . | 6 | Hindu | <$10k | 1 | | . #+TBLNAME: pew_after_melt | . | . +-- 4. change orginial column to name ~v|alue~ . v 4.1 you can set/ its name by para ~value_name~ . | | religion | variable | value | . |---+--------------------+-----------+-------| . | 0 | Agnostic | <$10k | 27 | . | 1 | Atheist | <$10k | 12 | . | 2 | Buddhist | <$10k | 27 | . | 3 | Catholic | <$10k | 418 | . | 4 | Don’t know/refused | <$10k | 15 | . | 5 | Evangelical Prot | <$10k | 575 | . | 6 | Hindu | <$10k | 1 |
4.1.4 non-tidy, exp2: enumerate all the time-period as column name
billboard = pd.read_csv(file_path + 'billboard.csv')
print (billboard.head())
year artist track time date.entered wk1 wk2 \ 0 2000 2 Pac Baby Don't Cry (Keep... 4:22 2000-02-26 87 82.0 1 2000 2Ge+her The Hardest Part Of ... 3:15 2000-09-02 91 87.0 2 2000 3 Doors Down Kryptonite 3:53 2000-04-08 81 70.0 3 2000 3 Doors Down Loser 4:24 2000-10-21 76 76.0 4 2000 504 Boyz Wobble Wobble 3:35 2000-04-15 57 34.0 wk3 wk4 wk5 ... wk67 wk68 wk69 wk70 wk71 wk72 wk73 wk74 \ 0 72.0 77.0 87.0 ... NaN NaN NaN NaN NaN NaN NaN NaN 1 92.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN 2 68.0 67.0 66.0 ... NaN NaN NaN NaN NaN NaN NaN NaN 3 72.0 69.0 67.0 ... NaN NaN NaN NaN NaN NaN NaN NaN 4 25.0 17.0 17.0 ... NaN NaN NaN NaN NaN NaN NaN NaN wk75 wk76 0 NaN NaN 1 NaN NaN 2 NaN NaN 3 NaN NaN 4 NaN NaN [5 rows x 81 columns]
4.1.5 make horizontal spreading vertical by pd.melt
melt turns horizontal spreading into vertical
billboard_after_melt = pd.melt(billboard,
                               id_vars=['year', 'artist', 'track', 'time', 'date.entered'],
                               var_name='week',
                               value_name='rank')
print (billboard_after_melt)
year artist track time date.entered \ 0 2000 2 Pac Baby Don't Cry (Keep... 4:22 2000-02-26 1 2000 2Ge+her The Hardest Part Of ... 3:15 2000-09-02 2 2000 3 Doors Down Kryptonite 3:53 2000-04-08 3 2000 3 Doors Down Loser 4:24 2000-10-21 4 2000 504 Boyz Wobble Wobble 3:35 2000-04-15 5 2000 98^0 Give Me Just One Nig... 3:24 2000-08-19 6 2000 A*Teens Dancing Queen 3:44 2000-07-08 7 2000 Aaliyah I Don't Wanna 4:15 2000-01-29 8 2000 Aaliyah Try Again 4:03 2000-03-18 9 2000 Adams, Yolanda Open My Heart 5:30 2000-08-26 10 2000 Adkins, Trace More 3:05 2000-04-29 11 2000 Aguilera, Christina Come On Over Baby (A... 3:38 2000-08-05 12 2000 Aguilera, Christina I Turn To You 4:00 2000-04-15 13 2000 Aguilera, Christina What A Girl Wants 3:18 1999-11-27 14 2000 Alice Deejay Better Off Alone 6:50 2000-04-08 15 2000 Allan, Gary Smoke Rings In The D... 4:18 2000-01-22 16 2000 Amber Sexual 4:38 1999-07-17 17 2000 Anastacia I'm Outta Love 4:01 2000-04-01 18 2000 Anthony, Marc My Baby You 3:59 2000-09-16 19 2000 Anthony, Marc You Sang To Me 3:50 2000-02-26 20 2000 Avant My First Love 4:28 2000-11-04 21 2000 Avant Separated 4:13 2000-04-29 22 2000 BBMak Back Here 3:35 2000-04-29 23 2000 Backstreet Boys, The Shape Of My Heart 3:49 2000-10-14 24 2000 Backstreet Boys, The Show Me The Meaning ... 3:54 2000-01-01 25 2000 Backstreet Boys, The The One 3:46 2000-05-27 26 2000 Badu, Erkyah Bag Lady 5:03 2000-08-19 27 2000 Baha Men Who Let The Dogs Out 3:17 2000-07-22 28 2000 Barenaked Ladies Pinch Me 3:46 2000-09-09 29 2000 Beenie Man Girls Dem Sugar 4:17 2000-10-21 ... ... ... ... ... 24062 2000 Thomas, Carl Emotional 4:31 2000-11-25 24063 2000 Thomas, Carl I Wish 3:50 2000-03-25 24064 2000 Thomas, Carl Summer Rain 4:57 2000-09-23 24065 2000 Tippin, Aaron Kiss This 2:53 2000-08-26 24066 2000 Train Meet Virginia 3:55 1999-10-09 24067 2000 Trick Daddy Shut Up 4:17 2000-05-20 24068 2000 Trina Pull Over 2:58 2000-09-09 24069 2000 Tritt, Travis Best Of Intentions 4:15 2000-08-19 24070 2000 Tuesday I Know 4:06 2000-12-30 24071 2000 Urban, Keith Your Everything 4:10 2000-07-15 24072 2000 Usher Pop Ya Collar 3:36 2000-11-04 24073 2000 Vassar, Phil Carlene 4:07 2000-03-04 24074 2000 Vassar, Phil Just Another Day In ... 3:54 2000-09-30 24075 2000 Vertical Horizon Everything You Want 4:01 2000-01-22 24076 2000 Vertical Horizon You're A God 3:45 2000-08-26 24077 2000 Vitamin C Graduation (Friends ... 4:23 2000-04-15 24078 2000 Vitamin C The Itch 3:30 2000-12-02 24079 2000 Walker, Clay Live, Laugh, Love 4:06 1999-12-04 24080 2000 Walker, Clay The Chain Of Love 5:03 2000-04-15 24081 2000 Wallflowers, The Sleepwalker 3:29 2000-10-28 24082 2000 Westlife Swear It Again 4:07 2000-04-01 24083 2000 Williams, Robbie Angels 3:56 1999-11-20 24084 2000 Wills, Mark Back At One 4:00 2000-01-15 24085 2000 Worley, Darryl When You Need My Lov... 3:35 2000-06-17 24086 2000 Wright, Chely It Was 3:51 2000-03-04 24087 2000 Yankee Grey Another Nine Minutes 3:10 2000-04-29 24088 2000 Yearwood, Trisha Real Live Woman 3:55 2000-04-01 24089 2000 Ying Yang Twins Whistle While You Tw... 4:19 2000-03-18 24090 2000 Zombie Nation Kernkraft 400 3:30 2000-09-02 24091 2000 matchbox twenty Bent 4:12 2000-04-29 week rank 0 wk1 87.0 1 wk1 91.0 2 wk1 81.0 3 wk1 76.0 4 wk1 57.0 5 wk1 51.0 6 wk1 97.0 7 wk1 84.0 8 wk1 59.0 9 wk1 76.0 10 wk1 84.0 11 wk1 57.0 12 wk1 50.0 13 wk1 71.0 14 wk1 79.0 15 wk1 80.0 16 wk1 99.0 17 wk1 92.0 18 wk1 82.0 19 wk1 77.0 20 wk1 70.0 21 wk1 62.0 22 wk1 99.0 23 wk1 39.0 24 wk1 74.0 25 wk1 58.0 26 wk1 67.0 27 wk1 99.0 28 wk1 77.0 29 wk1 72.0 ... 
... 24062 wk76 NaN 24063 wk76 NaN 24064 wk76 NaN 24065 wk76 NaN 24066 wk76 NaN 24067 wk76 NaN 24068 wk76 NaN 24069 wk76 NaN 24070 wk76 NaN 24071 wk76 NaN 24072 wk76 NaN 24073 wk76 NaN 24074 wk76 NaN 24075 wk76 NaN 24076 wk76 NaN 24077 wk76 NaN 24078 wk76 NaN 24079 wk76 NaN 24080 wk76 NaN 24081 wk76 NaN 24082 wk76 NaN 24083 wk76 NaN 24084 wk76 NaN 24085 wk76 NaN 24086 wk76 NaN 24087 wk76 NaN 24088 wk76 NaN 24089 wk76 NaN 24090 wk76 NaN 24091 wk76 NaN [24092 rows x 7 columns]
4.1.6 non-tidy, exp3: enumerate all the age-range as column name
Sometimes you want to handle a Series as a String, and you can do this with Series.str; the result is again a Series, so it combines naturally with any other expression that expects a Series.
tb = pd.read_csv(file_path + 'tb.csv') print ( tb )
iso2 year m04 m514 m014 m1524 m2534 m3544 m4554 m5564 \ 0 AD 1989 NaN NaN NaN NaN NaN NaN NaN NaN 1 AD 1990 NaN NaN NaN NaN NaN NaN NaN NaN 2 AD 1991 NaN NaN NaN NaN NaN NaN NaN NaN 3 AD 1992 NaN NaN NaN NaN NaN NaN NaN NaN 4 AD 1993 NaN NaN NaN NaN NaN NaN NaN NaN 5 AD 1994 NaN NaN NaN NaN NaN NaN NaN NaN 6 AD 1996 NaN NaN 0.0 0.0 0.0 4.0 1.0 0.0 7 AD 1997 NaN NaN 0.0 0.0 1.0 2.0 2.0 1.0 8 AD 1998 NaN NaN 0.0 0.0 0.0 1.0 0.0 0.0 9 AD 1999 NaN NaN 0.0 0.0 0.0 1.0 1.0 0.0 10 AD 2000 NaN NaN 0.0 0.0 1.0 0.0 0.0 0.0 11 AD 2001 NaN NaN 0.0 NaN NaN 2.0 1.0 NaN 12 AD 2002 NaN NaN 0.0 0.0 0.0 1.0 0.0 0.0 13 AD 2003 NaN NaN 0.0 0.0 0.0 1.0 2.0 0.0 14 AD 2004 NaN NaN 0.0 0.0 0.0 1.0 1.0 0.0 15 AD 2005 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 16 AD 2006 0.0 0.0 0.0 1.0 1.0 2.0 0.0 1.0 17 AD 2007 NaN NaN NaN NaN NaN NaN NaN NaN 18 AD 2008 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 19 AE 1980 NaN NaN NaN NaN NaN NaN NaN NaN 20 AE 1981 NaN NaN NaN NaN NaN NaN NaN NaN 21 AE 1982 NaN NaN NaN NaN NaN NaN NaN NaN 22 AE 1983 NaN NaN NaN NaN NaN NaN NaN NaN 23 AE 1984 NaN NaN NaN NaN NaN NaN NaN NaN 24 AE 1985 NaN NaN NaN NaN NaN NaN NaN NaN 25 AE 1986 NaN NaN NaN NaN NaN NaN NaN NaN 26 AE 1987 NaN NaN NaN NaN NaN NaN NaN NaN 27 AE 1988 NaN NaN NaN NaN NaN NaN NaN NaN 28 AE 1989 NaN NaN NaN NaN NaN NaN NaN NaN 29 AE 1990 NaN NaN NaN NaN NaN NaN NaN NaN ... ... ... ... ... ... ... ... ... ... 5739 ZM 2008 NaN NaN 101.0 1120.0 3244.0 2094.0 737.0 299.0 5740 ZW 1980 NaN NaN NaN NaN NaN NaN NaN NaN 5741 ZW 1981 NaN NaN NaN NaN NaN NaN NaN NaN 5742 ZW 1982 NaN NaN NaN NaN NaN NaN NaN NaN 5743 ZW 1983 NaN NaN NaN NaN NaN NaN NaN NaN 5744 ZW 1984 NaN NaN NaN NaN NaN NaN NaN NaN 5745 ZW 1985 NaN NaN NaN NaN NaN NaN NaN NaN 5746 ZW 1986 NaN NaN NaN NaN NaN NaN NaN NaN 5747 ZW 1987 NaN NaN NaN NaN NaN NaN NaN NaN 5748 ZW 1988 NaN NaN NaN NaN NaN NaN NaN NaN 5749 ZW 1989 NaN NaN NaN NaN NaN NaN NaN NaN 5750 ZW 1990 NaN NaN NaN NaN NaN NaN NaN NaN 5751 ZW 1991 NaN NaN NaN NaN NaN NaN NaN NaN 5752 ZW 1992 NaN NaN NaN NaN NaN NaN NaN NaN 5753 ZW 1993 NaN NaN NaN NaN NaN NaN NaN NaN 5754 ZW 1994 NaN NaN NaN NaN NaN NaN NaN NaN 5755 ZW 1995 NaN NaN NaN NaN NaN NaN NaN NaN 5756 ZW 1996 NaN NaN NaN NaN NaN NaN NaN NaN 5757 ZW 1997 NaN NaN NaN NaN NaN NaN NaN NaN 5758 ZW 1998 NaN NaN NaN NaN NaN NaN NaN NaN 5759 ZW 1999 NaN NaN NaN NaN NaN NaN NaN NaN 5760 ZW 2000 NaN NaN NaN NaN NaN NaN NaN NaN 5761 ZW 2001 NaN NaN NaN NaN NaN NaN NaN NaN 5762 ZW 2002 NaN NaN 191.0 600.0 2548.0 1662.0 744.0 315.0 5763 ZW 2003 NaN NaN 133.0 874.0 3048.0 2228.0 981.0 367.0 5764 ZW 2004 NaN NaN 187.0 833.0 2908.0 2298.0 1056.0 366.0 5765 ZW 2005 NaN NaN 210.0 837.0 2264.0 1855.0 762.0 295.0 5766 ZW 2006 NaN NaN 215.0 736.0 2391.0 1939.0 896.0 348.0 5767 ZW 2007 6.0 132.0 138.0 500.0 3693.0 0.0 716.0 292.0 5768 ZW 2008 NaN NaN 127.0 614.0 0.0 3316.0 704.0 263.0 ... f04 f514 f014 f1524 f2534 f3544 f4554 f5564 f65 fu 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 3 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 6 ... NaN NaN 0.0 1.0 1.0 0.0 0.0 1.0 0.0 NaN 7 ... NaN NaN 0.0 1.0 2.0 3.0 0.0 0.0 1.0 NaN 8 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 9 ... NaN NaN 0.0 0.0 0.0 1.0 0.0 0.0 0.0 NaN 10 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 11 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 12 ... NaN NaN 0.0 1.0 0.0 0.0 0.0 0.0 0.0 NaN 13 ... NaN NaN 0.0 1.0 1.0 1.0 0.0 0.0 0.0 NaN 14 ... 
NaN NaN 0.0 0.0 1.0 0.0 0.0 0.0 0.0 NaN 15 ... 0.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 16 ... 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 17 ... NaN NaN NaN NaN 1.0 NaN 1.0 NaN NaN NaN 18 ... 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 19 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 20 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 21 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 22 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 23 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 24 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 25 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 26 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 27 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 28 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 29 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... ... ... ... ... ... ... ... ... ... ... 5739 ... NaN NaN 165.0 1246.0 2062.0 1114.0 498.0 187.0 115.0 0.0 5740 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5741 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5742 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5743 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5744 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5745 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5746 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5747 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5748 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5749 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5750 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5751 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5752 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5753 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5754 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5755 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5756 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5757 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5758 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5759 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5760 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5761 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5762 ... NaN NaN 222.0 914.0 2185.0 1095.0 421.0 140.0 65.0 NaN 5763 ... NaN NaN 180.0 1232.0 2856.0 1480.0 565.0 225.0 114.0 NaN 5764 ... NaN NaN 225.0 1140.0 2858.0 1565.0 622.0 214.0 111.0 NaN 5765 ... NaN NaN 269.0 1136.0 2242.0 1255.0 578.0 193.0 603.0 NaN 5766 ... NaN NaN 237.0 1020.0 2424.0 1355.0 632.0 230.0 96.0 NaN 5767 ... 7.0 178.0 185.0 739.0 3311.0 0.0 553.0 213.0 90.0 NaN 5768 ... NaN NaN 145.0 840.0 0.0 2890.0 467.0 174.0 105.0 0.0 [5769 rows x 22 columns]
column | meaning |
---|---|
m04 | male, 0 ~ 4 years old |
m514 | male, 5 ~ 14 years old |
… | … |
f04 | female, 0 ~ 4 years old |
f514 | female, 5 ~ 14 years old |
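The tidying of tb itself is not repeated here, but the same melt + Series.str idea applies. A minimal sketch, assuming tb.csv has the layout above; the helper names sex_age, sex and age_range are made up for illustration:

tb_melt = pd.melt(tb, id_vars=['iso2', 'year'],
                  var_name='sex_age', value_name='cases')
# the first character of the old column name encodes sex, the rest encodes the age range
tb_melt['sex'] = tb_melt['sex_age'].str.get(0)          # 'm' or 'f'
tb_melt['age_range'] = tb_melt['sex_age'].str.slice(1)  # '04', '514', '014', ...
print(tb_melt.head())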
ebola_non_tidy = pd.read_csv(file_path + 'ebola_country_timeseries.csv') print ( ebola_non_tidy )
Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone \ 0 1/5/2015 289 2776.0 NaN 10030.0 1 1/4/2015 288 2775.0 NaN 9780.0 2 1/3/2015 287 2769.0 8166.0 9722.0 3 1/2/2015 286 NaN 8157.0 NaN 4 12/31/2014 284 2730.0 8115.0 9633.0 5 12/28/2014 281 2706.0 8018.0 9446.0 6 12/27/2014 280 2695.0 NaN 9409.0 7 12/24/2014 277 2630.0 7977.0 9203.0 8 12/21/2014 273 2597.0 NaN 9004.0 9 12/20/2014 272 2571.0 7862.0 8939.0 10 12/18/2014 271 NaN 7830.0 NaN 11 12/14/2014 267 2416.0 NaN 8356.0 12 12/9/2014 262 NaN 7797.0 NaN 13 12/7/2014 260 2292.0 NaN 7897.0 14 12/3/2014 256 NaN 7719.0 NaN 15 11/30/2014 253 2164.0 NaN 7312.0 16 11/28/2014 251 NaN 7635.0 NaN 17 11/23/2014 246 2134.0 NaN 6599.0 18 11/22/2014 245 NaN 7168.0 NaN 19 11/18/2014 241 2047.0 7082.0 6190.0 20 11/16/2014 239 1971.0 NaN 6073.0 21 11/15/2014 238 NaN 7069.0 NaN 22 11/11/2014 234 1919.0 NaN 5586.0 23 11/10/2014 233 NaN 6878.0 NaN 24 11/9/2014 232 1878.0 NaN 5368.0 25 11/8/2014 231 NaN 6822.0 NaN 26 11/4/2014 227 NaN 6619.0 4862.0 27 11/3/2014 226 1760.0 NaN NaN 28 11/2/2014 225 1731.0 NaN 4759.0 29 10/31/2014 222 NaN 6525.0 NaN .. ... ... ... ... ... 92 5/23/2014 62 258.0 12.0 0.0 93 5/12/2014 51 248.0 12.0 0.0 94 5/10/2014 49 233.0 12.0 0.0 95 5/7/2014 46 236.0 13.0 0.0 96 5/5/2014 44 235.0 13.0 0.0 97 5/3/2014 42 231.0 13.0 0.0 98 5/1/2014 40 226.0 13.0 0.0 99 4/26/2014 35 224.0 NaN 0.0 100 4/24/2014 33 NaN 35.0 0.0 101 4/23/2014 32 218.0 NaN 0.0 102 4/22/2014 31 NaN NaN 0.0 103 4/21/2014 30 NaN 34.0 NaN 104 4/20/2014 29 208.0 NaN NaN 105 4/17/2014 26 203.0 27.0 NaN 106 4/16/2014 25 197.0 27.0 NaN 107 4/15/2014 24 NaN NaN 12.0 108 4/14/2014 23 168.0 NaN NaN 109 4/11/2014 20 159.0 26.0 2.0 110 4/9/2014 18 158.0 25.0 2.0 111 4/7/2014 16 151.0 21.0 2.0 112 4/4/2014 13 143.0 18.0 2.0 113 4/1/2014 10 127.0 8.0 2.0 114 3/31/2014 9 122.0 8.0 2.0 115 3/29/2014 7 112.0 7.0 NaN 116 3/28/2014 6 112.0 3.0 2.0 117 3/27/2014 5 103.0 8.0 6.0 118 3/26/2014 4 86.0 NaN NaN 119 3/25/2014 3 86.0 NaN NaN 120 3/24/2014 2 86.0 NaN NaN 121 3/22/2014 0 49.0 NaN NaN Cases_Nigeria Cases_Senegal Cases_UnitedStates Cases_Spain \ 0 NaN NaN NaN NaN 1 NaN NaN NaN NaN 2 NaN NaN NaN NaN 3 NaN NaN NaN NaN 4 NaN NaN NaN NaN 5 NaN NaN NaN NaN 6 NaN NaN NaN NaN 7 NaN NaN NaN NaN 8 NaN NaN NaN NaN 9 NaN NaN NaN NaN 10 NaN NaN NaN NaN 11 NaN NaN NaN NaN 12 NaN NaN NaN NaN 13 20.0 1.0 4.0 1.0 14 NaN NaN NaN NaN 15 20.0 1.0 4.0 1.0 16 NaN NaN NaN NaN 17 20.0 1.0 4.0 1.0 18 NaN NaN NaN NaN 19 20.0 1.0 4.0 1.0 20 20.0 1.0 4.0 1.0 21 NaN NaN NaN NaN 22 20.0 1.0 4.0 1.0 23 NaN NaN NaN NaN 24 20.0 1.0 4.0 1.0 25 NaN NaN NaN NaN 26 20.0 1.0 4.0 1.0 27 NaN NaN NaN NaN 28 20.0 1.0 4.0 1.0 29 NaN NaN NaN NaN .. ... ... ... ... 
92 NaN NaN NaN NaN 93 NaN NaN NaN NaN 94 NaN NaN NaN NaN 95 NaN NaN NaN NaN 96 NaN NaN NaN NaN 97 NaN NaN NaN NaN 98 NaN NaN NaN NaN 99 NaN NaN NaN NaN 100 NaN NaN NaN NaN 101 NaN NaN NaN NaN 102 NaN NaN NaN NaN 103 NaN NaN NaN NaN 104 NaN NaN NaN NaN 105 NaN NaN NaN NaN 106 NaN NaN NaN NaN 107 NaN NaN NaN NaN 108 NaN NaN NaN NaN 109 NaN NaN NaN NaN 110 NaN NaN NaN NaN 111 NaN NaN NaN NaN 112 NaN NaN NaN NaN 113 NaN NaN NaN NaN 114 NaN NaN NaN NaN 115 NaN NaN NaN NaN 116 NaN NaN NaN NaN 117 NaN NaN NaN NaN 118 NaN NaN NaN NaN 119 NaN NaN NaN NaN 120 NaN NaN NaN NaN 121 NaN NaN NaN NaN Cases_Mali Deaths_Guinea Deaths_Liberia Deaths_SierraLeone \ 0 NaN 1786.0 NaN 2977.0 1 NaN 1781.0 NaN 2943.0 2 NaN 1767.0 3496.0 2915.0 3 NaN NaN 3496.0 NaN 4 NaN 1739.0 3471.0 2827.0 5 NaN 1708.0 3423.0 2758.0 6 NaN 1697.0 NaN 2732.0 7 NaN NaN 3413.0 2655.0 8 NaN 1607.0 NaN 2582.0 9 NaN 1586.0 3384.0 2556.0 10 NaN NaN 3376.0 NaN 11 NaN 1525.0 NaN 2085.0 12 NaN NaN 3290.0 NaN 13 7.0 1428.0 NaN 1768.0 14 NaN NaN 3177.0 NaN 15 7.0 1327.0 NaN 1583.0 16 NaN NaN 3145.0 NaN 17 7.0 1260.0 NaN 1398.0 18 NaN NaN 3016.0 NaN 19 6.0 1214.0 2963.0 1267.0 20 5.0 1192.0 NaN 1250.0 21 NaN NaN 2964.0 NaN 22 4.0 1166.0 NaN 1187.0 23 NaN NaN 2812.0 NaN 24 1.0 1142.0 NaN 1169.0 25 NaN NaN 2836.0 NaN 26 1.0 NaN 2766.0 1130.0 27 NaN 1054.0 NaN NaN 28 1.0 1041.0 NaN 1070.0 29 NaN NaN 2697.0 NaN .. ... ... ... ... 92 NaN 174.0 11.0 0.0 93 NaN 171.0 11.0 0.0 94 NaN 157.0 11.0 0.0 95 NaN 158.0 11.0 0.0 96 NaN 157.0 11.0 0.0 97 NaN 155.0 11.0 0.0 98 NaN 149.0 11.0 0.0 99 NaN 143.0 NaN 0.0 100 NaN NaN NaN 0.0 101 NaN 141.0 NaN 0.0 102 NaN NaN NaN 0.0 103 NaN NaN 11.0 NaN 104 NaN 136.0 6.0 NaN 105 NaN 129.0 NaN NaN 106 NaN 122.0 13.0 NaN 107 NaN NaN NaN NaN 108 NaN 108.0 NaN NaN 109 NaN 106.0 13.0 2.0 110 NaN 101.0 12.0 2.0 111 NaN 95.0 10.0 2.0 112 NaN 86.0 7.0 2.0 113 NaN 83.0 5.0 2.0 114 NaN 80.0 4.0 2.0 115 NaN 70.0 2.0 NaN 116 NaN 70.0 3.0 2.0 117 NaN 66.0 6.0 5.0 118 NaN 62.0 NaN NaN 119 NaN 60.0 NaN NaN 120 NaN 59.0 NaN NaN 121 NaN 29.0 NaN NaN Deaths_Nigeria Deaths_Senegal Deaths_UnitedStates Deaths_Spain \ 0 NaN NaN NaN NaN 1 NaN NaN NaN NaN 2 NaN NaN NaN NaN 3 NaN NaN NaN NaN 4 NaN NaN NaN NaN 5 NaN NaN NaN NaN 6 NaN NaN NaN NaN 7 NaN NaN NaN NaN 8 NaN NaN NaN NaN 9 NaN NaN NaN NaN 10 NaN NaN NaN NaN 11 NaN NaN NaN NaN 12 NaN NaN NaN NaN 13 8.0 0.0 1.0 0.0 14 NaN NaN NaN NaN 15 8.0 0.0 1.0 0.0 16 NaN NaN NaN NaN 17 8.0 0.0 1.0 0.0 18 NaN NaN NaN NaN 19 8.0 0.0 1.0 0.0 20 8.0 0.0 1.0 0.0 21 NaN NaN NaN NaN 22 8.0 0.0 1.0 0.0 23 NaN NaN NaN NaN 24 8.0 0.0 1.0 0.0 25 NaN NaN NaN NaN 26 8.0 0.0 1.0 0.0 27 NaN NaN NaN NaN 28 8.0 0.0 1.0 0.0 29 NaN NaN NaN NaN .. ... ... ... ... 92 NaN NaN NaN NaN 93 NaN NaN NaN NaN 94 NaN NaN NaN NaN 95 NaN NaN NaN NaN 96 NaN NaN NaN NaN 97 NaN NaN NaN NaN 98 NaN NaN NaN NaN 99 NaN NaN NaN NaN 100 NaN NaN NaN NaN 101 NaN NaN NaN NaN 102 NaN NaN NaN NaN 103 NaN NaN NaN NaN 104 NaN NaN NaN NaN 105 NaN NaN NaN NaN 106 NaN NaN NaN NaN 107 NaN NaN NaN NaN 108 NaN NaN NaN NaN 109 NaN NaN NaN NaN 110 NaN NaN NaN NaN 111 NaN NaN NaN NaN 112 NaN NaN NaN NaN 113 NaN NaN NaN NaN 114 NaN NaN NaN NaN 115 NaN NaN NaN NaN 116 NaN NaN NaN NaN 117 NaN NaN NaN NaN 118 NaN NaN NaN NaN 119 NaN NaN NaN NaN 120 NaN NaN NaN NaN 121 NaN NaN NaN NaN Deaths_Mali 0 NaN 1 NaN 2 NaN 3 NaN 4 NaN 5 NaN 6 NaN 7 NaN 8 NaN 9 NaN 10 NaN 11 NaN 12 NaN 13 6.0 14 NaN 15 6.0 16 NaN 17 6.0 18 NaN 19 6.0 20 5.0 21 NaN 22 3.0 23 NaN 24 1.0 25 NaN 26 1.0 27 NaN 28 1.0 29 NaN .. ... 
92 NaN 93 NaN 94 NaN 95 NaN 96 NaN 97 NaN 98 NaN 99 NaN 100 NaN 101 NaN 102 NaN 103 NaN 104 NaN 105 NaN 106 NaN 107 NaN 108 NaN 109 NaN 110 NaN 111 NaN 112 NaN 113 NaN 114 NaN 115 NaN 116 NaN 117 NaN 118 NaN 119 NaN 120 NaN 121 NaN [122 rows x 18 columns]
4.1.7 tidy operation step 1: Series.str + melt = get new column
Series.str can be used to access the values of the Series as strings and to apply string methods to them. These are accessed like Series.str.<function/property>.
Series.str lets you handle a Series like a string, BUT the result is still a Series-like thing:
. Series ---.str---> StringMethods ---[.str method]---> Series
.                    NOT a String; it exposes methods such as
.                    .split()
.                    .swapcase()
.                    .upper()
.                    .lower()
.                    .get(num)
Note that Series.str.split() is still a Series; each element holds a list of strings, and you can get each string with Series.str.split().str.get(index).
ebola_tidy = pd.melt(ebola_non_tidy, id_vars=['Date', 'Day'], var_name='cd_country', value_name='count') split_country_list = ebola_tidy['cd_country'].str.split('_') # handle the Series like a string; the result is still a Series print ( type (ebola_tidy['cd_country'].str )) print ( type (split_country_list )) # type is Series print ( split_country_list ) status_values = split_country_list.str.get(0) country_values = split_country_list.str.get(1) print ( type(status_values) )
<class 'pandas.core.strings.StringMethods'> <class 'pandas.core.series.Series'> 0 [Cases, Guinea] 1 [Cases, Guinea] 2 [Cases, Guinea] 3 [Cases, Guinea] 4 [Cases, Guinea] 5 [Cases, Guinea] 6 [Cases, Guinea] 7 [Cases, Guinea] 8 [Cases, Guinea] 9 [Cases, Guinea] 10 [Cases, Guinea] 11 [Cases, Guinea] 12 [Cases, Guinea] 13 [Cases, Guinea] 14 [Cases, Guinea] 15 [Cases, Guinea] 16 [Cases, Guinea] 17 [Cases, Guinea] 18 [Cases, Guinea] 19 [Cases, Guinea] 20 [Cases, Guinea] 21 [Cases, Guinea] 22 [Cases, Guinea] 23 [Cases, Guinea] 24 [Cases, Guinea] 25 [Cases, Guinea] 26 [Cases, Guinea] 27 [Cases, Guinea] 28 [Cases, Guinea] 29 [Cases, Guinea] ... 1922 [Deaths, Mali] 1923 [Deaths, Mali] 1924 [Deaths, Mali] 1925 [Deaths, Mali] 1926 [Deaths, Mali] 1927 [Deaths, Mali] 1928 [Deaths, Mali] 1929 [Deaths, Mali] 1930 [Deaths, Mali] 1931 [Deaths, Mali] 1932 [Deaths, Mali] 1933 [Deaths, Mali] 1934 [Deaths, Mali] 1935 [Deaths, Mali] 1936 [Deaths, Mali] 1937 [Deaths, Mali] 1938 [Deaths, Mali] 1939 [Deaths, Mali] 1940 [Deaths, Mali] 1941 [Deaths, Mali] 1942 [Deaths, Mali] 1943 [Deaths, Mali] 1944 [Deaths, Mali] 1945 [Deaths, Mali] 1946 [Deaths, Mali] 1947 [Deaths, Mali] 1948 [Deaths, Mali] 1949 [Deaths, Mali] 1950 [Deaths, Mali] 1951 [Deaths, Mali] Name: cd_country, Length: 1952, dtype: object <class 'pandas.core.series.Series'>
4.1.8 tidy operation step 2: assign new column to original dataframe
- add this Series as a new column to the original dataframe:
df['columnName'] = Series
ebola_tidy['status'] = status_values ebola_tidy['country'] = country_values print (ebola_tidy)
Date Day cd_country count status country 0 1/5/2015 289 Cases_Guinea 2776.0 Cases Guinea 1 1/4/2015 288 Cases_Guinea 2775.0 Cases Guinea 2 1/3/2015 287 Cases_Guinea 2769.0 Cases Guinea 3 1/2/2015 286 Cases_Guinea NaN Cases Guinea 4 12/31/2014 284 Cases_Guinea 2730.0 Cases Guinea 5 12/28/2014 281 Cases_Guinea 2706.0 Cases Guinea 6 12/27/2014 280 Cases_Guinea 2695.0 Cases Guinea 7 12/24/2014 277 Cases_Guinea 2630.0 Cases Guinea 8 12/21/2014 273 Cases_Guinea 2597.0 Cases Guinea 9 12/20/2014 272 Cases_Guinea 2571.0 Cases Guinea 10 12/18/2014 271 Cases_Guinea NaN Cases Guinea 11 12/14/2014 267 Cases_Guinea 2416.0 Cases Guinea 12 12/9/2014 262 Cases_Guinea NaN Cases Guinea 13 12/7/2014 260 Cases_Guinea 2292.0 Cases Guinea 14 12/3/2014 256 Cases_Guinea NaN Cases Guinea 15 11/30/2014 253 Cases_Guinea 2164.0 Cases Guinea 16 11/28/2014 251 Cases_Guinea NaN Cases Guinea 17 11/23/2014 246 Cases_Guinea 2134.0 Cases Guinea 18 11/22/2014 245 Cases_Guinea NaN Cases Guinea 19 11/18/2014 241 Cases_Guinea 2047.0 Cases Guinea 20 11/16/2014 239 Cases_Guinea 1971.0 Cases Guinea 21 11/15/2014 238 Cases_Guinea NaN Cases Guinea 22 11/11/2014 234 Cases_Guinea 1919.0 Cases Guinea 23 11/10/2014 233 Cases_Guinea NaN Cases Guinea 24 11/9/2014 232 Cases_Guinea 1878.0 Cases Guinea 25 11/8/2014 231 Cases_Guinea NaN Cases Guinea 26 11/4/2014 227 Cases_Guinea NaN Cases Guinea 27 11/3/2014 226 Cases_Guinea 1760.0 Cases Guinea 28 11/2/2014 225 Cases_Guinea 1731.0 Cases Guinea 29 10/31/2014 222 Cases_Guinea NaN Cases Guinea ... ... ... ... ... ... 1922 5/23/2014 62 Deaths_Mali NaN Deaths Mali 1923 5/12/2014 51 Deaths_Mali NaN Deaths Mali 1924 5/10/2014 49 Deaths_Mali NaN Deaths Mali 1925 5/7/2014 46 Deaths_Mali NaN Deaths Mali 1926 5/5/2014 44 Deaths_Mali NaN Deaths Mali 1927 5/3/2014 42 Deaths_Mali NaN Deaths Mali 1928 5/1/2014 40 Deaths_Mali NaN Deaths Mali 1929 4/26/2014 35 Deaths_Mali NaN Deaths Mali 1930 4/24/2014 33 Deaths_Mali NaN Deaths Mali 1931 4/23/2014 32 Deaths_Mali NaN Deaths Mali 1932 4/22/2014 31 Deaths_Mali NaN Deaths Mali 1933 4/21/2014 30 Deaths_Mali NaN Deaths Mali 1934 4/20/2014 29 Deaths_Mali NaN Deaths Mali 1935 4/17/2014 26 Deaths_Mali NaN Deaths Mali 1936 4/16/2014 25 Deaths_Mali NaN Deaths Mali 1937 4/15/2014 24 Deaths_Mali NaN Deaths Mali 1938 4/14/2014 23 Deaths_Mali NaN Deaths Mali 1939 4/11/2014 20 Deaths_Mali NaN Deaths Mali 1940 4/9/2014 18 Deaths_Mali NaN Deaths Mali 1941 4/7/2014 16 Deaths_Mali NaN Deaths Mali 1942 4/4/2014 13 Deaths_Mali NaN Deaths Mali 1943 4/1/2014 10 Deaths_Mali NaN Deaths Mali 1944 3/31/2014 9 Deaths_Mali NaN Deaths Mali 1945 3/29/2014 7 Deaths_Mali NaN Deaths Mali 1946 3/28/2014 6 Deaths_Mali NaN Deaths Mali 1947 3/27/2014 5 Deaths_Mali NaN Deaths Mali 1948 3/26/2014 4 Deaths_Mali NaN Deaths Mali 1949 3/25/2014 3 Deaths_Mali NaN Deaths Mali 1950 3/24/2014 2 Deaths_Mali NaN Deaths Mali 1951 3/22/2014 0 Deaths_Mali NaN Deaths Mali [1952 rows x 6 columns]
4.1.9 Series.str model
4.1.10 a general method using series.str
+ df.melt
A general method for using Series.str is:
- handle the Series as a string to get a new, better Series: Series ---.str---> StringMethods ---[.str method]---> Series
- add this Series as a new column to the original dataframe: df['columnName'] = Series
. df[column].str -----stringLikeMethod-----> new Series
.                                                |
.                                                |
.                                                v
. df['giveName'] = ______________________________
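A self-contained sketch of this pattern on a tiny, made-up dataframe (the names demo, code and status are hypothetical, just to show the df[column].str -> new Series -> df['newName'] flow):

demo = pd.DataFrame({'code': ['Cases_Guinea', 'Deaths_Mali']})
new_series = demo['code'].str.split('_').str.get(0)   # Series --.str--> ... --> Series
demo['status'] = new_series                           # assign it back as a new column
print(demo)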
4.1.11 combine multiple .str
model expressions
Series.str can be used to access the values of the Series as strings and to apply string methods to them. These are accessed like Series.str.<function/property>.
You can combine multiple expressions, because both the head and the tail of a .str model are Series objects.
. ~Series~ ---.str---> ~StringMethods~ ---[.str method]---> ~Series~ ---.str---> ~StringMethods~ etc.
.    |                 *NOT a String*                          |
.    |                 .split()   .swapcase()                  |
.    v                 .upper()   .lower()   .get(num)         v
.  *head*                                                    *tail* and another *head*
Note that ~Series.str.split()~ is still a Series, although it has *a list of strings inside*; you can get each string by ~Series.str.split().str.get(index)~.
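A small sketch of such chaining on made-up data: every .str step ends in a Series again, so the expressions compose.

s = pd.Series(['Cases_Guinea', 'Deaths_Mali'])
# Series -> .str.split -> Series of lists -> .str.get -> Series -> .str.upper -> Series
print(s.str.split('_').str.get(1).str.upper())
# 0    GUINEA
# 1      MALI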
4.1.12 .str.split('_', expand=True)
method | return |
---|---|
.str.split('_') | Series |
.str.split('_', expand=True) | DataFrame |
split('_', expand=True)
: get a DataFrame
with the two split strings separated, each in its own column
0 1 0 Cases Guinea 1 Cases Guinea 2 Cases Guinea 3 Cases Guinea 4 Cases Guinea 5 Cases Guinea 6 Cases Guinea 7 Cases Guinea 8 Cases Guinea
split('_')
: get a Series
with a list of strings inside each element, the whole thing as one column
0 [Cases, Guinea] 1 [Cases, Guinea] 2 [Cases, Guinea] 3 [Cases, Guinea] 4 [Cases, Guinea] 5 [Cases, Guinea] 6 [Cases, Guinea] 7 [Cases, Guinea] 8 [Cases, Guinea]
# variable_split = ebola_tidy['cd_country'].str.split('_') variable_split = ebola_tidy['cd_country'].str.split('_', expand=True) print ( type(variable_split) ) # print (variable_split) variable_split.columns = ['status1', 'country1'] # print (variable_split) print ( variable_split.head() ) print ( ebola_tidy.head() ) ebola_clean1 = pd.concat([ebola_tidy, variable_split], axis=1) ebola_clean2 = pd.concat([ebola_tidy, variable_split]) print ( ebola_clean1.head(10) ) print ( ebola_clean1.tail(10) ) print ( ebola_clean2.head(10) ) print ( ebola_clean2.tail(10) )
<class 'pandas.core.frame.DataFrame'> status1 country1 0 Cases Guinea 1 Cases Guinea 2 Cases Guinea 3 Cases Guinea 4 Cases Guinea Date Day cd_country count status country 0 1/5/2015 289 Cases_Guinea 2776.0 Cases Guinea 1 1/4/2015 288 Cases_Guinea 2775.0 Cases Guinea 2 1/3/2015 287 Cases_Guinea 2769.0 Cases Guinea 3 1/2/2015 286 Cases_Guinea NaN Cases Guinea 4 12/31/2014 284 Cases_Guinea 2730.0 Cases Guinea Date Day cd_country count status country status1 country1 0 1/5/2015 289 Cases_Guinea 2776.0 Cases Guinea Cases Guinea 1 1/4/2015 288 Cases_Guinea 2775.0 Cases Guinea Cases Guinea 2 1/3/2015 287 Cases_Guinea 2769.0 Cases Guinea Cases Guinea 3 1/2/2015 286 Cases_Guinea NaN Cases Guinea Cases Guinea 4 12/31/2014 284 Cases_Guinea 2730.0 Cases Guinea Cases Guinea 5 12/28/2014 281 Cases_Guinea 2706.0 Cases Guinea Cases Guinea 6 12/27/2014 280 Cases_Guinea 2695.0 Cases Guinea Cases Guinea 7 12/24/2014 277 Cases_Guinea 2630.0 Cases Guinea Cases Guinea 8 12/21/2014 273 Cases_Guinea 2597.0 Cases Guinea Cases Guinea 9 12/20/2014 272 Cases_Guinea 2571.0 Cases Guinea Cases Guinea Date Day cd_country count status country status1 country1 1942 4/4/2014 13 Deaths_Mali NaN Deaths Mali Deaths Mali 1943 4/1/2014 10 Deaths_Mali NaN Deaths Mali Deaths Mali 1944 3/31/2014 9 Deaths_Mali NaN Deaths Mali Deaths Mali 1945 3/29/2014 7 Deaths_Mali NaN Deaths Mali Deaths Mali 1946 3/28/2014 6 Deaths_Mali NaN Deaths Mali Deaths Mali 1947 3/27/2014 5 Deaths_Mali NaN Deaths Mali Deaths Mali 1948 3/26/2014 4 Deaths_Mali NaN Deaths Mali Deaths Mali 1949 3/25/2014 3 Deaths_Mali NaN Deaths Mali Deaths Mali 1950 3/24/2014 2 Deaths_Mali NaN Deaths Mali Deaths Mali 1951 3/22/2014 0 Deaths_Mali NaN Deaths Mali Deaths Mali Date Day cd_country count country country1 status status1 0 1/5/2015 289.0 Cases_Guinea 2776.0 Guinea NaN Cases NaN 1 1/4/2015 288.0 Cases_Guinea 2775.0 Guinea NaN Cases NaN 2 1/3/2015 287.0 Cases_Guinea 2769.0 Guinea NaN Cases NaN 3 1/2/2015 286.0 Cases_Guinea NaN Guinea NaN Cases NaN 4 12/31/2014 284.0 Cases_Guinea 2730.0 Guinea NaN Cases NaN 5 12/28/2014 281.0 Cases_Guinea 2706.0 Guinea NaN Cases NaN 6 12/27/2014 280.0 Cases_Guinea 2695.0 Guinea NaN Cases NaN 7 12/24/2014 277.0 Cases_Guinea 2630.0 Guinea NaN Cases NaN 8 12/21/2014 273.0 Cases_Guinea 2597.0 Guinea NaN Cases NaN 9 12/20/2014 272.0 Cases_Guinea 2571.0 Guinea NaN Cases NaN Date Day cd_country count country country1 status status1 1942 NaN NaN NaN NaN NaN Mali NaN Deaths 1943 NaN NaN NaN NaN NaN Mali NaN Deaths 1944 NaN NaN NaN NaN NaN Mali NaN Deaths 1945 NaN NaN NaN NaN NaN Mali NaN Deaths 1946 NaN NaN NaN NaN NaN Mali NaN Deaths 1947 NaN NaN NaN NaN NaN Mali NaN Deaths 1948 NaN NaN NaN NaN NaN Mali NaN Deaths 1949 NaN NaN NaN NaN NaN Mali NaN Deaths 1950 NaN NaN NaN NaN NaN Mali NaN Deaths 1951 NaN NaN NaN NaN NaN Mali NaN Deaths
4.1.13 how to rename the column name of a dataframe
0 1 0 Cases Guinea 1 Cases Guinea 2 Cases Guinea 3 Cases Guinea 4 Cases Guinea 5 Cases Guinea
variable_split.columns = ['status1', 'country1']
status1 country1 0 Cases Guinea 1 Cases Guinea 2 Cases Guinea 3 Cases Guinea 4 Cases Guinea
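Assigning to .columns replaces all the names at once and must supply every name; df.rename with a mapping is an alternative when only some names should change. A sketch, with arbitrary new names:

# rename only selected columns by mapping old name -> new name
renamed = variable_split.rename(columns={'status1': 'status', 'country1': 'country'})
print(renamed.head())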
4.1.14 non-tidy, exp4: enumerate all days in a month as column name
this dataset lists all days, d1 through d31, as column names for each month; it is not tidy
id year month element d1 d2 d3 d4 d5 d6 ... \ 0 MX17004 2010 1 tmax NaN NaN NaN NaN NaN NaN ... 1 MX17004 2010 1 tmin NaN NaN NaN NaN NaN NaN ... 2 MX17004 2010 2 tmax NaN 27.3 24.1 NaN NaN NaN ... 3 MX17004 2010 2 tmin NaN 14.4 14.4 NaN NaN NaN ...
we should clean the format so that:
- all columns d1 to d31 are melted into values of a day column
- tmax and tmin become the column names
4.1.15 0. read in dataset to be a dataframe
weather = pd.read_csv(file_path + 'weather.csv') print ( weather )
id year month element d1 d2 d3 d4 d5 d6 ... \ 0 MX17004 2010 1 tmax NaN NaN NaN NaN NaN NaN ... 1 MX17004 2010 1 tmin NaN NaN NaN NaN NaN NaN ... 2 MX17004 2010 2 tmax NaN 27.3 24.1 NaN NaN NaN ... 3 MX17004 2010 2 tmin NaN 14.4 14.4 NaN NaN NaN ... 4 MX17004 2010 3 tmax NaN NaN NaN NaN 32.1 NaN ... 5 MX17004 2010 3 tmin NaN NaN NaN NaN 14.2 NaN ... 6 MX17004 2010 4 tmax NaN NaN NaN NaN NaN NaN ... 7 MX17004 2010 4 tmin NaN NaN NaN NaN NaN NaN ... 8 MX17004 2010 5 tmax NaN NaN NaN NaN NaN NaN ... 9 MX17004 2010 5 tmin NaN NaN NaN NaN NaN NaN ... 10 MX17004 2010 6 tmax NaN NaN NaN NaN NaN NaN ... 11 MX17004 2010 6 tmin NaN NaN NaN NaN NaN NaN ... 12 MX17004 2010 7 tmax NaN NaN 28.6 NaN NaN NaN ... 13 MX17004 2010 7 tmin NaN NaN 17.5 NaN NaN NaN ... 14 MX17004 2010 8 tmax NaN NaN NaN NaN 29.6 NaN ... 15 MX17004 2010 8 tmin NaN NaN NaN NaN 15.8 NaN ... 16 MX17004 2010 10 tmax NaN NaN NaN NaN 27.0 NaN ... 17 MX17004 2010 10 tmin NaN NaN NaN NaN 14.0 NaN ... 18 MX17004 2010 11 tmax NaN 31.3 NaN 27.2 26.3 NaN ... 19 MX17004 2010 11 tmin NaN 16.3 NaN 12.0 7.9 NaN ... 20 MX17004 2010 12 tmax 29.9 NaN NaN NaN NaN 27.8 ... 21 MX17004 2010 12 tmin 13.8 NaN NaN NaN NaN 10.5 ... d22 d23 d24 d25 d26 d27 d28 d29 d30 d31 0 NaN NaN NaN NaN NaN NaN NaN NaN 27.8 NaN 1 NaN NaN NaN NaN NaN NaN NaN NaN 14.5 NaN 2 NaN 29.9 NaN NaN NaN NaN NaN NaN NaN NaN 3 NaN 10.7 NaN NaN NaN NaN NaN NaN NaN NaN 4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 6 NaN NaN NaN NaN NaN 36.3 NaN NaN NaN NaN 7 NaN NaN NaN NaN NaN 16.7 NaN NaN NaN NaN 8 NaN NaN NaN NaN NaN 33.2 NaN NaN NaN NaN 9 NaN NaN NaN NaN NaN 18.2 NaN NaN NaN NaN 10 NaN NaN NaN NaN NaN NaN NaN 30.1 NaN NaN 11 NaN NaN NaN NaN NaN NaN NaN 18.0 NaN NaN 12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 13 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 14 NaN 26.4 NaN 29.7 NaN NaN NaN 28.0 NaN 25.4 15 NaN 15.0 NaN 15.6 NaN NaN NaN 15.3 NaN 15.4 16 NaN NaN NaN NaN NaN NaN 31.2 NaN NaN NaN 17 NaN NaN NaN NaN NaN NaN 15.0 NaN NaN NaN 18 NaN NaN NaN NaN 28.1 27.7 NaN NaN NaN NaN 19 NaN NaN NaN NaN 12.1 14.2 NaN NaN NaN NaN 20 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 21 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN [22 rows x 35 columns]
4.1.16 1. melt all columns d1 to d31 as values of column day
weather_melt = pd.melt(weather, id_vars=['id', 'year', 'month', 'element'], var_name='day', value_name='temp') print ( weather_melt.head() )
id year month element day temp 0 MX17004 2010 1 tmax d1 NaN 1 MX17004 2010 1 tmin d1 NaN 2 MX17004 2010 2 tmax d1 NaN 3 MX17004 2010 2 tmin d1 NaN 4 MX17004 2010 3 tmax d1 NaN
4.1.17 2. tmax and tmin the column name
The index parameter of pd.pivot_table() is like the id_vars parameter of pd.melt(): it specifies the columns you don't want to change.
weather_tidy = weather_melt.pivot_table(index=['id', 'year', 'month', 'day'], columns='element', values='temp').reset_index() print ( weather_tidy )
element id year month day tmax tmin 0 MX17004 2010 1 d1 NaN NaN 1 MX17004 2010 1 d10 NaN NaN 2 MX17004 2010 1 d11 NaN NaN 3 MX17004 2010 1 d12 NaN NaN 4 MX17004 2010 1 d13 NaN NaN 5 MX17004 2010 1 d14 NaN NaN 6 MX17004 2010 1 d15 NaN NaN 7 MX17004 2010 1 d16 NaN NaN 8 MX17004 2010 1 d17 NaN NaN 9 MX17004 2010 1 d18 NaN NaN 10 MX17004 2010 1 d19 NaN NaN 11 MX17004 2010 1 d2 NaN NaN 12 MX17004 2010 1 d20 NaN NaN 13 MX17004 2010 1 d21 NaN NaN 14 MX17004 2010 1 d22 NaN NaN 15 MX17004 2010 1 d23 NaN NaN 16 MX17004 2010 1 d24 NaN NaN 17 MX17004 2010 1 d25 NaN NaN 18 MX17004 2010 1 d26 NaN NaN 19 MX17004 2010 1 d27 NaN NaN 20 MX17004 2010 1 d28 NaN NaN 21 MX17004 2010 1 d29 NaN NaN 22 MX17004 2010 1 d3 NaN NaN 23 MX17004 2010 1 d30 27.8 14.5 24 MX17004 2010 1 d31 NaN NaN 25 MX17004 2010 1 d4 NaN NaN 26 MX17004 2010 1 d5 NaN NaN 27 MX17004 2010 1 d6 NaN NaN 28 MX17004 2010 1 d7 NaN NaN 29 MX17004 2010 1 d8 NaN NaN .. ... ... ... ... ... ... 311 MX17004 2010 12 d10 NaN NaN 312 MX17004 2010 12 d11 NaN NaN 313 MX17004 2010 12 d12 NaN NaN 314 MX17004 2010 12 d13 NaN NaN 315 MX17004 2010 12 d14 NaN NaN 316 MX17004 2010 12 d15 NaN NaN 317 MX17004 2010 12 d16 NaN NaN 318 MX17004 2010 12 d17 NaN NaN 319 MX17004 2010 12 d18 NaN NaN 320 MX17004 2010 12 d19 NaN NaN 321 MX17004 2010 12 d2 NaN NaN 322 MX17004 2010 12 d20 NaN NaN 323 MX17004 2010 12 d21 NaN NaN 324 MX17004 2010 12 d22 NaN NaN 325 MX17004 2010 12 d23 NaN NaN 326 MX17004 2010 12 d24 NaN NaN 327 MX17004 2010 12 d25 NaN NaN 328 MX17004 2010 12 d26 NaN NaN 329 MX17004 2010 12 d27 NaN NaN 330 MX17004 2010 12 d28 NaN NaN 331 MX17004 2010 12 d29 NaN NaN 332 MX17004 2010 12 d3 NaN NaN 333 MX17004 2010 12 d30 NaN NaN 334 MX17004 2010 12 d31 NaN NaN 335 MX17004 2010 12 d4 NaN NaN 336 MX17004 2010 12 d5 NaN NaN 337 MX17004 2010 12 d6 27.8 10.5 338 MX17004 2010 12 d7 NaN NaN 339 MX17004 2010 12 d8 NaN NaN 340 MX17004 2010 12 d9 NaN NaN [341 rows x 6 columns]
4.1.18 what is pd.pivot_table
pivot_table is roughly the reverse process of melt:
- the column you pass as the columns parameter of pivot_table is the pivot: one new column is created for each of its distinct values. e.g. column element has 2 distinct values, 'tmax' and 'tmin', so two new columns 'tmax' and 'tmin' are added.
- the column you pass as the values parameter of pivot_table supplies the values for the new columns from the first step. e.g. column temp: its values fill the new columns 'tmax' and 'tmin'.
pivot_table(index=['id', 'year', 'month', 'day'], columns='element', values='temp')
.     id      year  month  day  temp        element
. 0   MX17004 2010  1      d1   NaN---+        tmax
. 1   MX17004 2010  1      d1   NaN---|------+ tmin
. 2   MX17004 2010  2      d1   NaN   |      | tmax
. 3   MX17004 2010  2      d1   NaN   |      | tmin
. 4   MX17004 2010  3      d1   NaN   |      | tmax
.                                     |      |
.                                     |      |
.     id      year  month  day   tmax |      |  tmin
. 0   MX17004 2010  1      d1    NaN<-+      +-> NaN
. 1   MX17004 2010  1      d10   NaN             NaN
. 2   MX17004 2010  1      d11   NaN             NaN
. 3   MX17004 2010  1      d12   NaN             NaN
. 4   MX17004 2010  1      d13   NaN             NaN
. 5   MX17004 2010  1      d14   NaN             NaN
. 6   MX17004 2010  1      d15   NaN             NaN
. 7   MX17004 2010  1      d16   NaN             NaN
. 8   MX17004 2010  1      d17   NaN             NaN
. 9   MX17004 2010  1      d18   NaN             NaN
Note that:
- pivot_table makes 2 columns disappear: from one it extracts the new column names, from the other it extracts the values of those new columns.
- pivot_table creates several new columns; the exact number is decided by how many distinct values are in the column from which you extract the new column names.
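A minimal sketch of both points on made-up data: element and temp disappear, and one new column appears for each distinct value of element.

toy = pd.DataFrame({'day':     ['d1', 'd1', 'd2', 'd2'],
                    'element': ['tmax', 'tmin', 'tmax', 'tmin'],
                    'temp':    [27.8, 14.5, 29.9, 10.7]})
print(toy.pivot_table(index='day', columns='element', values='temp').reset_index())
# element day  tmax  tmin
# 0        d1  27.8  14.5
# 1        d2  29.9  10.7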
4.1.19 pivot_table and melt
These two methods are the most important tools for tidying data. They can solve almost 80% of your data-cleaning problems.
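A small sketch of how they are roughly inverse operations, on made-up data: melt takes the wide table long, pivot_table takes it back (numeric columns come back as floats because pivot_table aggregates with mean by default).

wide = pd.DataFrame({'name': ['a', 'b'], 'x': [1, 2], 'y': [3, 4]})
long = pd.melt(wide, id_vars='name', var_name='var', value_name='val')
wide_again = long.pivot_table(index='name', columns='var', values='val').reset_index()
print(wide_again)   # same content as `wide`, recovered from the long form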
5 Data types, part-V
5.1 intro Category data type
category in the output of df.info() is a data type; it looks like a String type, but is essentially very different.
<class 'pandas.core.frame.DataFrame'> RangeIndex: 244 entries, 0 to 243 Data columns (total 7 columns): total_bill 244 non-null float64 tip 244 non-null float64 sex 244 non-null *category* <<< smoker 244 non-null *category* <<< day 244 non-null *category* <<< time 244 non-null *category* <<< size 244 non-null int64 dtypes: category(4), float64(2), int64(1) memory usage: 7.2 KB
The category data type can save a lot of memory.
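A minimal sketch of the saving on a made-up column (exact numbers depend on the data):

s_obj = pd.Series(['Male', 'Female'] * 100000)   # plain object (string) column
s_cat = s_obj.astype('category')                 # same data stored as category
print(s_obj.memory_usage(deep=True))             # bytes for 200000 Python strings
print(s_cat.memory_usage(deep=True))             # far smaller: small integer codes + 2 categories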
import seaborn as sns tips = sns.load_dataset('tips') print ( tips.head() ) print ( tips.info() )
/tmp/babel-250413Wz/python-25041w1n in <module>() ----> 1 import seaborn as sns 2 tips = sns.load_dataset('tips') 3 print ( tips.head() ) 4 print ( tips.info() ) ModuleNotFoundError: No module named 'seaborn'
5.1.1 type conversion when doing operations on a column
total_bill *float64* changes to total_bill *object*
How does this happen?
tips_sub_miss.loc[[1, 3, 5, 6], 'total_bill'] = 'missing'
Because you change some values of `total_bill` from float to String, the column falls back to the lowest common type: object.
# pd.Categorical() tips['total_bill_str'] = tips['total_bill'].astype(str) # type conversion happens tips.info() tips_sub_miss = tips.head(10).copy() # take a copy so the assignment below does not trigger SettingWithCopyWarning tips_sub_miss.loc[[1, 3, 5, 6], 'total_bill'] = 'missing' print ( tips_sub_miss ) print ( tips_sub_miss.info() )
<class 'pandas.core.frame.DataFrame'> RangeIndex: 244 entries, 0 to 243 Data columns (total 8 columns): total_bill 244 non-null float64 tip 244 non-null float64 sex 244 non-null category smoker 244 non-null category day 244 non-null category time 244 non-null category size 244 non-null int64 total_bill_str 244 non-null object dtypes: category(4), float64(2), int64(1), object(1) memory usage: 9.1+ KB total_bill tip sex smoker day time size total_bill_str 0 16.99 1.01 Female No Sun Dinner 2 16.99 1 missing 1.66 Male No Sun Dinner 3 10.34 2 21.01 3.50 Male No Sun Dinner 3 21.01 3 missing 3.31 Male No Sun Dinner 2 23.68 4 24.59 3.61 Female No Sun Dinner 4 24.59 5 missing 4.71 Male No Sun Dinner 4 25.29 6 missing 2.00 Male No Sun Dinner 2 8.77 7 26.88 3.12 Male No Sun Dinner 4 26.88 8 15.04 1.96 Male No Sun Dinner 2 15.04 9 14.78 3.23 Male No Sun Dinner 2 14.78 <class 'pandas.core.frame.DataFrame'> RangeIndex: 10 entries, 0 to 9 Data columns (total 8 columns): total_bill 10 non-null object tip 10 non-null float64 sex 10 non-null category smoker 10 non-null category day 10 non-null category time 10 non-null category size 10 non-null int64 total_bill_str 10 non-null object dtypes: category(4), float64(1), int64(1), object(2) memory usage: 920.0+ bytes None
5.1.2 how to do type conversion on a column
series.astype(<type name>)
pd.to_numeric(seriesRef, errors=<one of 3 choices>)
- errors = 'raise', raise an error when an incompatible conversion is encountered (the default)
- errors = 'ignore', when a conversion fails, return the input unchanged and do not raise an error
- errors = 'coerce', force the conversion; values that cannot be converted become NaN
# pd.to_numeric( tips_sub_miss['total_bill'], errors='raise') # pd.to_numeric( tips_sub_miss['total_bill'], errors='ignore') xxx = pd.to_numeric( tips_sub_miss['total_bill'], errors='coerce') print ( xxx )
0 16.99 1 NaN 2 21.01 3 NaN 4 24.59 5 NaN 6 NaN 7 26.88 8 15.04 9 14.78 Name: total_bill, dtype: float64
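For comparison, series.astype(<type name>) also converts, but it raises as soon as one value cannot be converted; it has no 'coerce' option like pd.to_numeric. A sketch on the columns defined above (assuming tips loaded successfully):

print(tips['total_bill_str'].astype(float).head())   # object -> float64: works, every value is a numeric string
# tips_sub_miss['total_bill'].astype(float)          # would raise: 'missing' cannot be converted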