Pandas-笔记2
Table of Contents
- 1. Apply, part-VI
- 1.1. define some complex computation
- 1.2. Titanic data statistics
- 1.2.1. read in data from seaborn dataset
- 1.2.2. sum up the missing value of each column
- 1.2.3. filter out the rows which has a non_NaN value in column 'embark_town'
- 1.2.4. sum up the missing value of each row
- 1.2.5. sum up the number of differnt numbers of each row
- 1.2.6. sum up the prop_complete of differnt numbers of each row
- 1.3. pandas filter vs. numpy filter
- 1.4. how to count the number of "True" of a boolean list
- 1.5. how to count the number of specified value in a column
- 1.6. pd.isnull() => boolean Series
- 1.7. do some exercise, for fun
1 Apply, part-VI
Apply, this will happens at the time like:
- if you're pivoting tables, but that's not enough to clean your data.
- for each row of your data, you want to perform a more complex calculation, other than adding two columns or calculate the mean of the row.
Then, you can def
the function you want, then using apply(funcName)
on
Series(row/column)
Note that, mathOperation(Series) => Series
is very important for
combination, it keeps entry-type and exit-type the same.
1.1 define some complex computation
import pandas as pd def my_function(): # indent 4 spaces # function code pass def my_sq(x): """squares a given value """ return x**2 def avg_2(x, y): """calcuate average of 2 numbers """ return (x+y)/2 print ( my_sq(2) ) print ( my_sq(4) ) print ( avg_2(4, 9) )
4 16 6.5
1.1.1 mathematical operation on Series is element-wise
df = pd.DataFrame({ 'a' : [10, 20, 30], 'b' : [20, 30, 40] }) print ( df ) print ( df['a'] ** 2) ## element-wise squaring
a b 0 10 20 1 20 30 2 30 40 0 100 1 400 2 900 Name: a, dtype: int64
1.1.2 apply user-defined function on Series
Apply on the Series is default element-wise
operation.
. | a | . +----+ . | 10 | . apply ( my_sq ) ---> my_sq( 10 ) , my_sq( 20 ) , my_sq( 30 ) . | 20 | . | 30 | . \---------/ \---------/ \---------/ . | | | . v | | . 100.0 | | . a new Series 400.0 <----------+ | . with shape = (3,) 900.0 <-------------------------+
print ( df['a'] ) print ( df['a'].apply(my_sq)) ## apply function ~my_sq~ element-wise
0 10 1 20 2 30 Name: a, dtype: int64 0 100 1 400 2 900 Name: a, dtype: int64
1.1.3 apply user-defined function on DataFrame, column-wise
apply on the DataFrame is default column-wise
operation.
What actually happends when you do apply
on an df
, it's default apply
the function on each column, means the column is passed into the 1st
argument of the function.
. | | a | b | | a | | b | . +---+----+----+ +----+ +----+ . | 0 | 10 | 20 | . apply ( avg_3 ) ---> avg_3( | 10 | ) , avg_3( | 20 | ) . | 1 | 20 | 30 | | 20 | | 30 | . | 2 | 30 | 40 | | 30 | | 40 | . \-------------/ \-------------/ . | | . v | . 20.0 | . a new Series 30.0 <--------------------+ . with shape = (2,)
def print_me(x): print(x) df.apply(print_me)
0 10 1 20 2 30 Name: a, dtype: int64 0 20 1 30 2 40 Name: b, dtype: int64
1.1.4 xxx
def avg_3(x, y, z): """avg of 3 numbers """ return (x+y+z)/3 df.apply(avg_3) # wrong
Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/tmp/babel-32610Fla/python-32610oRZ", line 6, in <module> df.apply(avg_3) # wrong File "/home/yiddi/anaconda3/envs/tensorflow/lib/python3.6/site-packages/pandas/core/frame.py", line 4262, in apply ignore_failures=ignore_failures) File "/home/yiddi/anaconda3/envs/tensorflow/lib/python3.6/site-packages/pandas/core/frame.py", line 4358, in _apply_standard results[i] = func(v) TypeError: ("avg_3() missing 2 required positional arguments: 'y' and 'z'", 'occurred at index a')
This caused some error, what actually happends when you do apply
on an df
, it's default apply the function on each column,
means the column is passed into the 1st argument of the function.
. | | a | b | | a | | b | . +---+----+----+ +----+ +----+ . | 0 | 10 | 20 | . apply ( avg_3 ) ---> avg_3( | 10 | ) , avg_3( | 20 | ) . | 1 | 20 | 30 | | 20 | | 30 | . | 2 | 30 | 40 | | 30 | | 40 | . \-------------/ \-------------/ . | | . v | . 20.0 | . a new Series 30.0 <--------------------+ . with shape = (2,)
def avg_3(col): """avg of 3 numbers """ x=col[0] y=col[1] z=col[2] return (x+y+z)/3 avg_result = df.apply(avg_3) # right print ( df ) print ( avg_result ) print ( avg_result.shape ) print ( type(avg_result) )
a b 0 10 20 1 20 30 2 30 40 a 20.0 b 30.0 dtype: float64 (2,) <class 'pandas.core.series.Series'>
1.1.5 apply user-defined function on DataFrame, row-wise
add parameter axis = 1
in the apply
will give you the row-wise operation.
def avg_2(row): """avg of 3 numbers """ x=row[0] y=row[1] return (x+y)/2 row_avg = df.apply(avg_2, axis=1) print ( df ) print ( row_avg )
a b 0 10 20 1 20 30 2 30 40 0 15.0 1 25.0 2 35.0 dtype: float64
1.2 Titanic data statistics
1.2.1 read in data from seaborn dataset
import seaborn as sns titanic = sns.load_dataset('titanic') print ( titanic ) titanic.info()
survived pclass sex age sibsp parch fare embarked class \ 0 0 3 male 22.0 1 0 7.2500 S Third 1 1 1 female 38.0 1 0 71.2833 C First 2 1 3 female 26.0 0 0 7.9250 S Third 3 1 1 female 35.0 1 0 53.1000 S First 4 0 3 male 35.0 0 0 8.0500 S Third 5 0 3 male NaN 0 0 8.4583 Q Third 6 0 1 male 54.0 0 0 51.8625 S First 7 0 3 male 2.0 3 1 21.0750 S Third 8 1 3 female 27.0 0 2 11.1333 S Third 9 1 2 female 14.0 1 0 30.0708 C Second 10 1 3 female 4.0 1 1 16.7000 S Third 11 1 1 female 58.0 0 0 26.5500 S First 12 0 3 male 20.0 0 0 8.0500 S Third 13 0 3 male 39.0 1 5 31.2750 S Third 14 0 3 female 14.0 0 0 7.8542 S Third 15 1 2 female 55.0 0 0 16.0000 S Second 16 0 3 male 2.0 4 1 29.1250 Q Third 17 1 2 male NaN 0 0 13.0000 S Second 18 0 3 female 31.0 1 0 18.0000 S Third 19 1 3 female NaN 0 0 7.2250 C Third 20 0 2 male 35.0 0 0 26.0000 S Second 21 1 2 male 34.0 0 0 13.0000 S Second 22 1 3 female 15.0 0 0 8.0292 Q Third 23 1 1 male 28.0 0 0 35.5000 S First 24 0 3 female 8.0 3 1 21.0750 S Third 25 1 3 female 38.0 1 5 31.3875 S Third 26 0 3 male NaN 0 0 7.2250 C Third 27 0 1 male 19.0 3 2 263.0000 S First 28 1 3 female NaN 0 0 7.8792 Q Third 29 0 3 male NaN 0 0 7.8958 S Third .. ... ... ... ... ... ... ... ... ... 861 0 2 male 21.0 1 0 11.5000 S Second 862 1 1 female 48.0 0 0 25.9292 S First 863 0 3 female NaN 8 2 69.5500 S Third 864 0 2 male 24.0 0 0 13.0000 S Second 865 1 2 female 42.0 0 0 13.0000 S Second 866 1 2 female 27.0 1 0 13.8583 C Second 867 0 1 male 31.0 0 0 50.4958 S First 868 0 3 male NaN 0 0 9.5000 S Third 869 1 3 male 4.0 1 1 11.1333 S Third 870 0 3 male 26.0 0 0 7.8958 S Third 871 1 1 female 47.0 1 1 52.5542 S First 872 0 1 male 33.0 0 0 5.0000 S First 873 0 3 male 47.0 0 0 9.0000 S Third 874 1 2 female 28.0 1 0 24.0000 C Second 875 1 3 female 15.0 0 0 7.2250 C Third 876 0 3 male 20.0 0 0 9.8458 S Third 877 0 3 male 19.0 0 0 7.8958 S Third 878 0 3 male NaN 0 0 7.8958 S Third 879 1 1 female 56.0 0 1 83.1583 C First 880 1 2 female 25.0 0 1 26.0000 S Second 881 0 3 male 33.0 0 0 7.8958 S Third 882 0 3 female 22.0 0 0 10.5167 S Third 883 0 2 male 28.0 0 0 10.5000 S Second 884 0 3 male 25.0 0 0 7.0500 S Third 885 0 3 female 39.0 0 5 29.1250 Q Third 886 0 2 male 27.0 0 0 13.0000 S Second 887 1 1 female 19.0 0 0 30.0000 S First 888 0 3 female NaN 1 2 23.4500 S Third 889 1 1 male 26.0 0 0 30.0000 C First 890 0 3 male 32.0 0 0 7.7500 Q Third who adult_male deck embark_town alive alone 0 man True NaN Southampton no False 1 woman False C Cherbourg yes False 2 woman False NaN Southampton yes True 3 woman False C Southampton yes False 4 man True NaN Southampton no True 5 man True NaN Queenstown no True 6 man True E Southampton no True 7 child False NaN Southampton no False 8 woman False NaN Southampton yes False 9 child False NaN Cherbourg yes False 10 child False G Southampton yes False 11 woman False C Southampton yes True 12 man True NaN Southampton no True 13 man True NaN Southampton no False 14 child False NaN Southampton no True 15 woman False NaN Southampton yes True 16 child False NaN Queenstown no False 17 man True NaN Southampton yes True 18 woman False NaN Southampton no False 19 woman False NaN Cherbourg yes True 20 man True NaN Southampton no True 21 man True D Southampton yes True 22 child False NaN Queenstown yes True 23 man True A Southampton yes True 24 child False NaN Southampton no False 25 woman False NaN Southampton yes False 26 man True NaN Cherbourg no True 27 man True C Southampton no False 28 woman False NaN Queenstown yes True 29 man True NaN Southampton no True .. ... ... ... ... ... ... 861 man True NaN Southampton no False 862 woman False D Southampton yes True 863 woman False NaN Southampton no False 864 man True NaN Southampton no True 865 woman False NaN Southampton yes True 866 woman False NaN Cherbourg yes False 867 man True A Southampton no True 868 man True NaN Southampton no True 869 child False NaN Southampton yes False 870 man True NaN Southampton no True 871 woman False D Southampton yes False 872 man True B Southampton no True 873 man True NaN Southampton no True 874 woman False NaN Cherbourg yes False 875 child False NaN Cherbourg yes True 876 man True NaN Southampton no True 877 man True NaN Southampton no True 878 man True NaN Southampton no True 879 woman False C Cherbourg yes False 880 woman False NaN Southampton yes False 881 man True NaN Southampton no True 882 woman False NaN Southampton no True 883 man True NaN Southampton no True 884 man True NaN Southampton no True 885 woman False NaN Queenstown no False 886 man True NaN Southampton no True 887 woman False B Southampton yes True 888 woman False NaN Southampton no False 889 man True C Cherbourg yes True 890 man True NaN Queenstown no True [891 rows x 15 columns] <class 'pandas.core.frame.DataFrame'> RangeIndex: 891 entries, 0 to 890 Data columns (total 15 columns): survived 891 non-null int64 pclass 891 non-null int64 sex 891 non-null object age 714 non-null float64 sibsp 891 non-null int64 parch 891 non-null int64 fare 891 non-null float64 embarked 889 non-null object class 891 non-null category who 891 non-null object adult_male 891 non-null bool deck 203 non-null category embark_town 889 non-null object alive 891 non-null object alone 891 non-null bool dtypes: bool(2), category(2), float64(2), int64(4), object(5) memory usage: 80.6+ KB Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/tmp/babel-2374654r/python-23746Z2m", line 5, in <module> nullor = np.array([True, False, False, True]) NameError: name 'np' is not defined
1.2.2 sum up the missing value of each column
sum up the number of NaN
of each column, return a Series of shape (rowNum,)
import numpy as np import pandas as pd # count missing values def count_missing(vec): null_vec = pd.isnull(vec) null_count = np.sum(null_vec) return ( null_count ) def prop_missing(vec): num = count_missing(vec) dem = vec.size return num / dem def prop_complete(vec): return 1 - prop_missing(vec) nan_item_number = titanic.apply(count_missing) print ( nan_item_number )
survived 0 pclass 0 sex 0 age 177 sibsp 0 parch 0 fare 0 embarked 2 class 0 who 0 adult_male 0 deck 688 embark_town 2 alive 0 alone 0 dtype: int64 <class 'pandas.core.series.Series'>
1.2.3 filter out the rows which has a non_NaN value in column 'embark_town'
summary: pandas has a kind of filter same with numpy filter pandas has a another filter, which numpy dosn't has
[df].loc[boolSeries, :]
will get the all rows whose the item inboolSeries
column is True[df].loc[:, boolSeries]
will get the all columns whose the item inboolSeries
row is True
nan_row = titanic.loc[pd.isnull( titanic['embark_town']), :] print ( nan_row )
1.2.4 sum up the missing value of each row
number_nan_in_row = titanic.apply( count_missing, axis=1 ) # count each row, how many NaN print ( number_nan_in_row )
0 1 1 0 2 1 3 0 4 1 5 2 6 0 7 1 8 1 9 1 10 0 11 0 12 1 13 1 14 1 15 1 16 1 17 2 18 1 19 2 20 1 21 0 22 1 23 0 24 1 25 1 26 2 27 0 28 2 29 2 .. 861 1 862 0 863 2 864 1 865 1 866 1 867 0 868 2 869 1 870 1 871 0 872 0 873 1 874 1 875 1 876 1 877 1 878 2 879 0 880 1 881 1 882 1 883 1 884 1 885 1 886 1 887 0 888 2 889 0 890 1 Length: 891, dtype: int64
1.2.5 sum up the number of differnt numbers of each row
number_nan_in_row = titanic.apply( count_missing, axis=1 ) # count each row, how many NaN print ( number_nan_in_row.value_counts() )
1 549 0 182 2 160 dtype: int64
1.2.6 sum up the prop_complete of differnt numbers of each row
number_nan_in_row = titanic.apply( prop_complete, axis=1 ) # count each row, how many NaN print ( number_nan_in_row.value_counts() )
0.933333 549 1.000000 182 0.866667 160 dtype: int64
1.3 pandas filter vs. numpy filter
pandas has 2 kinds of filter | numpy has 1 kinds of filter |
---|---|
loc_filter: df.loc[mask, mask] |
arr_filter: arr[mask] |
df_filter: df[mask] |
mask: is some boolean operation on DataFram/npArray/Series
summary: pandas has a kind of filter same with numpy filter pandas has a another filter, which numpy dosn't has
[df].loc[boolSeries, :]
will get the rows whose the item inboolSeries
column is True
[df].loc[:, boolSeries]
will get the columns whose the item inboolSeries
row is True
. . ~titanic . loc [ pd.isnull( df['b'] ), : ]~ . . | | a | b | c | d | e | . |---+------+-------+-----+----+-----| . | 0 | 2 | 3 | 23 | 87 | 199 | . >>> | 1 | 231 | *NaN* | 989 | 98 | 63 | <<< | 1 | 231 | NaN | 989 | 98 | 63 | . | 2 | 1250 | 111 | 231 | 22 | 233 | | 3 | 89 | NaN | 11 | 8 | 9 | . >>> | 3 | 89 | *NaN* | 11 | 8 | 9 | <<< ^ . | 4 | 1 | 23 | 29 | 9 | 0 | | . | | . | | . | | . | | False | | . | | *True* | | . pd.isnull( v ) ===> | False | | . | *True* | | . | False | | . | | . | | . | | . pd . loc [ v , : ] ------------------+ .
1.4 how to count the number of "True" of a boolean list
np.sum
will default return the number of True of a given np-array.
import numpy as np nullor = np.array([True, True, False, True]) print ( np.sum(nullor) ) # return 3
1.5 how to count the number of specified value in a column
can ONLY use the numpy.
- create a mask:
mask = arr == specifiedVal
- using the filter syntax sugar:
bool_arr = arr[mask]
- using the
sum
method:np.sum(bool_arr)
1.6 pd.isnull() => boolean Series
. | a | . |-----| |------| . | 1 | | False| . | NaN |.........| True | . pd.isnull ( | 237 | ) ===> | False| . | 12 | | False| . | NaN |.........| True | . | 29 | | False| . . ~pd.isnull~ accepy a array-like object, return a same shape boolean array-like object. . all the ~NaN~ will map to ~True~.
1.7 do some exercise, for fun
from numpy import NaN def comp_2(vec): null_vec = pd.isnull(vec) null_num = np.sum(null_vec) num = vec.size return null_num / num def number_2(vec): null_vec = pd.isnull(vec) null_num = np.sum(null_vec) return null_num dfxxx = pd.DataFrame({ 'a' : [1, NaN, 2, 3, 4, NaN], 'b' : [1, 33, 2, 3, 4, NaN], 'c' : [1, 23, NaN, NaN, NaN, NaN], }) num_null_series = dfxxx.apply(comp_2, axis=1) print ( num_null_series )
0 0.000000 1 0.333333 2 0.333333 3 0.333333 4 0.333333 5 1.000000 dtype: float64