Pandas-笔记2
Table of Contents
- 1. Apply, part-VI
- 1.1. define some complex computation
- 1.2. Titanic data statistics
- 1.2.1. read in data from seaborn dataset
- 1.2.2. sum up the missing value of each column
- 1.2.3. filter out the rows which has a non_NaN value in column 'embark_town'
- 1.2.4. sum up the missing value of each row
- 1.2.5. sum up the number of differnt numbers of each row
- 1.2.6. sum up the prop_complete of differnt numbers of each row
- 1.3. pandas filter vs. numpy filter
- 1.4. how to count the number of "True" of a boolean list
- 1.5. how to count the number of specified value in a column
- 1.6. pd.isnull() => boolean Series
- 1.7. do some exercise, for fun
1 Apply, part-VI
Apply, this will happens at the time like:
- if you're pivoting tables, but that's not enough to clean your data.
- for each row of your data, you want to perform a more complex calculation, other than adding two columns or calculate the mean of the row.
Then, you can def the function you want, then using apply(funcName) on
Series(row/column)
Note that, mathOperation(Series) => Series is very important for
combination, it keeps entry-type and exit-type the same.
1.1 define some complex computation
import pandas as pd def my_function(): # indent 4 spaces # function code pass def my_sq(x): """squares a given value """ return x**2 def avg_2(x, y): """calcuate average of 2 numbers """ return (x+y)/2 print ( my_sq(2) ) print ( my_sq(4) ) print ( avg_2(4, 9) )
4 16 6.5
1.1.1 mathematical operation on Series is element-wise
df = pd.DataFrame({ 'a' : [10, 20, 30], 'b' : [20, 30, 40] }) print ( df ) print ( df['a'] ** 2) ## element-wise squaring
a b 0 10 20 1 20 30 2 30 40 0 100 1 400 2 900 Name: a, dtype: int64
1.1.2 apply user-defined function on Series
Apply on the Series is default element-wise operation.
. | a | . +----+ . | 10 | . apply ( my_sq ) ---> my_sq( 10 ) , my_sq( 20 ) , my_sq( 30 ) . | 20 | . | 30 | . \---------/ \---------/ \---------/ . | | | . v | | . 100.0 | | . a new Series 400.0 <----------+ | . with shape = (3,) 900.0 <-------------------------+
print ( df['a'] ) print ( df['a'].apply(my_sq)) ## apply function ~my_sq~ element-wise
0 10 1 20 2 30 Name: a, dtype: int64 0 100 1 400 2 900 Name: a, dtype: int64
1.1.3 apply user-defined function on DataFrame, column-wise
apply on the DataFrame is default column-wise operation.
What actually happends when you do apply on an df, it's default apply
the function on each column, means the column is passed into the 1st
argument of the function.
. | | a | b | | a | | b | . +---+----+----+ +----+ +----+ . | 0 | 10 | 20 | . apply ( avg_3 ) ---> avg_3( | 10 | ) , avg_3( | 20 | ) . | 1 | 20 | 30 | | 20 | | 30 | . | 2 | 30 | 40 | | 30 | | 40 | . \-------------/ \-------------/ . | | . v | . 20.0 | . a new Series 30.0 <--------------------+ . with shape = (2,)
def print_me(x): print(x) df.apply(print_me)
0 10 1 20 2 30 Name: a, dtype: int64 0 20 1 30 2 40 Name: b, dtype: int64
1.1.4 xxx
def avg_3(x, y, z): """avg of 3 numbers """ return (x+y+z)/3 df.apply(avg_3) # wrong
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/tmp/babel-32610Fla/python-32610oRZ", line 6, in <module>
df.apply(avg_3) # wrong
File "/home/yiddi/anaconda3/envs/tensorflow/lib/python3.6/site-packages/pandas/core/frame.py", line 4262, in apply
ignore_failures=ignore_failures)
File "/home/yiddi/anaconda3/envs/tensorflow/lib/python3.6/site-packages/pandas/core/frame.py", line 4358, in _apply_standard
results[i] = func(v)
TypeError: ("avg_3() missing 2 required positional arguments: 'y' and 'z'", 'occurred at index a')
This caused some error, what actually happends when you do apply on an df, it's default apply the function on each column,
means the column is passed into the 1st argument of the function.
. | | a | b | | a | | b | . +---+----+----+ +----+ +----+ . | 0 | 10 | 20 | . apply ( avg_3 ) ---> avg_3( | 10 | ) , avg_3( | 20 | ) . | 1 | 20 | 30 | | 20 | | 30 | . | 2 | 30 | 40 | | 30 | | 40 | . \-------------/ \-------------/ . | | . v | . 20.0 | . a new Series 30.0 <--------------------+ . with shape = (2,)
def avg_3(col): """avg of 3 numbers """ x=col[0] y=col[1] z=col[2] return (x+y+z)/3 avg_result = df.apply(avg_3) # right print ( df ) print ( avg_result ) print ( avg_result.shape ) print ( type(avg_result) )
a b 0 10 20 1 20 30 2 30 40 a 20.0 b 30.0 dtype: float64 (2,) <class 'pandas.core.series.Series'>
1.1.5 apply user-defined function on DataFrame, row-wise
add parameter axis = 1 in the apply will give you the row-wise operation.
def avg_2(row): """avg of 3 numbers """ x=row[0] y=row[1] return (x+y)/2 row_avg = df.apply(avg_2, axis=1) print ( df ) print ( row_avg )
a b 0 10 20 1 20 30 2 30 40 0 15.0 1 25.0 2 35.0 dtype: float64
1.2 Titanic data statistics
1.2.1 read in data from seaborn dataset
import seaborn as sns titanic = sns.load_dataset('titanic') print ( titanic ) titanic.info()
survived pclass sex age sibsp parch fare embarked class \
0 0 3 male 22.0 1 0 7.2500 S Third
1 1 1 female 38.0 1 0 71.2833 C First
2 1 3 female 26.0 0 0 7.9250 S Third
3 1 1 female 35.0 1 0 53.1000 S First
4 0 3 male 35.0 0 0 8.0500 S Third
5 0 3 male NaN 0 0 8.4583 Q Third
6 0 1 male 54.0 0 0 51.8625 S First
7 0 3 male 2.0 3 1 21.0750 S Third
8 1 3 female 27.0 0 2 11.1333 S Third
9 1 2 female 14.0 1 0 30.0708 C Second
10 1 3 female 4.0 1 1 16.7000 S Third
11 1 1 female 58.0 0 0 26.5500 S First
12 0 3 male 20.0 0 0 8.0500 S Third
13 0 3 male 39.0 1 5 31.2750 S Third
14 0 3 female 14.0 0 0 7.8542 S Third
15 1 2 female 55.0 0 0 16.0000 S Second
16 0 3 male 2.0 4 1 29.1250 Q Third
17 1 2 male NaN 0 0 13.0000 S Second
18 0 3 female 31.0 1 0 18.0000 S Third
19 1 3 female NaN 0 0 7.2250 C Third
20 0 2 male 35.0 0 0 26.0000 S Second
21 1 2 male 34.0 0 0 13.0000 S Second
22 1 3 female 15.0 0 0 8.0292 Q Third
23 1 1 male 28.0 0 0 35.5000 S First
24 0 3 female 8.0 3 1 21.0750 S Third
25 1 3 female 38.0 1 5 31.3875 S Third
26 0 3 male NaN 0 0 7.2250 C Third
27 0 1 male 19.0 3 2 263.0000 S First
28 1 3 female NaN 0 0 7.8792 Q Third
29 0 3 male NaN 0 0 7.8958 S Third
.. ... ... ... ... ... ... ... ... ...
861 0 2 male 21.0 1 0 11.5000 S Second
862 1 1 female 48.0 0 0 25.9292 S First
863 0 3 female NaN 8 2 69.5500 S Third
864 0 2 male 24.0 0 0 13.0000 S Second
865 1 2 female 42.0 0 0 13.0000 S Second
866 1 2 female 27.0 1 0 13.8583 C Second
867 0 1 male 31.0 0 0 50.4958 S First
868 0 3 male NaN 0 0 9.5000 S Third
869 1 3 male 4.0 1 1 11.1333 S Third
870 0 3 male 26.0 0 0 7.8958 S Third
871 1 1 female 47.0 1 1 52.5542 S First
872 0 1 male 33.0 0 0 5.0000 S First
873 0 3 male 47.0 0 0 9.0000 S Third
874 1 2 female 28.0 1 0 24.0000 C Second
875 1 3 female 15.0 0 0 7.2250 C Third
876 0 3 male 20.0 0 0 9.8458 S Third
877 0 3 male 19.0 0 0 7.8958 S Third
878 0 3 male NaN 0 0 7.8958 S Third
879 1 1 female 56.0 0 1 83.1583 C First
880 1 2 female 25.0 0 1 26.0000 S Second
881 0 3 male 33.0 0 0 7.8958 S Third
882 0 3 female 22.0 0 0 10.5167 S Third
883 0 2 male 28.0 0 0 10.5000 S Second
884 0 3 male 25.0 0 0 7.0500 S Third
885 0 3 female 39.0 0 5 29.1250 Q Third
886 0 2 male 27.0 0 0 13.0000 S Second
887 1 1 female 19.0 0 0 30.0000 S First
888 0 3 female NaN 1 2 23.4500 S Third
889 1 1 male 26.0 0 0 30.0000 C First
890 0 3 male 32.0 0 0 7.7500 Q Third
who adult_male deck embark_town alive alone
0 man True NaN Southampton no False
1 woman False C Cherbourg yes False
2 woman False NaN Southampton yes True
3 woman False C Southampton yes False
4 man True NaN Southampton no True
5 man True NaN Queenstown no True
6 man True E Southampton no True
7 child False NaN Southampton no False
8 woman False NaN Southampton yes False
9 child False NaN Cherbourg yes False
10 child False G Southampton yes False
11 woman False C Southampton yes True
12 man True NaN Southampton no True
13 man True NaN Southampton no False
14 child False NaN Southampton no True
15 woman False NaN Southampton yes True
16 child False NaN Queenstown no False
17 man True NaN Southampton yes True
18 woman False NaN Southampton no False
19 woman False NaN Cherbourg yes True
20 man True NaN Southampton no True
21 man True D Southampton yes True
22 child False NaN Queenstown yes True
23 man True A Southampton yes True
24 child False NaN Southampton no False
25 woman False NaN Southampton yes False
26 man True NaN Cherbourg no True
27 man True C Southampton no False
28 woman False NaN Queenstown yes True
29 man True NaN Southampton no True
.. ... ... ... ... ... ...
861 man True NaN Southampton no False
862 woman False D Southampton yes True
863 woman False NaN Southampton no False
864 man True NaN Southampton no True
865 woman False NaN Southampton yes True
866 woman False NaN Cherbourg yes False
867 man True A Southampton no True
868 man True NaN Southampton no True
869 child False NaN Southampton yes False
870 man True NaN Southampton no True
871 woman False D Southampton yes False
872 man True B Southampton no True
873 man True NaN Southampton no True
874 woman False NaN Cherbourg yes False
875 child False NaN Cherbourg yes True
876 man True NaN Southampton no True
877 man True NaN Southampton no True
878 man True NaN Southampton no True
879 woman False C Cherbourg yes False
880 woman False NaN Southampton yes False
881 man True NaN Southampton no True
882 woman False NaN Southampton no True
883 man True NaN Southampton no True
884 man True NaN Southampton no True
885 woman False NaN Queenstown no False
886 man True NaN Southampton no True
887 woman False B Southampton yes True
888 woman False NaN Southampton no False
889 man True C Cherbourg yes True
890 man True NaN Queenstown no True
[891 rows x 15 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
survived 891 non-null int64
pclass 891 non-null int64
sex 891 non-null object
age 714 non-null float64
sibsp 891 non-null int64
parch 891 non-null int64
fare 891 non-null float64
embarked 889 non-null object
class 891 non-null category
who 891 non-null object
adult_male 891 non-null bool
deck 203 non-null category
embark_town 889 non-null object
alive 891 non-null object
alone 891 non-null bool
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.6+ KB
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/tmp/babel-2374654r/python-23746Z2m", line 5, in <module>
nullor = np.array([True, False, False, True])
NameError: name 'np' is not defined
1.2.2 sum up the missing value of each column
sum up the number of NaN of each column, return a Series of shape (rowNum,)
import numpy as np import pandas as pd # count missing values def count_missing(vec): null_vec = pd.isnull(vec) null_count = np.sum(null_vec) return ( null_count ) def prop_missing(vec): num = count_missing(vec) dem = vec.size return num / dem def prop_complete(vec): return 1 - prop_missing(vec) nan_item_number = titanic.apply(count_missing) print ( nan_item_number )
survived 0 pclass 0 sex 0 age 177 sibsp 0 parch 0 fare 0 embarked 2 class 0 who 0 adult_male 0 deck 688 embark_town 2 alive 0 alone 0 dtype: int64 <class 'pandas.core.series.Series'>
1.2.3 filter out the rows which has a non_NaN value in column 'embark_town'
summary: pandas has a kind of filter same with numpy filter pandas has a another filter, which numpy dosn't has
[df].loc[boolSeries, :]will get the all rows whose the item inboolSeriescolumn is True[df].loc[:, boolSeries]will get the all columns whose the item inboolSeriesrow is True
nan_row = titanic.loc[pd.isnull( titanic['embark_town']), :] print ( nan_row )
1.2.4 sum up the missing value of each row
number_nan_in_row = titanic.apply( count_missing, axis=1 ) # count each row, how many NaN print ( number_nan_in_row )
0 1
1 0
2 1
3 0
4 1
5 2
6 0
7 1
8 1
9 1
10 0
11 0
12 1
13 1
14 1
15 1
16 1
17 2
18 1
19 2
20 1
21 0
22 1
23 0
24 1
25 1
26 2
27 0
28 2
29 2
..
861 1
862 0
863 2
864 1
865 1
866 1
867 0
868 2
869 1
870 1
871 0
872 0
873 1
874 1
875 1
876 1
877 1
878 2
879 0
880 1
881 1
882 1
883 1
884 1
885 1
886 1
887 0
888 2
889 0
890 1
Length: 891, dtype: int64
1.2.5 sum up the number of differnt numbers of each row
number_nan_in_row = titanic.apply( count_missing, axis=1 ) # count each row, how many NaN print ( number_nan_in_row.value_counts() )
1 549 0 182 2 160 dtype: int64
1.2.6 sum up the prop_complete of differnt numbers of each row
number_nan_in_row = titanic.apply( prop_complete, axis=1 ) # count each row, how many NaN print ( number_nan_in_row.value_counts() )
0.933333 549 1.000000 182 0.866667 160 dtype: int64
1.3 pandas filter vs. numpy filter
| pandas has 2 kinds of filter | numpy has 1 kinds of filter |
|---|---|
loc_filter: df.loc[mask, mask] |
arr_filter: arr[mask] |
df_filter: df[mask] |
mask: is some boolean operation on DataFram/npArray/Series
summary: pandas has a kind of filter same with numpy filter pandas has a another filter, which numpy dosn't has
[df].loc[boolSeries, :]will get the rows whose the item inboolSeriescolumn is True
[df].loc[:, boolSeries]will get the columns whose the item inboolSeriesrow is True
. . ~titanic . loc [ pd.isnull( df['b'] ), : ]~ . . | | a | b | c | d | e | . |---+------+-------+-----+----+-----| . | 0 | 2 | 3 | 23 | 87 | 199 | . >>> | 1 | 231 | *NaN* | 989 | 98 | 63 | <<< | 1 | 231 | NaN | 989 | 98 | 63 | . | 2 | 1250 | 111 | 231 | 22 | 233 | | 3 | 89 | NaN | 11 | 8 | 9 | . >>> | 3 | 89 | *NaN* | 11 | 8 | 9 | <<< ^ . | 4 | 1 | 23 | 29 | 9 | 0 | | . | | . | | . | | . | | False | | . | | *True* | | . pd.isnull( v ) ===> | False | | . | *True* | | . | False | | . | | . | | . | | . pd . loc [ v , : ] ------------------+ .
1.4 how to count the number of "True" of a boolean list
np.sum will default return the number of True of a given np-array.
import numpy as np nullor = np.array([True, True, False, True]) print ( np.sum(nullor) ) # return 3
1.5 how to count the number of specified value in a column
can ONLY use the numpy.
- create a mask:
mask = arr == specifiedVal - using the filter syntax sugar:
bool_arr = arr[mask] - using the
summethod:np.sum(bool_arr)
1.6 pd.isnull() => boolean Series
. | a | . |-----| |------| . | 1 | | False| . | NaN |.........| True | . pd.isnull ( | 237 | ) ===> | False| . | 12 | | False| . | NaN |.........| True | . | 29 | | False| . . ~pd.isnull~ accepy a array-like object, return a same shape boolean array-like object. . all the ~NaN~ will map to ~True~.
1.7 do some exercise, for fun
from numpy import NaN def comp_2(vec): null_vec = pd.isnull(vec) null_num = np.sum(null_vec) num = vec.size return null_num / num def number_2(vec): null_vec = pd.isnull(vec) null_num = np.sum(null_vec) return null_num dfxxx = pd.DataFrame({ 'a' : [1, NaN, 2, 3, 4, NaN], 'b' : [1, 33, 2, 3, 4, NaN], 'c' : [1, 23, NaN, NaN, NaN, NaN], }) num_null_series = dfxxx.apply(comp_2, axis=1) print ( num_null_series )
0 0.000000 1 0.333333 2 0.333333 3 0.333333 4 0.333333 5 1.000000 dtype: float64