UP | HOME

Pandas-笔记2

Table of Contents

1 Apply, part-VI

Apply, this will happens at the time like:

  • if you're pivoting tables, but that's not enough to clean your data.
  • for each row of your data, you want to perform a more complex calculation, other than adding two columns or calculate the mean of the row.

Then, you can def the function you want, then using apply(funcName) on Series(row/column)

Note that, mathOperation(Series) => Series is very important for combination, it keeps entry-type and exit-type the same.

1.1 define some complex computation

import pandas as pd
def my_function():
      # indent 4 spaces
      # function code
      pass

def my_sq(x):
      """squares a given value
      """
      return x**2

def avg_2(x, y):
      """calcuate average of 2 numbers
      """
      return (x+y)/2

print ( my_sq(2) )
print ( my_sq(4) )
print ( avg_2(4, 9) )
4
16
6.5

1.1.1 mathematical operation on Series is element-wise

df = pd.DataFrame({
    'a' : [10, 20, 30],
    'b' : [20, 30, 40]
    })
print ( df )
print ( df['a'] ** 2) ## element-wise squaring
a   b
0  10  20
1  20  30
2  30  40
0    100
1    400
2    900
Name: a, dtype: int64

1.1.2 apply user-defined function on Series

Apply on the Series is default element-wise operation.

.    |  a |
.    +----+
.    | 10 |  .  apply ( my_sq ) --->  my_sq( 10 )  ,  my_sq( 20 )  ,  my_sq( 30 )
.    | 20 |
.    | 30 |
.                                     \---------/     \---------/     \---------/
.                                            |              |              |
.                                            v              |              |
.                                         100.0             |              |
.              a new Series               400.0  <----------+              |
.          with shape = (3,)              900.0  <-------------------------+


print ( df['a'] )
print ( df['a'].apply(my_sq)) ## apply function ~my_sq~ element-wise
0    10
1    20
2    30
Name: a, dtype: int64
0    100
1    400
2    900
Name: a, dtype: int64

1.1.3 apply user-defined function on DataFrame, column-wise

apply on the DataFrame is default column-wise operation.

What actually happends when you do apply on an df, it's default apply the function on each column, means the column is passed into the 1st argument of the function.

.    |   |  a |  b |                                  |  a |                   |  b |
.    +---+----+----+                                  +----+                   +----+
.    | 0 | 10 | 20 |  .  apply ( avg_3 ) --->  avg_3( | 10 | )     ,    avg_3( | 20 | )
.    | 1 | 20 | 30 |                                  | 20 |                   | 30 |
.    | 2 | 30 | 40 |                                  | 30 |                   | 40 |
.                                              \-------------/          \-------------/
.                                                     |                        |
.                                                     v                        |
.                                                   20.0                       |
.                       a new Series                30.0  <--------------------+
.                 with shape = (2,)
def print_me(x):
    print(x)
df.apply(print_me)
0    10
1    20
2    30
Name: a, dtype: int64
0    20
1    30
2    40
Name: b, dtype: int64

1.1.4 xxx

def avg_3(x, y, z):
    """avg of 3 numbers
    """
    return (x+y+z)/3

df.apply(avg_3)  # wrong
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/babel-32610Fla/python-32610oRZ", line 6, in <module>
    df.apply(avg_3)  # wrong
  File "/home/yiddi/anaconda3/envs/tensorflow/lib/python3.6/site-packages/pandas/core/frame.py", line 4262, in apply
    ignore_failures=ignore_failures)
  File "/home/yiddi/anaconda3/envs/tensorflow/lib/python3.6/site-packages/pandas/core/frame.py", line 4358, in _apply_standard
    results[i] = func(v)
TypeError: ("avg_3() missing 2 required positional arguments: 'y' and 'z'", 'occurred at index a')

This caused some error, what actually happends when you do apply on an df, it's default apply the function on each column, means the column is passed into the 1st argument of the function.

.    |   |  a |  b |                                  |  a |                   |  b |
.    +---+----+----+                                  +----+                   +----+
.    | 0 | 10 | 20 |  .  apply ( avg_3 ) --->  avg_3( | 10 | )     ,    avg_3( | 20 | )
.    | 1 | 20 | 30 |                                  | 20 |                   | 30 |
.    | 2 | 30 | 40 |                                  | 30 |                   | 40 |
.                                              \-------------/          \-------------/
.                                                     |                        |
.                                                     v                        |
.                                                   20.0                       |
.                       a new Series                30.0  <--------------------+
.                 with shape = (2,)
def avg_3(col):
    """avg of 3 numbers
    """
    x=col[0]
    y=col[1]
    z=col[2]
    return (x+y+z)/3

avg_result = df.apply(avg_3)  # right
print ( df )
print ( avg_result )
print ( avg_result.shape )
print ( type(avg_result) )
a   b
0  10  20
1  20  30
2  30  40
a    20.0
b    30.0
dtype: float64
(2,)
<class 'pandas.core.series.Series'>

1.1.5 apply user-defined function on DataFrame, row-wise

add parameter axis = 1 in the apply will give you the row-wise operation.

def avg_2(row):
    """avg of 3 numbers
    """
    x=row[0]
    y=row[1]
    return (x+y)/2
row_avg = df.apply(avg_2, axis=1)
print ( df )
print ( row_avg )
a   b
0  10  20
1  20  30
2  30  40
0    15.0
1    25.0
2    35.0
dtype: float64

1.2 Titanic data statistics

1.2.1 read in data from seaborn dataset

import seaborn as sns
titanic = sns.load_dataset('titanic')
print ( titanic )
titanic.info()
survived  pclass     sex   age  sibsp  parch      fare embarked   class  \
0           0       3    male  22.0      1      0    7.2500        S   Third
1           1       1  female  38.0      1      0   71.2833        C   First
2           1       3  female  26.0      0      0    7.9250        S   Third
3           1       1  female  35.0      1      0   53.1000        S   First
4           0       3    male  35.0      0      0    8.0500        S   Third
5           0       3    male   NaN      0      0    8.4583        Q   Third
6           0       1    male  54.0      0      0   51.8625        S   First
7           0       3    male   2.0      3      1   21.0750        S   Third
8           1       3  female  27.0      0      2   11.1333        S   Third
9           1       2  female  14.0      1      0   30.0708        C  Second
10          1       3  female   4.0      1      1   16.7000        S   Third
11          1       1  female  58.0      0      0   26.5500        S   First
12          0       3    male  20.0      0      0    8.0500        S   Third
13          0       3    male  39.0      1      5   31.2750        S   Third
14          0       3  female  14.0      0      0    7.8542        S   Third
15          1       2  female  55.0      0      0   16.0000        S  Second
16          0       3    male   2.0      4      1   29.1250        Q   Third
17          1       2    male   NaN      0      0   13.0000        S  Second
18          0       3  female  31.0      1      0   18.0000        S   Third
19          1       3  female   NaN      0      0    7.2250        C   Third
20          0       2    male  35.0      0      0   26.0000        S  Second
21          1       2    male  34.0      0      0   13.0000        S  Second
22          1       3  female  15.0      0      0    8.0292        Q   Third
23          1       1    male  28.0      0      0   35.5000        S   First
24          0       3  female   8.0      3      1   21.0750        S   Third
25          1       3  female  38.0      1      5   31.3875        S   Third
26          0       3    male   NaN      0      0    7.2250        C   Third
27          0       1    male  19.0      3      2  263.0000        S   First
28          1       3  female   NaN      0      0    7.8792        Q   Third
29          0       3    male   NaN      0      0    7.8958        S   Third
..        ...     ...     ...   ...    ...    ...       ...      ...     ...
861         0       2    male  21.0      1      0   11.5000        S  Second
862         1       1  female  48.0      0      0   25.9292        S   First
863         0       3  female   NaN      8      2   69.5500        S   Third
864         0       2    male  24.0      0      0   13.0000        S  Second
865         1       2  female  42.0      0      0   13.0000        S  Second
866         1       2  female  27.0      1      0   13.8583        C  Second
867         0       1    male  31.0      0      0   50.4958        S   First
868         0       3    male   NaN      0      0    9.5000        S   Third
869         1       3    male   4.0      1      1   11.1333        S   Third
870         0       3    male  26.0      0      0    7.8958        S   Third
871         1       1  female  47.0      1      1   52.5542        S   First
872         0       1    male  33.0      0      0    5.0000        S   First
873         0       3    male  47.0      0      0    9.0000        S   Third
874         1       2  female  28.0      1      0   24.0000        C  Second
875         1       3  female  15.0      0      0    7.2250        C   Third
876         0       3    male  20.0      0      0    9.8458        S   Third
877         0       3    male  19.0      0      0    7.8958        S   Third
878         0       3    male   NaN      0      0    7.8958        S   Third
879         1       1  female  56.0      0      1   83.1583        C   First
880         1       2  female  25.0      0      1   26.0000        S  Second
881         0       3    male  33.0      0      0    7.8958        S   Third
882         0       3  female  22.0      0      0   10.5167        S   Third
883         0       2    male  28.0      0      0   10.5000        S  Second
884         0       3    male  25.0      0      0    7.0500        S   Third
885         0       3  female  39.0      0      5   29.1250        Q   Third
886         0       2    male  27.0      0      0   13.0000        S  Second
887         1       1  female  19.0      0      0   30.0000        S   First
888         0       3  female   NaN      1      2   23.4500        S   Third
889         1       1    male  26.0      0      0   30.0000        C   First
890         0       3    male  32.0      0      0    7.7500        Q   Third

       who  adult_male deck  embark_town alive  alone
0      man        True  NaN  Southampton    no  False  
1    woman       False    C    Cherbourg   yes  False  
2    woman       False  NaN  Southampton   yes   True
3    woman       False    C  Southampton   yes  False
4      man        True  NaN  Southampton    no   True
5      man        True  NaN   Queenstown    no   True
6      man        True    E  Southampton    no   True
7    child       False  NaN  Southampton    no  False
8    woman       False  NaN  Southampton   yes  False
9    child       False  NaN    Cherbourg   yes  False
10   child       False    G  Southampton   yes  False
11   woman       False    C  Southampton   yes   True
12     man        True  NaN  Southampton    no   True
13     man        True  NaN  Southampton    no  False
14   child       False  NaN  Southampton    no   True
15   woman       False  NaN  Southampton   yes   True
16   child       False  NaN   Queenstown    no  False
17     man        True  NaN  Southampton   yes   True
18   woman       False  NaN  Southampton    no  False
19   woman       False  NaN    Cherbourg   yes   True
20     man        True  NaN  Southampton    no   True
21     man        True    D  Southampton   yes   True
22   child       False  NaN   Queenstown   yes   True
23     man        True    A  Southampton   yes   True
24   child       False  NaN  Southampton    no  False
25   woman       False  NaN  Southampton   yes  False
26     man        True  NaN    Cherbourg    no   True
27     man        True    C  Southampton    no  False
28   woman       False  NaN   Queenstown   yes   True
29     man        True  NaN  Southampton    no   True
..     ...         ...  ...          ...   ...    ...
861    man        True  NaN  Southampton    no  False
862  woman       False    D  Southampton   yes   True
863  woman       False  NaN  Southampton    no  False
864    man        True  NaN  Southampton    no   True
865  woman       False  NaN  Southampton   yes   True
866  woman       False  NaN    Cherbourg   yes  False
867    man        True    A  Southampton    no   True
868    man        True  NaN  Southampton    no   True
869  child       False  NaN  Southampton   yes  False
870    man        True  NaN  Southampton    no   True
871  woman       False    D  Southampton   yes  False
872    man        True    B  Southampton    no   True
873    man        True  NaN  Southampton    no   True
874  woman       False  NaN    Cherbourg   yes  False
875  child       False  NaN    Cherbourg   yes   True
876    man        True  NaN  Southampton    no   True
877    man        True  NaN  Southampton    no   True
878    man        True  NaN  Southampton    no   True
879  woman       False    C    Cherbourg   yes  False
880  woman       False  NaN  Southampton   yes  False
881    man        True  NaN  Southampton    no   True
882  woman       False  NaN  Southampton    no   True
883    man        True  NaN  Southampton    no   True
884    man        True  NaN  Southampton    no   True
885  woman       False  NaN   Queenstown    no  False
886    man        True  NaN  Southampton    no   True
887  woman       False    B  Southampton   yes   True
888  woman       False  NaN  Southampton    no  False
889    man        True    C    Cherbourg   yes   True
890    man        True  NaN   Queenstown    no   True

[891 rows x 15 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
survived       891 non-null int64
pclass         891 non-null int64
sex            891 non-null object
age            714 non-null float64
sibsp          891 non-null int64
parch          891 non-null int64
fare           891 non-null float64
embarked       889 non-null object
class          891 non-null category
who            891 non-null object
adult_male     891 non-null bool
deck           203 non-null category
embark_town    889 non-null object
alive          891 non-null object
alone          891 non-null bool
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.6+ KB
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/babel-2374654r/python-23746Z2m", line 5, in <module>
    nullor = np.array([True, False, False, True])
NameError: name 'np' is not defined

1.2.2 sum up the missing value of each column

sum up the number of NaN of each column, return a Series of shape (rowNum,)

import numpy as np
import pandas as pd
# count missing values
def count_missing(vec):
    null_vec = pd.isnull(vec)
    null_count = np.sum(null_vec)
    return ( null_count )

def prop_missing(vec):
    num = count_missing(vec)
    dem = vec.size
    return num / dem

def prop_complete(vec):
    return 1 - prop_missing(vec)

nan_item_number = titanic.apply(count_missing)
print ( nan_item_number )
survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
<class 'pandas.core.series.Series'>

1.2.3 filter out the rows which has a non_NaN value in column 'embark_town'

summary: pandas has a kind of filter same with numpy filter pandas has a another filter, which numpy dosn't has

[df].loc[boolSeries, :] will get the all rows whose the item in boolSeries column is True [df].loc[:, boolSeries] will get the all columns whose the item in boolSeries row is True

nan_row = titanic.loc[pd.isnull( titanic['embark_town']), :]
print ( nan_row )

1.2.4 sum up the missing value of each row

number_nan_in_row = titanic.apply( count_missing, axis=1 ) # count each row, how many NaN
print ( number_nan_in_row )
0      1
1      0
2      1
3      0
4      1
5      2
6      0
7      1
8      1
9      1
10     0
11     0
12     1
13     1
14     1
15     1
16     1
17     2
18     1
19     2
20     1
21     0
22     1
23     0
24     1
25     1
26     2
27     0
28     2
29     2
      ..
861    1
862    0
863    2
864    1
865    1
866    1
867    0
868    2
869    1
870    1
871    0
872    0
873    1
874    1
875    1
876    1
877    1
878    2
879    0
880    1
881    1
882    1
883    1
884    1
885    1
886    1
887    0
888    2
889    0
890    1
Length: 891, dtype: int64

1.2.5 sum up the number of differnt numbers of each row

number_nan_in_row = titanic.apply( count_missing, axis=1 ) # count each row, how many NaN
print ( number_nan_in_row.value_counts() )
1    549
0    182
2    160
dtype: int64

1.2.6 sum up the prop_complete of differnt numbers of each row

number_nan_in_row = titanic.apply( prop_complete, axis=1 ) # count each row, how many NaN
print ( number_nan_in_row.value_counts() )
0.933333    549
1.000000    182
0.866667    160
dtype: int64

1.3 pandas filter vs. numpy filter

pandas has 2 kinds of filter numpy has 1 kinds of filter
loc_filter: df.loc[mask, mask] arr_filter: arr[mask]
df_filter: df[mask]  

mask: is some boolean operation on DataFram/npArray/Series

summary: pandas has a kind of filter same with numpy filter pandas has a another filter, which numpy dosn't has

[df].loc[boolSeries, :] will get the rows whose the item in boolSeries column is True

[df].loc[:, boolSeries] will get the columns whose the item in boolSeries row is True

.
.   ~titanic . loc [ pd.isnull( df['b'] ), : ]~
.
.      |   |    a |   b   |   c |  d |   e |
.      |---+------+-------+-----+----+-----|
.      | 0 |    2 |   3   |  23 | 87 | 199 |
.  >>> | 1 |  231 | *NaN* | 989 | 98 |  63 | <<<        | 1 |  231 | NaN | 989 | 98 |  63 |
.      | 2 | 1250 | 111   | 231 | 22 | 233 |            | 3 |   89 | NaN |  11 |  8 |   9 |
.  >>> | 3 |   89 | *NaN* |  11 |  8 |   9 | <<<                          ^
.      | 4 |    1 |  23   |  29 |  9 |   0 |                              |
.                    |                                                    |
.                    |                                                    |
.                    |                                                    |
.                    |                | False  |                          |
.                    |                | *True* |                          |
.     pd.isnull(     v     )  ===>    | False  |                          |
.                                     | *True* |                          |
.                                     | False  |                          |
.                                         |                               |
.                                         |                               |
.                                         |                               |
.                          pd . loc [     v  ,  :   ]   ------------------+
.

1.4 how to count the number of "True" of a boolean list

np.sum will default return the number of True of a given np-array.

import numpy as np
nullor = np.array([True, True, False, True])
print ( np.sum(nullor) ) # return 3

1.5 how to count the number of specified value in a column

can ONLY use the numpy.

  1. create a mask: mask = arr == specifiedVal
  2. using the filter syntax sugar: bool_arr = arr[mask]
  3. using the sum method: np.sum(bool_arr)

1.6 pd.isnull() => boolean Series

.                 |   a |
.                 |-----|         |------|
.                 |   1 |         | False|
.                 | NaN |.........| True |
.     pd.isnull ( | 237 | ) ===>  | False|
.                 |  12 |         | False|
.                 | NaN |.........| True |
.                 |  29 |         | False|
.
.     ~pd.isnull~ accepy a array-like object, return a same shape boolean array-like object.
.                 all the ~NaN~ will map to ~True~.

1.7 do some exercise, for fun

from numpy import NaN
def comp_2(vec):
    null_vec = pd.isnull(vec)
    null_num = np.sum(null_vec)
    num = vec.size
    return null_num / num

def number_2(vec):
    null_vec = pd.isnull(vec)
    null_num = np.sum(null_vec)
    return null_num

dfxxx = pd.DataFrame({
    'a' : [1, NaN, 2, 3, 4, NaN],
    'b' : [1, 33, 2, 3, 4, NaN],
    'c' : [1, 23, NaN, NaN, NaN, NaN],
})


num_null_series = dfxxx.apply(comp_2, axis=1)
print ( num_null_series )
0    0.000000
1    0.333333
2    0.333333
3    0.333333
4    0.333333
5    1.000000
dtype: float64