scikit-learn Notes 15: Performance Metrics and Model Evaluation
Table of Contents
- 1. Model Evaluation, Scoring Metrics, and Dealing with Imbalanced Classes
- 2. Built-In and custom scoring functions
- 3. Misc tools
1 Model Evaluation, Scoring Metrics, and Dealing with Imbalanced Classes
1.1 beyond the default evaluation of the model
In the previous notebook, we already went into some detail on how to evaluate a model and how to pick the best model (here, picking the best model means picking the best parameters).
So far, we assumed that we were given a performance measure, a measure of the quality of the model. What measure one should use is not always obvious, though. The default scores in scikit-learn are
- accuracy for classification, which is the fraction of correctly classified samples,
- r2 for regression, which is the coefficient of determination.
These are reasonable default choices in many scenarios; however, depending on our task, these are not always the definitive or recommended choices.
Let's take a look at classification in more detail, going back to the application of classifying handwritten digits. So, how about training a classifier and walking through the different ways we can evaluate it? Scikit-learn has many helpful methods in the sklearn.metrics module that can help us with this task:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
np.set_printoptions(precision=2)
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

digits = load_digits()
X, y = digits.data, digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y, test_size=0.25)

classifier = LinearSVC(random_state=1).fit(X_train, y_train)
y_test_pred = classifier.predict(X_test)

print("Accuracy: {}".format(classifier.score(X_test, y_test)))
1.1.1 how to know which classes are hard to predict in a classification problem: method 1
By two tools:
- matshow(2d-ndarray)
- confusion_matrix(true_value_array, predict_value_array)
Here, we predicted 95.3% of samples correctly. For multi-class problems, it is often interesting to know which of the classes are hard to predict, which are easy, and which classes get confused. One way to get more information about misclassifications is the confusion_matrix, which shows for each true class how frequent a given predicted outcome is.
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_test_pred)
array([[44,  0,  0,  0,  1,  0,  0,  0,  0,  0],
       [ 0, 45,  0,  0,  0,  0,  0,  0,  1,  0],
       [ 0,  1, 43,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0, 45,  0,  0,  0,  0,  0,  1],
       [ 0,  2,  0,  0, 41,  0,  0,  1,  0,  1],
       [ 0,  1,  0,  0,  0, 42,  1,  0,  0,  2],
       [ 0,  0,  0,  0,  0,  1, 44,  0,  0,  0],
       [ 0,  1,  0,  0,  1,  0,  0, 42,  1,  0],
       [ 0,  4,  0,  1,  0,  0,  0,  0, 37,  1],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0, 45]])
A plot is sometimes more readable:
plt.matshow(confusion_matrix(y_test, y_test_pred), cmap="Blues")
plt.colorbar(shrink=0.8)
plt.xticks(range(10))
plt.yticks(range(10))
plt.xlabel("Predicted label")
plt.ylabel("True label");
We can see that most entries are on the diagonal, which means that we predicted nearly all samples correctly. The off-diagonal entries show us that many eights were classified as ones, and that nines are likely to be confused with many other classes.
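As a side note (my addition, not in the original notebook, and assuming a reasonably recent scikit-learn, version 1.0 or later), sklearn.metrics.ConfusionMatrixDisplay can produce essentially the same plot directly from the label arrays:

from sklearn.metrics import ConfusionMatrixDisplay

# Plot the confusion matrix directly from the true and predicted labels.
ConfusionMatrixDisplay.from_predictions(y_test, y_test_pred, cmap="Blues");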
1.1.2 how to know which classes are hard to predict in a classification problem: method 2
By one tool:
- classification_report(true_label_array, predict_label_array)
Another useful function is the classification_report, which provides:
- precision,
- recall,
- fscore
- support
for all classes.
Precision is the fraction of the predictions for a class that are actually that class. With:
- TP,"true positive"
- FP,"false positive"
- TN,"true negative"
- FN,"false negative"
respectively:
Precision = TP / (TP + FP)
Recall is how many of the true positives were recovered:
Recall = TP / (TP + FN)
F1-score is the harmonic mean of precision and recall:
F1 = 2 x (precision x recall) / (precision + recall)
All of these values lie in the closed interval [0, 1], where 1 means a perfect score.
from sklearn.metrics import classification_report

print(classification_report(y_test, y_test_pred))
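To connect the report back to the formulas above, here is a minimal sketch (my own addition, reusing y_test, y_test_pred, and np from the cells above) that computes precision, recall, and F1 for a single class, say class 8, by hand:

# Hedged sketch: precision, recall, and F1 for one class (class 8), computed by hand.
cls = 8
tp = np.sum((y_test == cls) & (y_test_pred == cls))   # true positives
fp = np.sum((y_test != cls) & (y_test_pred == cls))   # false positives
fn = np.sum((y_test == cls) & (y_test_pred != cls))   # false negatives
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print("precision: {:.2f}  recall: {:.2f}  f1: {:.2f}".format(precision, recall, f1))

The numbers should match the class-8 row of the classification_report output.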
1.1.3 why method 2 is better for imbalanced classes and asymmetric costs
These metrics are helpful in two particular cases that come up often in practice:
- Imbalanced classes, that is one class might be much more frequent than the other.
- Asymmetric costs, that is one kind of error is much more "costly" than the other.
Let's have a look at the first case (imbalanced classes). Say we have a class imbalance of 1:9, which is rather mild (think about ad-click prediction, where maybe only 0.001% of ads might be clicked):
1.1.3.1 accuracy is NOT a good evaluation metric for imbalanced datasets
np.bincount(y) / y.shape[0]
array([ 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1])
As a toy example, let's say we want to classify the digits three against all other digits:
X, y = digits.data, digits.target == 3
Now we run cross-validation on a classifier to see how well it does:
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

cross_val_score(SVC(), X, y)
array([ 0.9, 0.9, 0.9])
Our classifier is 90% accurate. Is that good? Or bad? Keep in mind that 90% of the data is "not three". So let's see how well a dummy classifier does, that always predicts the most frequent class:
from sklearn.dummy import DummyClassifier

cross_val_score(DummyClassifier(strategy="most_frequent"), X, y)
array([ 0.9, 0.9, 0.9])
Also 90% (as expected)! So one might think that this means our classifier is not very good; it doesn't do better than a simple strategy that doesn't even look at the data. That would be judging too quickly, though.
Accuracy is simply not a good way to evaluate classifiers for imbalanced datasets!
np.bincount(y) / y.shape[0]
array([ 0.9, 0.1])
1.1.3.2 ROC curves are better for imbalanced datasets
A much better measure is the so-called ROC (Receiver Operating Characteristics) curve. A roc curve works with uncertainty outputs of a classifier, say the "decision_function" of the SVC we trained above. Instead of making a cut-off at zero (a single threshold) and looking at classification outcomes, it looks at every possible cut-off (every possible threshold) and records how many true positive predictions there are, and how many false positive predictions there are.
The following plot compares the roc curve of three parameter settings of our classifier on the "three vs rest" task.
from sklearn.metrics import roc_curve, roc_auc_score

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for gamma in [.01, .05, 1]:
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate (recall)")
    svm = SVC(gamma=gamma).fit(X_train, y_train)
    decision_function = svm.decision_function(X_test)
    fpr, tpr, _ = roc_curve(y_test, decision_function)
    acc = svm.score(X_test, y_test)  # <- accuracy score
    auc = roc_auc_score(y_test, svm.decision_function(X_test))  # <- auc score
    plt.plot(fpr, tpr, label="acc:%.2f auc:%.2f" % (acc, auc), linewidth=3)
plt.legend(loc="best");
1.1.3.3 interpretation of auc and roc
With a very high decision threshold, there will be few false positives, but also few true positives (many false negatives), while with a very low threshold, both the true positive rate and the false positive rate will be high.
So in general, the curve will be from the lower left to the upper right. A diagonal line reflects chance performance, while the goal is to be as much in the top left corner as possible. This means giving a higher decision_function value to all positive samples than to any negative sample.
In this sense, this curve only considers the ranking of the positive and negative samples, not the actual value. As you can see from the curves and the accuracy values in the legend, even though all classifiers have the same accuracy, 89%, which is even lower than the dummy classifier, one of them has a perfect roc curve, while one of them performs on chance level.
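To illustrate the ranking interpretation, here is a small sketch (my own toy data, not part of the original notebook): the AUC equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score.

# Hedged sketch: AUC as the probability that a random positive sample
# is ranked above a random negative sample.
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
y_toy = rng.randint(0, 2, size=200)         # toy binary labels
scores = rng.rand(200) + 0.5 * y_toy        # positives tend to score higher

pos, neg = scores[y_toy == 1], scores[y_toy == 0]
pairwise_auc = np.mean(pos[:, np.newaxis] > neg[np.newaxis, :])  # fraction of correctly ranked pairs
print(pairwise_auc, roc_auc_score(y_toy, scores))  # the two values agree (no ties here)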
For doing grid-search and cross-validation, we usually want to condense our model evaluation into a single number. A good way to do this with the roc curve is to use the area under the curve (AUC). We can simply use this in cross_val_score by specifying scoring="roc_auc":
from sklearn.model_selection import cross_val_score

cross_val_score(SVC(), X, y, scoring="roc_auc")
array([ 1., 1., 1.])
2 Built-In and custom scoring functions
There are many more scoring methods available, which are useful for different kinds of tasks. You can find them in the "SCORERS" dictionary. The online documentation explains all of them.
from sklearn.metrics.scorer import SCORERS

print(SCORERS.keys())
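Note (my addition, assuming scikit-learn 1.0 or newer): the private sklearn.metrics.scorer module has been removed in recent versions; the list of built-in scorer names is now exposed via get_scorer_names:

from sklearn.metrics import get_scorer_names

# Prints the names accepted by the scoring= parameter in recent scikit-learn versions.
print(sorted(get_scorer_names()))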
It is also possible to define your own scoring metric. Instead of a string, you can provide a callable as the scoring parameter, that is, an object with a call method or a function. It needs to take:
- a model,
- the test-set features X_test,
- the test-set labels y_test,
- and return a float.
Higher floats are taken to mean better models.
Let's reimplement the standard accuracy score:
def my_accuracy_scoring(est, X, y):
    return np.mean(est.predict(X) == y)

cross_val_score(SVC(), X, y, scoring=my_accuracy_scoring)
array([ 0.9, 0.9, 0.9])
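As an alternative (a sketch of my own, not from the original notebook), sklearn.metrics.make_scorer wraps a plain metric function of the form metric(y_true, y_pred) into a scorer object; here it is used with balanced_accuracy_score (available in scikit-learn 0.20+, an assumption about your installed version):

from sklearn.metrics import make_scorer, balanced_accuracy_score

# make_scorer turns a metric(y_true, y_pred) function into a scorer(estimator, X, y) callable.
cross_val_score(SVC(), X, y, scoring=make_scorer(balanced_accuracy_score))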
In previous sections, we typically used the accuracy measure to evaluate the performance of our classifiers. A related measure that we haven't talked about yet is the average-per-class accuracy (APCA). As we remember, the accuracy is defined as
\(ACC = \frac{TP+TN}{n}\) where n is the total number of samples. This can be generalized to \[ACC = \frac{T}{n},\] where T is the number of all correct predictions in multi-class settings.
Exercise: Given the following arrays of "true" class labels and predicted class labels, can you implement a function that uses the accuracy measure to compute the average-per-class accuracy as shown below?
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 0, 1, 1, 2, 2, 2, 2])

confusion_matrix(y_true, y_pred)
array([[1, 2, 0],
       [1, 2, 2],
       [0, 0, 2]])
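One possible solution sketch (my own, under the reading that "average-per-class accuracy" means: treat each class as a one-vs-rest binary problem, compute its binary accuracy, and average over the classes):

def avg_per_class_accuracy(y_true, y_pred):
    # For each class, compute the binary (one-vs-rest) accuracy, then average.
    accuracies = []
    for cls in np.unique(y_true):
        accuracies.append(np.mean((y_true == cls) == (y_pred == cls)))
    return np.mean(accuracies)

print(avg_per_class_accuracy(y_true, y_pred))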
3 Misc tools
3.1 scikit-learn
3.1.1 ML models by now
- from sklearn.datasets import make_blobs
- from sklearn.datasets import load_iris
- from sklearn.datasets import load_digits *
- from sklearn.model_selection import train_test_split
- from sklearn.model_selection import cross_val_score
- from sklearn.model_selection import KFold
- from sklearn.model_selection import StratifiedKFold
- from sklearn.model_selection import ShuffleSplit
- from sklearn.model_selection import GridSearchCV
- from sklearn.linear_model import LogisticRegression
- from sklearn.linear_model import LinearRegression
- from sklearn.neighbors import KNeighborsClassifier
- from sklearn.neighbors import KNeighborsRegressor
- from sklearn.preprocessing import StandardScaler
- from sklearn.decomposition import PCA
- from sklearn.metrics import confusion_matrix, accuracy_score
- from sklearn.metrics import adjusted_rand_score
- from sklearn.metrics.scorer import SCORERS *
- from sklearn.cluster import KMeans
- from sklearn.cluster import MeanShift
- from sklearn.cluster import DBSCAN # <<< this algorithm has related sources in LIHONGYI's lecture-12
- from sklearn.cluster import AffinityPropagation
- from sklearn.cluster import SpectralClustering
- from sklearn.cluster import Ward
- from sklearn.metrics import confusion_matrix
- from sklearn.metrics import accuracy_score
- from sklearn.metrics import adjusted_rand_score
- from sklearn.metrics import classification_report *
- from sklearn.feature_extraction import DictVectorizer
- from sklearn.feature_extraction.text import CountVectorizer
- from sklearn.feature_extraction.text import TfidfVectorizer
- from sklearn.preprocessing import Imputer
- from sklearn.dummy import DummyClassifier
- from sklearn.pipeline import make_pipeline
- from sklearn.svm import LinearSVC *
- from sklearn.svm import SVC *
3.2 statistics
3.2.1 coefficient of determination
The coefficient of determination, also called R-squared (related to SSR), is used to evaluate the quality of a regression; it is a companion concept to the sum of squared errors (SSE), which measures the unexplained error of the regression.
R^2 = the proportion of the variation in Y that is explained by the variation in X.
The larger R^2 is, the better our regression model fits.
It is a measure of the strength of the relationship between X and Y.
3.2.1.1 R^2 and degrees of freedom
R^2 is affected by the degrees of freedom; see this, and the LinTianXuan lecture.
3.2.1.2 R^2 and confidence interval
3.2.1.3 R^2 and hypothesis test
Can we infer a relationship between the number of medals won by a country and:
- the country's latitude
- the country's average elevation
- the country's population
number of medals = \(\beta_0+\beta_1(latitude_i)+\beta_2(elevation_i) + \beta_3(logpopulation_i)\)
3.2.1.4 R^2 and significance
3.2.1.5 SSR and SSE
Why do we need SSR and SSE? They are concepts related to the variance of a continuous random variable.
related URL: https://stats.stackexchange.com/questions/133465/finding-the-mean-and-variance-from-pdf
Third, the definition of the variance of a continuous random variable \(Var(X)\) is \(Var(X) = E[(X-\mu)^2] = \int_{-\infty}^{\infty}{(x-\mu)^2 f(x) dx}\), as detailed here. Again, you only need to solve for the integral in the support. Alternatively, it is sometimes easier to rely on the equivalent expression \(Var(X) = E[(X-\mu)^2] = E[X^2] - (E[X])^2\), where the first term is \(E[X^2] = \int_{-\infty}^{\infty}{x^2 f(x) dx}\) (see the definition of the expectation in the second paragraph) and the second term is \((E[X])^2 = \mu^2\).
Note the \((x-\mu)^2\) term: in the sample setting, summing the squared deviations from the mean, \((Y_i-\bar{Y})^2\), over the data points gives the SST.
What we want to know is how much of the SST (which is fixed once the dataset is given: each \(Y_i\) is fixed, and so is the mean of the dataset) comes from the prediction error and how much from the distance between the predictions and the mean.
The larger the SSR (and hence the smaller the SSE), the better the model is.
We separate the deviation between the true value and the mean of the true values (\(Y_i\) to \(\bar{Y}\)) into two components, the explained deviation and the unexplained deviation:
- \(\hat{Y_i}-\bar{Y}\), predicted minus mean: explained deviation; its sum of squares is SSR
- \(Y_i-\hat{Y_i}\), true minus predicted: unexplained deviation; its sum of squares is SSE
\(Y_i\) : the true y-value of a point i
\(\bar{Y}\) : the mean of all true y-values of data points
\(\hat{Y_i}\) : the predicted y-value of data point i
\(SSR=\sum(\hat{Y_i}-\bar{Y})^2\) : sum of squares due to regression
\(SSE=\sum(Y_i-\hat{Y_i})^2\) : sum of squares due to error
\(SST=SSR+SSE=\sum(Y_i-\bar{Y})^2\) : total sum of squares
\(R^2=SSR/SST\)
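A short numeric sketch (toy data of my own, assuming an ordinary least-squares fit with an intercept) showing that SST = SSR + SSE and that R^2 = SSR/SST = 1 - SSE/SST in this setting:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_vals = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

slope, intercept = np.polyfit(x, y_vals, deg=1)   # least-squares line
y_hat = slope * x + intercept

sst = np.sum((y_vals - y_vals.mean()) ** 2)       # total sum of squares
ssr = np.sum((y_hat - y_vals.mean()) ** 2)        # explained (regression) sum of squares
sse = np.sum((y_vals - y_hat) ** 2)               # unexplained (error) sum of squares

print(sst, ssr + sse)            # equal up to rounding for OLS with an intercept
print(ssr / sst, 1 - sse / sst)  # both give R^2 here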
3.2.2 auc and roc curve
https://www.youtube.com/watch?v=OAl6eAyP-yo
An roc curve is a commonly used way to visualize the performance of a binary classifier.
The roc curve is a plot of TPR against FPR, with the direction of the axes shown below:
^ large TPR
|
|
|
+----------> large FPR
roc curve:
- y-axis: True Positive Rate
- x-axis: False Positive Rate
What we want is a high TPR and a low FPR, so we expect the roc curve to push toward the upper-left corner as far as possible.
3.2.2.1 study another plot
example one:
Let's look at the whole plot (not the upper-left-corner roc curve) for whether a paper is admitted by a journal:
- x-axis: predicted probabilities
- y-axis: count of observations
- pixel: each pixel represents a paper
- blue and red: the true label of each paper (rejected or admitted)
3.2.2.2 meaning of each point in the plot
3.2.2.3 representation
example one:
(0.1, 10) means:
- (the axes show) there are 10 papers for which you predicted an admission probability of 0.1
- (the region where this point is located shows) the true status of all 10 papers was negative
example two:
(0.5, 10) means:
- (the axes show) there are 20 papers (10 blue, 10 red) for which you predicted an admission probability of 0.5
- (the region where this point is located shows) the true status is that 10 papers were negative and 10 were positive.
3.2.2.4 threshold and TPR FPR
threshold = 0.5
All samples with predicted probability > threshold are predicted positive; all samples with predicted probability < threshold are predicted negative.
This means that:
all red pixels to the right of the line are correct predictions; all blue pixels to the left of the line are correct predictions.
accuracy = the percentage of correct predictions
[true positive: the sample is actually positive, and you predict it as positive] TPR = (red region to the right of the threshold) / (whole red region); TPR = TP / (TP + FN) = true positives / (true positives + false negatives); TPR = recall
[false positive: the sample is actually negative, but you predict it as positive] FPR = (blue region to the right of the threshold) / (whole blue region); FPR = FP / (FP + TN) = false positives / (false positives + true negatives)
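To make these formulas concrete, here is a hedged sketch (with made-up predicted probabilities, not the paper-admission data) that computes TPR and FPR at a single threshold and compares a full sweep against sklearn.metrics.roc_curve:

import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])                   # true labels
proba = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9, 0.5, 0.6])   # predicted P(positive)

threshold = 0.5
y_pred = proba > threshold
tp = np.sum((y_true == 1) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))
fp = np.sum((y_true == 0) & (y_pred == 1))
tn = np.sum((y_true == 0) & (y_pred == 0))
print("TPR:", tp / (tp + fn), "FPR:", fp / (fp + tn))

# roc_curve sweeps over all thresholds and returns the corresponding FPR/TPR pairs.
fpr, tpr, thresholds = roc_curve(y_true, proba)
print(np.c_[thresholds, fpr, tpr])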
3.2.2.5 which factors affect the ROC curve
ROC curve is a plot of the TPR on the y-axis versus the FPR on the x-axis, for every possible threshold
Each threshold corresponds to exactly one (FPR, TPR) pair.
Each pair of predicted-score distributions for the positive and negative classes forms one roc curve; each choice of threshold on that pair of distributions gives one point on the curve.
The further apart the predicted probabilities of the two class distributions are (the x-axis of the plot above shows the probabilities predicted by our model), the more the roc curve expands toward the upper-left corner, the larger the AUC (area under the curve), and the better the model is.
3.2.2.6 why roc is better than misclassification rate
The roc curve visualizes ALL possible thresholds, while the misclassification rate uses only a SINGLE threshold.
3.2.2.7 3 more benefits of the roc curve
- better for imbalanced datasets, because class imbalance does not change how the roc curve is generated; roc only cares about the RATEs (TPR and FPR), not the raw counts
- can handle multi-class classification with the "one vs. all" approach; you would draw 3 curves instead of 1 (see the sketch after this list):
- 1st curve: class 1 (positive) vs. classes 2 and 3 (negative);
- 2nd curve: class 2 (positive) vs. classes 1 and 3 (negative);
- 3rd curve: class 3 (positive) vs. classes 1 and 2 (negative);
- lets you choose to minimize FPR or maximize TPR depending on the application
- minimize FPR: VIP client admission
- maximize TPR: AIDS testing
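A sketch of the "one vs. all" approach mentioned above (my own toy example on the iris dataset, not part of the original notes): one roc curve per class, treating that class as positive and the other two as negative.

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_iris, random_state=0, stratify=y_iris)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)                           # one column of scores per class

for cls in range(3):
    fpr, tpr, _ = roc_curve(y_te == cls, proba[:, cls])   # class `cls` vs. the rest
    plt.plot(fpr, tpr, label="class %d vs. rest" % cls)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend(loc="best");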