
scikit-learn Notes 15: Performance Metrics and Model Evaluation


1 Model Evaluation, Scoring Metrics, and Dealing with Imbalanced Classes

1.1 beyond the default evaluation of the model

In the previous notebook, we already went into some detail on how to evaluate a model and how to pick the best model (here, picking the best model means picking the best parameters).

So far, we assumed that we were given a performance measure, a measure of the quality of the model. What measure one should use is not always obvious, though. The default scores in scikit-learn are

  • accuracy for classification, which is the fraction of correctly classified samples,
  • r2 for regression, which is the coefficient of determination.

These are reasonable default choices in many scenarios; however, depending on our task, these are not always the definitive or recommended choices.

Let's take a look at classification in more detail, going back to the application of classifying handwritten digits. So, how about training a classifier and walking through the different ways we can evaluate it? Scikit-learn has many helpful methods in the sklearn.metrics module that can help us with this task:

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
np.set_printoptions(precision=2)
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
digits = load_digits()
X, y = digits.data, digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=1,
                                                    stratify=y,
                                                    test_size=0.25)
classifier = LinearSVC(random_state=1).fit(X_train, y_train)
y_test_pred = classifier.predict(X_test)
print("Accuracy: {}".format(classifier.score(X_test, y_test)))

1.1.1 how to know which classes are hard to predict in a classification problem: method-1

We use two tools:

  • matshow(2d-ndarray)
  • confusion_matrix(true_value_array, predict_value_array)

Here, we predicted 95.3% of samples correctly. For multi-class problems, it is often interesting to know which of the classes are hard to predict, and which are easy, or which classes get confused with each other. One way to get more information about misclassifications is the confusion_matrix, which shows, for each true class, how frequent a given predicted outcome is.

from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_test_pred)
array([[44,  0,  0,  0,  1,  0,  0,  0,  0,  0],
       [ 0, 45,  0,  0,  0,  0,  0,  0,  1,  0],
       [ 0,  1, 43,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0, 45,  0,  0,  0,  0,  0,  1],
       [ 0,  2,  0,  0, 41,  0,  0,  1,  0,  1],
       [ 0,  1,  0,  0,  0, 42,  1,  0,  0,  2],
       [ 0,  0,  0,  0,  0,  1, 44,  0,  0,  0],
       [ 0,  1,  0,  0,  1,  0,  0, 42,  1,  0],
       [ 0,  4,  0,  1,  0,  0,  0,  0, 37,  1],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0, 45]])

A plot is sometimes more readable:

plt.matshow(confusion_matrix(y_test, y_test_pred), cmap="Blues")
plt.colorbar(shrink=0.8)
plt.xticks(range(10))
plt.yticks(range(10))
plt.xlabel("Predicted label")
plt.ylabel("True label");

25041E4L.png

We can see that most entries are on the diagonal, which means that we predicted nearly all samples correctly. The off-diagonal entries show us that several eights were classified as ones, and that samples from several other classes were misclassified as nines.

1.1.2 how to know which classes are hard to predict in a classification problem: method-2

We use one tool:

  • classification_report(true_label_array, predict_label_array)

Another useful function is the classification_report which provides:

  • precision,
  • recall,
  • fscore
  • support

for all classes.

Precision is how many of the predictions for a class are actually that class. With:

  • TP, "true positive"
  • FP, "false positive"
  • TN, "true negative"
  • FN, "false negative"

respectively:

Precision = TP / (TP + FP)

Recall is how many of the true positives were recovered:

Recall = TP / (TP + FN)

F1-score is the harmonic mean of precision and recall:

F1 = 2 x (precision x recall) / (precision + recall)

All of these metrics take values in the closed interval [0, 1], where 1 means a perfect score.
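
To make these formulas concrete, here is a quick sanity check on a made-up binary problem (the toy labels are assumptions for illustration), using scikit-learn's metric functions:

from sklearn.metrics import precision_score, recall_score, f1_score
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 0, 1, 0, 0, 0]
# here TP = 2, FP = 1, FN = 2, TN = 3
print(precision_score(y_true_toy, y_pred_toy))  # 2 / (2 + 1) ~ 0.67
print(recall_score(y_true_toy, y_pred_toy))     # 2 / (2 + 2) = 0.50
print(f1_score(y_true_toy, y_pred_toy))         # 2 * (0.67 * 0.50) / (0.67 + 0.50) ~ 0.57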

from sklearn.metrics import classification_report
print(classification_report(y_test, y_test_pred))

1.1.3 why method-2 is better for imbalanced classes and asymmetric costs

These metrics are helpful in two particular cases that come up often in practice:

  • Imbalanced classes, that is, one class might be much more frequent than the other.
  • Asymmetric costs, that is, one kind of error is much more "costly" than the other.

Let's have a look at the first case. Say we have a class imbalance of 1:9, which is rather mild (think about ad-click prediction, where maybe only 0.001% of ads are clicked):

1.1.3.1 accuracy is NOT a good evaluation metric for imbalanced datasets
np.bincount(y) / y.shape[0]
array([ 0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1])

As a toy example, let's say we want to classify the digits three against all other digits:

X, y = digits.data, digits.target == 3

Now we run cross-validation on a classifier to see how well it does:

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
cross_val_score(SVC(), X, y)
array([ 0.9,  0.9,  0.9])

Our classifier is 90% accurate. Is that good? Or bad? Keep in mind that 90% of the data is "not three". So let's see how well a dummy classifier that always predicts the most frequent class does:

from sklearn.dummy import DummyClassifier
cross_val_score(DummyClassifier(strategy="most_frequent"), X, y)
array([ 0.9,  0.9,  0.9])

Also 90% (as expected)! So one might think that means our classifier is not very good, since it doesn't do better than a simple strategy that doesn't even look at the data. That would be judging too quickly, though.

Accuracy is simply not a good way to evaluate classifiers for imbalanced datasets!

np.bincount(y) / y.shape[0]
array([ 0.9,  0.1])

1.1.3.2 ROC curves are better for imbalanced datasets

A much better measure is the so-called ROC (Receiver Operating Characteristic) curve. A roc curve works with the uncertainty outputs of a classifier, say the "decision_function" of the SVC we trained above. Instead of making a cut-off at zero (a single threshold) and looking at classification outcomes, it looks at every possible cut-off (every possible threshold) and records how many true positive predictions there are, and how many false positive predictions there are.

The following plot compares the roc curves of three parameter settings of our classifier on the "three vs rest" task.

from sklearn.metrics import roc_curve, roc_auc_score
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
for gamma in [.01, .05, 1]:
    svm = SVC(gamma=gamma).fit(X_train, y_train)
    decision_function = svm.decision_function(X_test)
    fpr, tpr, _ = roc_curve(y_test, decision_function)
    acc = svm.score(X_test, y_test)                 # <- accuracy score
    auc = roc_auc_score(y_test, decision_function)  # <- auc score
    plt.plot(fpr, tpr, label="acc:%.2f auc:%.2f" % (acc, auc), linewidth=3)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (recall)")
plt.legend(loc="best");

25041RCS.png

1.1.3.3 interpretation of auc and roc

With a very high decision threshold, there will be few false positives, but also few true positives, while with a very low threshold, both the true positive rate and the false positive rate will be high.

So in general, the curve will be from the lower left to the upper right. A diagonal line reflects chance performance, while the goal is to be as much in the top left corner as possible. This means giving a higher decision_function value to all positive samples than to any negative sample.

In this sense, this curve only considers the ranking of the positive and negative samples, not the actual values. As you can see from the curves and the accuracy values in the legend, even though all classifiers have the same accuracy, 89%, which is even lower than the dummy classifier's, one of them has a perfect roc curve, while one of them performs at chance level.

For doing grid-search and cross-validation, we usually want to condense our model evaluation into a single number. A good way to do this with the roc curve is to use the area under the curve (AUC). We can simply use this in cross_val_score by specifying scoring="roc_auc":

from sklearn.model_selection import cross_val_score
cross_val_score(SVC(), X, y, scoring="roc_auc")
array([ 1.,  1.,  1.])

2 Built-In and custom scoring functions

There are many more scoring methods available, which are useful for different kinds of tasks. You can find them in the "SCORERS" dictionary. The online documentation explains all of them.

from sklearn.metrics.scorer import SCORERS
print(SCORERS.keys())

It is also possible to define your own scoring metric. Instead of a string, you can provide a callable as the scoring parameter, that is, a function or an object with a __call__ method. It needs to:

  1. take a model,
  2. take the test-set features X_test,
  3. take the test-set labels y_test,
  4. and return a float.

Higher values are taken to mean better models.

Let's reimplement the standard accuracy score:

def my_accuracy_scoring(est, X, y):
    return np.mean(est.predict(X) == y)
cross_val_score(SVC(), X, y, scoring=my_accuracy_scoring)

array([ 0.9,  0.9,  0.9])

In previous sections, we typically used the accuracy measure to evaluate the performance of our classifiers. A related measure that we haven't talked about yet is the average-per-class accuracy (APCA). As we remember, the accuracy is defined as

\(ACC = \frac{TP+TN}{n}\), where n is the total number of samples. This can be generalized to \[ACC = \frac{T}{n},\] where T is the number of all correct predictions in multi-class settings.

average-per-class.png

Exercise: Given the following arrays of "true" class labels and predicted class labels, can you implement a function that uses the accuracy measure to compute the average-per-class accuracy as shown below?

y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 0, 1, 1, 2, 2, 2, 2])
confusion_matrix(y_true, y_pred)

array([[1, 2, 0],
       [1, 2, 2],
       [0, 0, 2]])
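
One possible solution sketch: assuming average-per-class accuracy means treating each class as a binary "class c vs. rest" problem and averaging the resulting binary accuracies (check the result against the figure above), it can be computed as follows:

import numpy as np
from sklearn.metrics import accuracy_score

def average_per_class_accuracy(y_true, y_pred):
    # assumption: average the binary "class c vs. rest" accuracies
    accs = [accuracy_score(y_true == c, y_pred == c)
            for c in np.unique(y_true)]
    return np.mean(accs)

print(average_per_class_accuracy(y_true, y_pred))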

3 Misc tools

3.1 scikit-learn

3.1.1 ML tools used so far

  1. from sklearn.datasets import make_blobs
  2. from sklearn.datasets import load_iris
  3. from sklearn.datasets import load_digits *
  4. from sklearn.model_selection import train_test_split
  5. from sklearn.model_selection import cross_val_score
  6. from sklearn.model_selection import KFold
  7. from sklearn.model_selection import StratifiedKFold
  8. from sklearn.model_selection import ShuffleSplit
  9. from sklearn.model_selection import GridSearchCV
  10. from sklearn.linear_model import LogisticRegression
  11. from sklearn.linear_model import LinearRegression
  12. from sklearn.neighbors import KNeighborsClassifier
  13. from sklearn.neighbors import KNeighborsRegressor
  14. from sklearn.preprocessing import StandardScaler
  15. from sklearn.decomposition import PCA
  16. from sklearn.metrics import confusion_matrix, accuracy_score
  17. from sklearn.metrics import adjusted_rand_score
  18. from sklearn.metrics.scorer import SCORERS *
  19. from sklearn.cluster import KMeans
  20. from sklearn.cluster import MeanShift
  21. from sklearn.cluster import DBSCAN # <<< this algorithm has related sources in LIHONGYI's lecture-12
  22. from sklearn.cluster import AffinityPropagation
  23. from sklearn.cluster import SpectralClustering
  24. from sklearn.cluster import Ward
  25. from sklearn.metrics import classification_report *
  26. from sklearn.feature_extraction import DictVectorizer
  27. from sklearn.feature_extraction.text import CountVectorizer
  28. from sklearn.feature_extraction.text import TfidfVectorizer
  29. from sklearn.preprocessing import Imputer
  30. from sklearn.dummy import DummyClassifier
  31. from sklearn.pipeline import make_pipeline
  32. from sklearn.svm import LinearSVC *
  33. from sklearn.svm import SVC *

3.2 statistics

3.2.1 coefficient of determination

The coefficient of determination, also called R-squared (and related to SSR), is used to evaluate regression fit; it is a companion concept to the sum of squared errors (SSE), which measures the unexplained error of the regression.

R^2 = the proportion of the variation in Y explained by the variation in X.

The larger R^2 is, the better our regression model fits.

It's a measure of the strength of the relationship between x and y.

3.2.1.1 R^2 and degrees of freedom

R^2 is affected by the degrees of freedom; see this, and the LinTianXuan lecture.

3.2.1.2 R^2 and confidence interval
3.2.1.3 R^2 and hypothesis test

see this

Can we infer a relationship between

Number of medals won by a country

and

  1. the country's latitude
  2. the country's average elevation
  3. the country's population

number of medals = \(\beta_0+\beta_1(latitude_i)+\beta_2(elevation_i) + \beta_3(logpopulation_i)\)

3.2.1.4 R^2 and significance
3.2.1.5 SSR and SSE

Why do we need SSR and SSE? They are concepts related to the variance of a continuous random variable.

related URL: https://stats.stackexchange.com/questions/133465/finding-the-mean-and-variance-from-pdf

Third, the definition of the variance of a continuous random variable \(Var(X)\) is \(Var(X) = E[(X-\mu)^2] = \int_{-\infty}^{\infty}{(x-\mu)^2 f(x) dx}\), as detailed here. Again, you only need to solve for the integral in the support. Alternatively, it is sometimes easier to rely on the equivalent expression \(Var(X) = E[(X-\mu)^2] = E[X^2] - (E[X])^2\), where the first term is \(E[X^2] = \int_{-\infty}^{\infty}{x^2 f(x) dx}\) (see the definition of the expectation in the second paragraph) and the second term is \((E[X])^2 = \mu^2\).

Note the \((x-\mu)^2\) term; the SST is the sample analogue of this quantity (summed over the data instead of integrated).

What we want to determine is whether the SST (which is fixed for a given dataset, since each y_i and the dataset mean are fixed) comes more from the prediction error (SSE) or from the distance of the predictions from the mean (SSR).

The larger the SSR and the smaller the SSE, the better the model is.

screenshot_2018-06-12_21-38-43.png

Separate the distance between the true value and the mean of the true values (\(Y_i\) to \(\bar{Y}\)) into two components, the explained deviation and the unexplained deviation:

  • \(\hat{Y_i}-\bar{Y}\) (prediction minus mean): explained deviation; its sum of squares is SSR
  • \(Y_i-\hat{Y_i}\) (true value minus prediction): unexplained deviation; its sum of squares is SSE

\(Y_i\) : the true y-value of a point i

\(\bar{Y}\) : the mean of all true y-values of data points

\(\hat{Y_i}\) : the predicted y-value of data point i

\(SSR=\sum(\hat{Y_i}-\bar{Y})^2\) : sum of squares due to regression

\(SSE=\sum(Y_i-\hat{Y_i})^2\) : sum of squares due to error

\(SST=SSR+SSE=\sum(Y_i-\bar{Y})^2\) : total sum of squares

\(R^2=SSR/SST\)
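
A minimal numerical check of the decomposition above on made-up data, assuming an ordinary least-squares fit with an intercept (SST = SSR + SSE and R^2 = SSR/SST only hold in that setting):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
x = rng.uniform(0, 10, size=50).reshape(-1, 1)
y_reg = 2.0 * x.ravel() + 1.0 + rng.normal(scale=2.0, size=50)

model = LinearRegression().fit(x, y_reg)
y_hat = model.predict(x)

sst = np.sum((y_reg - y_reg.mean()) ** 2)  # total sum of squares
ssr = np.sum((y_hat - y_reg.mean()) ** 2)  # explained (regression) sum of squares
sse = np.sum((y_reg - y_hat) ** 2)         # unexplained (error) sum of squares

print(sst, ssr + sse)                                   # equal up to floating point error
print(ssr / sst, 1 - sse / sst, model.score(x, y_reg))  # all three give R^2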

3.2.2 auc and roc curve

https://www.youtube.com/watch?v=OAl6eAyP-yo

An ROC curve is a commonly used way to visualize the performance of a binary classifier.

An ROC curve is a plot of TPR against FPR, with the direction of the axes shown below:

screenshot_2018-06-13_01-54-31.png

  ^ large TPR
  |
  |
  |
  |
  +------------> large FPR

roc curve:

  • y axis: True Positive Rate
  • x axis: False Positive Rate

What we want is a high TPR and a low FPR, so we want the roc curve to push toward the upper-left corner as far as possible.

3.2.2.1 study another plot

example one: screenshot_2018-06-13_01-07-06.png

Let's look at the whole plot (not the upper-left-corner roc curve) for whether a paper is admitted by a journal; a sketch that recreates this kind of plot follows the list below:

  • x-axis: predicted probabilities
  • y-axis: count of observations
  • pixel: each pixel represents a paper
  • blue and red: the true label of each paper (rejected or admitted)
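
A hedged sketch that recreates this kind of plot from made-up scores (the beta-distributed probabilities below are purely illustrative, not the data behind the screenshot):

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.RandomState(0)
prob_rejected = rng.beta(2, 5, size=500)  # hypothetical scores for truly rejected papers
prob_admitted = rng.beta(5, 2, size=500)  # hypothetical scores for truly admitted papers

bins = np.linspace(0, 1, 21)
plt.hist(prob_rejected, bins=bins, alpha=0.6, color="blue", label="rejected (true label)")
plt.hist(prob_admitted, bins=bins, alpha=0.6, color="red", label="admitted (true label)")
plt.xlabel("predicted probability of admission")
plt.ylabel("count of papers")
plt.legend(loc="best");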
3.2.2.2 meaning of each point in the plot
3.2.2.3 representation

example one: screenshot_2018-06-13_01-07-06.png

(0.1, 10) means:

  1. (the axes show) there are 10 papers for which you predicted an admission probability of 0.1
  2. (the region this point lies in shows) the true status of all 10 papers was negative

example two:

screenshot_2018-06-13_01-24-05.png

(0.5, 10) means:

  1. (the axes show) there are 20 papers (10 blue, 10 red) for which you predicted an admission probability of 0.5
  2. (the region this point lies in shows) the true status is that 10 papers were negative and 10 were positive.
3.2.2.4 threshold and TPR FPR

threshold = 0.5

screenshot_2018-06-13_01-32-51.png

All samples with a predicted probability above the threshold are predicted positive; all samples with a predicted probability below the threshold are predicted negative.

This means that:

all red pixels to the right of the line are correct predictions; all blue pixels to the left of the line are correct predictions;

accuracy = fraction of correct predictions

[true positive: actually positive, and you judge it positive] TPR = (red region to the right of the threshold) / (whole red region); TPR = TP / (TP + FN) = true positive / (true positive + false negative); TPR = recall

[false positive: actually negative, but you judge it positive] FPR = (blue region to the right of the threshold) / (whole blue region); FPR = FP / (FP + TN) = false positive / (false positive + true negative)
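
A minimal sketch with made-up predicted probabilities: fixing a single threshold (0.5) yields exactly one (TPR, FPR) pair, while sklearn's roc_curve sweeps over all thresholds at once.

import numpy as np
from sklearn.metrics import roc_curve

y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_prob_toy = np.array([0.1, 0.3, 0.4, 0.6, 0.45, 0.55, 0.8, 0.9])  # hypothetical scores

threshold = 0.5
y_pred_toy = y_prob_toy > threshold
tp = np.sum((y_true_toy == 1) & y_pred_toy)   # true positives
fn = np.sum((y_true_toy == 1) & ~y_pred_toy)  # false negatives
fp = np.sum((y_true_toy == 0) & y_pred_toy)   # false positives
tn = np.sum((y_true_toy == 0) & ~y_pred_toy)  # true negatives
print("TPR:", tp / (tp + fn), "FPR:", fp / (fp + tn))  # 0.75 and 0.25

# roc_curve evaluates every possible threshold at once
fpr, tpr, thresholds = roc_curve(y_true_toy, y_prob_toy)
print(np.c_[thresholds, fpr, tpr])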

3.2.2.5 which factors affect the ROC curve

ROC curve is a plot of the TPR on the y-axis versus the FPR on the x-axis, for every possible threshold

One threshold corresponds to exactly one (TPR, FPR) pair.

Each pair of positive-class and negative-class score distributions forms one roc curve;

each choice of threshold on that pair of distributions gives one point on this roc curve;

The larger the distance between the predicted-probability distributions of the two classes (the x-axis shows the probabilities predicted by our model), ===> the more the roc curve pushes toward the upper-left corner, ===> the larger the AUC (area under the curve), ===> the better the model is.
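
A hedged sketch of this claim, using made-up, normally distributed scores for the two classes: as the separation between the two score distributions grows, the AUC grows as well.

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
y_toy = np.r_[np.zeros(1000), np.ones(1000)]

for separation in [0.5, 1.0, 3.0]:
    scores = np.r_[rng.normal(0.0, 1.0, size=1000),         # negative class scores
                   rng.normal(separation, 1.0, size=1000)]  # positive class scores
    print("separation %.1f -> AUC %.3f" % (separation, roc_auc_score(y_toy, scores)))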

screenshot_2018-06-13_02-10-05.png screenshot_2018-06-13_01-39-58.png screenshot_2018-06-13_02-09-20.png

3.2.2.6 why roc is better than misclassification rate

An roc curve visualizes ALL possible thresholds, while the misclassification rate only uses a SINGLE threshold.

3.2.2.7 3 more benefits of roc curves
  1. better for imbalanced datasets, because class imbalance doesn't change how the roc curve is generated; roc only cares about the RATEs.
  2. can handle multi-class classification with the 'one vs. all' approach, where you should draw 3 curves instead of 1 (see the sketch after this list):
    • 1st curve: class 1 (positive) vs. classes 2 and 3 (negative);
    • 2nd curve: class 2 (positive) vs. classes 1 and 3 (negative);
    • 3rd curve: class 3 (positive) vs. classes 1 and 2 (negative);
  3. lets you choose a threshold to minimize the FPR or maximize the TPR
    1. minimize FPR: VIP client admission
    2. maximize TPR: AIDS testing
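
A minimal sketch of the 'one vs. all' approach from point 2, drawing one roc curve per class; the iris dataset and logistic regression here are assumptions for illustration, any 3-class problem and probabilistic classifier would do:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

X_iris, y_iris = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X_iris, y_iris, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)  # one probability column per class

for c in range(3):
    # class c is "positive", the other two classes together are "negative"
    fpr, tpr, _ = roc_curve(y_test == c, proba[:, c])
    plt.plot(fpr, tpr, label="class %d vs. rest" % c)

plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend(loc="best");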