
scikit-learn Notes 16: Understanding Linear Models in Depth


%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

1 Linear Regression

Linear models are useful when little data is available or for very large feature spaces as in text classification. In addition, they form a good case study for regularization.

1.1 what is regularization

All linear models for regression learn a coefficient parameter coef_ and an offset intercept_ to make predictions using a linear combination of features:

regression fn model:

y_pred = x_test[0] * coef_[0] + … + x_test[n_features-1] * coef_[n_features-1] + intercept_
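
To make this concrete, here is a minimal check on throwaway toy data (generated here only for the check, not the data set used below) that predict() is exactly this linear combination:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# fit on small toy data and recompute the predictions by hand
X_demo, y_demo = make_regression(n_samples=10, n_features=3, random_state=0)
model = LinearRegression().fit(X_demo, y_demo)
manual_pred = X_demo @ model.coef_ + model.intercept_
assert np.allclose(manual_pred, model.predict(X_demo))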

1.1.1 difference due to regularization

The difference between the linear models for regression is what kind of restrictions or penalties are put on coef_ as regularization, in addition to fitting the training data well.

1.1.2 linear regression is bad due to no regularization

The most standard linear model is 'ordinary least squares' regression, often simply called 'linear regression'. It doesn't put any additional restrictions on coef_, so when the number of features is large, it becomes ill-posed and the model overfits.

Let us generate a simple simulation, to see the behavior of these models.

make_regression can create a data set.

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
X, y, true_coefficient = make_regression(n_samples=200, n_features=30, n_informative=10, noise=100, coef=True, random_state=5)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=5, train_size=60, test_size=140)
print(X_train.shape) # (60,30)
print(y_train.shape) # (60,)

1.2 Linear Regression (without Regularization)

\(\text{min}_{w, b} \sum_i || w^\mathsf{T}x_i + b - y_i||^2\)

About score, which is how a model is evaluated:

  • every model has a built-in method, called as model_name.score(data, labels)
  • many other evaluation functions live in sklearn.metrics, called as scorer_name(y_true, y_pred), like r2_score, adjusted_rand_score, etc.

from sklearn.linear_model import LinearRegression
linear_regression = LinearRegression().fit(X_train, y_train)
print("R^2 on training set: %f" % linear_regression.score(X_train, y_train))
print("R^2 on test set: %f" % linear_regression.score(X_test, y_test))
from sklearn.metrics import r2_score
print(r2_score(np.dot(X, true_coefficient), y)) # R^2 between the noiseless true model and the noisy targets
plt.figure(figsize=(10, 5))
print (true_coefficient.shape)
coefficient_sorting = np.argsort(true_coefficient)[::-1] #<- reverse argsort's result to get descending order
plt.plot(true_coefficient[coefficient_sorting], "o", label="true")
plt.plot(linear_regression.coef_[coefficient_sorting], "o", label="linear regression")
plt.legend()

3199P4I.png

from sklearn.model_selection import learning_curve
def plot_learning_curve(est, X, y):
    training_set_size, train_scores, test_scores = learning_curve(est, X, y, train_sizes=np.linspace(.1, 1, 20))
    estimator_name = est.__class__.__name__
    line = plt.plot(training_set_size, train_scores.mean(axis=1), '--', label="training scores " + estimator_name)
    plt.plot(training_set_size, test_scores.mean(axis=1), '-', label="test scores " + estimator_name, c=line[0].get_color())
    plt.xlabel('Training set size')
    plt.legend(loc='best')
    plt.ylim(-0.1, 1.1)
plt.figure()
plot_learning_curve(LinearRegression(), X, y)

3199pTJ.png

1.3 Ridge Regression (L2 Regularization)

The Ridge estimator is a simple regularization (called the l2 penalty) of the ordinary LinearRegression. In particular, it has the benefit of being no more computationally expensive than the ordinary least squares estimate.

\[ \text{min}_{w,b} \sum_i || w^\mathsf{T}x_i + b - y_i||^2 + \alpha ||w||_2^2\]

The amount of regularization is set via the alpha parameter of the Ridge.

from sklearn.linear_model import Ridge
ridge_models = {}
training_scores = []
test_scores = []
for alpha in [100, 10, 1, .01]:
    ridge = Ridge(alpha=alpha).fit(X_train, y_train)
    training_scores.append(ridge.score(X_train, y_train))
    test_scores.append(ridge.score(X_test, y_test))
    ridge_models[alpha] = ridge
plt.figure()
plt.plot(training_scores, label="training scores")
plt.plot(test_scores, label="test scores")
plt.xticks(range(4), [100, 10, 1, .01])
plt.legend(loc="best")

31992dP.png

plt.figure(figsize=(10, 5))
plt.plot(true_coefficient[coefficient_sorting], "o", label="true", c='b')
for i, alpha in enumerate([100, 10, 1, .01]):
    plt.plot(ridge_models[alpha].coef_[coefficient_sorting],
             "o",
             label="alpha = %.2f" % alpha,
             c=plt.cm.summer(i / 3.) #<- how to give a gradually changed color
    )
plt.legend(loc="best")

3199d8h.png

Tuning alpha is critical for performance.
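
One hedged way to tune it (not part of the original notebook) is to let cross-validation choose alpha, for example with RidgeCV; the alpha grid below is arbitrary:

from sklearn.linear_model import RidgeCV

# assumes X_train, y_train, X_test, y_test from the split above
ridge_cv = RidgeCV(alphas=[100, 10, 1, 0.1, 0.01]).fit(X_train, y_train)
print("chosen alpha: %s" % ridge_cv.alpha_)
print("R^2 on test set: %f" % ridge_cv.score(X_test, y_test))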

plt.figure()
plot_learning_curve(LinearRegression(), X, y)
plot_learning_curve(Ridge(alpha=10), X, y)

3199qGo.png

1.4 Lasso (L1 Regularization)

The Lasso estimator is useful for imposing sparsity on the coefficients. In other words, it is to be preferred if we believe that many of the features are not relevant. This is done via the so-called l1 penalty.

\(\text{min}_{w, b} \sum_i \frac{1}{2} || w^\mathsf{T}x_i + b - y_i||^2 + \alpha ||w||_1\)

from sklearn.linear_model import Lasso
lasso_models = {}
training_scores = []
test_scores = []
for alpha in [30, 10, 1, .01]:
    lasso = Lasso(alpha=alpha).fit(X_train, y_train)
    training_scores.append(lasso.score(X_train, y_train))
    test_scores.append(lasso.score(X_test, y_test))
    lasso_models[alpha] = lasso
plt.figure()
plt.plot(training_scores, label="training scores")
plt.plot(test_scores, label="test scores")
plt.xticks(range(4), [30, 10, 1, .01])
plt.legend(loc="best")

25041f_2.png

plt.figure(figsize=(10, 5))
plt.plot(true_coefficient[coefficient_sorting], "o", label="true", c='b')
for i, alpha in enumerate([30, 10, 1, .01]):
    plt.plot(lasso_models[alpha].coef_[coefficient_sorting], "o", label="alpha = %.2f" % alpha, c=plt.cm.summer(i / 3.))
plt.legend(loc="best")

25041RJG.png

plt.figure(figsize=(10, 5))
plot_learning_curve(LinearRegression(), X, y)
plot_learning_curve(Ridge(alpha=10), X, y)
plot_learning_curve(Lasso(alpha=10), X, y)

25041eTM.png

Instead of picking Ridge or Lasso, you can also use ElasticNet, which uses both forms of regularization and provides a parameter to assign a weighting between them. ElasticNet typically performs the best amongst these models.
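
As a rough sketch (the alpha and l1_ratio values below are illustrative, not tuned; l1_ratio is the weighting parameter mentioned above), ElasticNet is used just like Ridge and Lasso:

from sklearn.linear_model import ElasticNet

# assumes X_train, y_train, X_test, y_test from the split above;
# l1_ratio=0.5 gives equal weight to the l1 and l2 penalties
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X_train, y_train)
print("R^2 on training set: %f" % enet.score(X_train, y_train))
print("R^2 on test set: %f" % enet.score(X_test, y_test))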

2 Linear Classification

2.1 Bi-class linear classification

2.1.1 regression model vs. classification model

Regression fn model:


y_pred = x_test[0] * coef_[0] + … + x_test[n_features-1] * coef_[n_features-1] + intercept_

All linear models for classification learn a coefficient parameter coef_ and an offset intercept_ to make predictions using a linear combination of features:

Classification fn model:


y_pred = x_test[0] * coef_[0] + … + x_test[n_features-1] * coef_[n_features-1] + intercept_ > 0

As you can see, this is very similar to regression, only that a threshold at zero is applied.
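
A minimal check of this thresholding on toy binary data (generated here only for the check), using LinearSVC as the example classifier: the predicted class is just the sign of the linear score.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

# two blobs, labels 0 and 1
X_demo, y_demo = make_blobs(centers=2, random_state=0)
clf = LinearSVC().fit(X_demo, y_demo)
scores = X_demo @ clf.coef_.ravel() + clf.intercept_
assert np.array_equal(clf.predict(X_demo), (scores > 0).astype(int))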

Again, the difference between the linear models for classification is what kind of regularization is put on coef_ and intercept_, but there are also minor differences in how the fit to the training set is measured (the so-called loss function).

The two most common models for linear classification are the linear SVM as implemented in LinearSVC and LogisticRegression.

A good intuition for regularization of linear classifiers is that with high regularization, it is enough if most of the points are classified correctly. But with less regularization, more importance is given to each individual data point. This is illustrated using a linear SVM with different values of C below.

2.1.2 The influence of C in LinearSVC

In LinearSVC, the C parameter controls the regularization within the model.

Lower C entails more regularization and simpler models, whereas higher C entails less regularization and more influence from individual data points.

from figures import plot_linear_svc_regularization
plot_linear_svc_regularization()

3199Eb0.png

2.1.3 l1 regularization vs. l2 regularization

Similar to the Ridge/Lasso separation, you can set the penalty parameter to:

  • 'l1' to enforce sparsity of the coefficients (similar to Lasso)
  • 'l2' to encourage smaller coefficients (similar to Ridge).

We can see:

| regularization | applied on | result       |
|----------------+------------+--------------|
| l1             | coef_      | sparse coef_ |
| l2             | coef_      | small coef_  |
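
A hedged sketch of this difference for classifiers (assuming the liblinear solver, which supports both penalties; C=0.1 is an arbitrary illustrative value): with the l1 penalty most coefficients typically become exactly zero, while the l2 penalty only shrinks them.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=100, n_features=20,
                                     n_informative=5, random_state=0)
for penalty in ['l1', 'l2']:
    clf = LogisticRegression(penalty=penalty, C=0.1,
                             solver='liblinear').fit(X_demo, y_demo)
    # count how many coefficients are exactly zero under each penalty
    print(penalty, "zero coefficients:", np.sum(clf.coef_ == 0))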

2.2 Multi-class linear classification

from sklearn.datasets import make_blobs
plt.figure()
X, y = make_blobs(random_state=42)
print (X.shape, y.shape) #<- (100, 2) (100,): 100 samples, two features
print (X, y)             #<- the raw data and cluster labels
plt.scatter(X[:, 0], X[:, 1], c=plt.cm.spectral(y / 2.));

3199Q5P.png

from sklearn.svm import LinearSVC
linear_svm = LinearSVC().fit(X, y)
print(linear_svm.coef_.shape)      # (3, 2): one row per class, one column per feature
print(linear_svm.intercept_.shape) # (3,): one offset per class

2.2.1 shape of coef_ and intercept_

You can see that LinearSVC fits one separating line per class (one-vs-rest): to separate 2 classes you need 1 line (the binary special case), 3 classes need 3 lines, 4 classes need 4 lines, and so on.

So coef_ has shape (3, 2) in this case: one row (one line) per class and one column per feature; likewise intercept_ has shape (3,).

2.2.2 how to plot these 3 lines

You should note why we use the code shown below to plot the separating lines. As we said, the fn model for classification is:

y_pred = x_test[0] * coef_[0] + … + x_test[n_features-1] * coef_[n_features-1] + intercept_ > 0

We get a 2d coef_ for each line (in total we have 3 lines). For each line, we take one dimension of the 2d point (x1, x2) as the x-axis and the other as the y-axis to plot the line, for example:

  • x1 –> x-axis
  • x2 –> y-axis

In other words, this means that we take the point's

  • x1 as 'x', the independent variable;
  • x2 as 'y', the dependent variable;

x_test[0] * coef_[0] + x_test[1] * coef_[1] + intercept_ = 0

x_test[1] = - (x_test[0] * coef_[0] + intercept_) / coef_[1]

This is the origin of the code below:

plt.plot(line, -(line * coef[0] + intercept) / coef[1])

plt.scatter(X[:, 0], X[:, 1], c=plt.cm.spectral(y / 2.))
line = np.linspace(-15, 15)
for coef, intercept in zip(linear_svm.coef_, linear_svm.intercept_):
    plt.plot(line, -(line * coef[0] + intercept) / coef[1])
plt.ylim(-10, 15)
plt.xlim(-10, 8);

25041Fye.png

Points are classified in a one-vs-rest fashion (aka one-vs-all), where we assign a test point to the class whose model has the highest confidence (in the SVM case, highest distance to the separating hyperplane) for the test point.
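
A quick sanity check of this rule (assuming linear_svm and X from the cells above): predict() returns the class whose linear score is largest.

import numpy as np
decision = linear_svm.decision_function(X)   # shape (100, 3): one score per class
assert np.array_equal(linear_svm.predict(X), np.argmax(decision, axis=1))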

3 EXERCISE

  • Use LogisticRegression to classify the digits data set, and grid-search the C parameter.
  • How do you think the learning curves above change when you increase or decrease alpha? Try changing the alpha parameter in ridge and lasso, and see if your intuition was correct.

    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    digits = load_digits()
    X_digits, y_digits = digits.data, digits.target
    
    

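A minimal sketch of the first exercise, continuing from the starter cell above (the C grid below is arbitrary, and max_iter is raised only so the default solver converges on the unscaled digits data):

from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(LogisticRegression(max_iter=5000), param_grid, cv=5)
grid.fit(X_digits, y_digits)
print("best C:", grid.best_params_)
print("best cross-validation score: %f" % grid.best_score_)
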
4 Misc tools

4.1 scikit-learn

4.1.1 ML models by now

  1. from sklearn.datasets import make_blobs
  2. from sklearn.datasets import make_regression *
  3. from sklearn.datasets import load_iris
  4. from sklearn.datasets import load_digits
  5. from sklearn.model_selection import train_test_split
  6. from sklearn.model_selection import cross_val_score
  7. from sklearn.model_selection import KFold
  8. from sklearn.model_selection import StratifiedKFold
  9. from sklearn.model_selection import ShuffleSplit
  10. from sklearn.model_selection import GridSearchCV
  11. from sklearn.model_selection import learning_curve *
  12. from sklearn.linear_model import LogisticRegression
  13. from sklearn.linear_model import LinearRegression
  14. from sklearn.linear_model import Ridge *
  15. from sklearn.linear_model import Lasso *
  16. from sklearn.linear_model import ElasticNet *
  17. from sklearn.neighbors import KNeighborsClassifier
  18. from sklearn.neighbors import KNeighborsRegressor
  19. from sklearn.preprocessing import StandardScaler
  20. from sklearn.decomposition import PCA
  21. from sklearn.metrics import confusion_matrix, accuracy_score
  22. from sklearn.metrics import adjusted_rand_score
  23. from sklearn.metrics.scorer import SCORERS
  24. from sklearn.metrics import r2_score *
  25. from sklearn.cluster import KMeans
  26. from sklearn.cluster import KMeans
  27. from sklearn.cluster import MeanShift
  28. from sklearn.cluster import DBSCAN # <<< this algorithm has related sources in LIHONGYI's lecture-12
  29. from sklearn.cluster import AffinityPropagation
  30. from sklearn.cluster import SpectralClustering
  31. from sklearn.cluster import Ward
  32. from sklearn.metrics import confusion_matrix
  33. from sklearn.metrics import accuracy_score
  34. from sklearn.metrics import adjusted_rand_score
  35. from sklearn.metrics import classification_report
  36. from sklearn.feature_extraction import DictVectorizer
  37. from sklearn.feature_extraction.text import CountVectorizer
  38. from sklearn.feature_extraction.text import TfidfVectorizer
  39. from sklearn.preprocessing import Imputer
  40. from sklearn.dummy import DummyClassifier
  41. from sklearn.pipeline import make_pipeline
  42. from sklearn.svm import LinearSVC
  43. from sklearn.svm import SVC

4.1.2 make_regression

Generate a random regression problem.

The input set can either be well conditioned (by default) or have a low rank-fat tail singular profile. See make_low_rank_matrix for more details.

The output is generated by applying a (potentially biased) random linear regression model with n_informative nonzero regressors to the previously generated input and some gaussian centered noise with some adjustable scale.

make_regression(n_samples=100,      #<- number of samples.
                n_features=100,     #<- number of features.
                n_informative=10,   #<- number of truly useful features
                n_targets=1,        #<- the dimension of y output
                bias=0.0,           #<- bias term in underlying linear model
                effective_rank=None,#<- The approximate number of singular
                                    #vectors required to explain most of the
                                    #input data by linear combinations.
                tail_strength=0.5,
                noise=0.0,          #<- The standard deviation of the gaussian
                                    #noise applied to the output.
                shuffle=True,
                coef=False,         #<- whether to return the coef of the
                                    #underlying linear model
                random_state=None)

4.1.3 learning_curve

Learning curve.

Determines cross-validated training and test scores for different training set sizes.

A cross-validation generator splits the whole dataset k times in training and test data. Subsets of the training set with varying sizes will be used to train the estimator and a score for each training subset size and the test set will be computed. Afterwards, the scores will be averaged over all k runs for each training subset size.

training_set_size, train_scores, test_scores = learning_curve(
    est,  #<- the ML model used to predict
    X,    #<- dataset passed to this model
    y,    #<- labels passed to this model
    train_sizes=np.linspace(.1, 1, 20) #<- common usage: np.linspace(.1, 1, num)
)


4.1.3.1 return

train_sizes_abs : array, shape = (n_unique_ticks,), dtype int

Numbers of training examples that have been used to generate the learning curve. Note that the number of ticks might be less than n_ticks because duplicate entries will be removed.

train_scores : array, shape (n_ticks, n_cv_folds)

Scores on training sets.

test_scores : array, shape (n_ticks, n_cv_folds)

Scores on test set.
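
A quick shape check (assuming Ridge and the X, y generated earlier in this note; cv=5 gives 5 folds):

from sklearn.model_selection import learning_curve
from sklearn.linear_model import Ridge
import numpy as np

sizes, train_scores, test_scores = learning_curve(
    Ridge(alpha=10), X, y, train_sizes=np.linspace(.1, 1, 5), cv=5)
print(sizes.shape, train_scores.shape, test_scores.shape)  # (5,) (5, 5) (5, 5)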

4.1.4 Ridge

Ridge is another linear model, and it has the same interface as LinearRegression for method invocation:

  • ridge = Ridge()
  • ridge.fit(X,y)
  • ridge.predict(X)
  • ridge.score(X,y) : model_name.score(X,y)

4.1.5 Lasso

Lasso is another linear model, and it has the same interface as LinearRegression for method invocation:

  • lasso = Lasso()
  • lasso.fit(X,y)
  • lasso.predict(X)
  • lasso.score(X,y) : model_name.score(X,y)

4.2 Linear algebra

4.2.1 SVD

The singular value decomposition of a matrix A is the factorization of A into the product of three matrices \(A = UDV^T\) where the columns of U and V are orthonormal and the matrix D is diagonal with positive real entries.
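
A minimal numpy check of this factorization (the matrix below is random and only for illustration):

import numpy as np

A = np.random.RandomState(0).rand(4, 3)
U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = U @ diag(s) @ Vt
assert np.allclose(A, U @ np.diag(s) @ Vt)
print(s)                                          # singular values, in descending order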

4.3 scikit-learn user guide

4.3.1 5.4. Sample generators

http://scikit-learn.org/stable/datasets/index.html#sample-generators

In addition, scikit-learn includes various random sample generators that can be used to build artificial datasets of controlled size and complexity.

4.3.1.1 5.4.1. Generators for classification and clustering

These generators produce a matrix of features and corresponding discrete targets.

4.3.1.2 5.4.1.1. Single label

Both make_blobs and make_classification create multiclass datasets by allocating each class one or more normally-distributed clusters of points.

make_blobs provides greater control regarding the centers and standard deviations of each cluster, and is used to demonstrate clustering.

make_classification specialises in introducing noise by way of: correlated, redundant and uninformative features; multiple Gaussian clusters per class; and linear transformations of the feature space.

make_gaussian_quantiles divides a single Gaussian cluster into near-equal-size classes separated by concentric hyperspheres.

make_hastie_10_2 generates a similar binary, 10-dimensional problem.

../_images/sphx_glr_plot_random_dataset_0011.png

make_circles and make_moons generate 2d binary classification datasets that are challenging to certain algorithms (e.g. centroid-based clustering or linear classification), including optional Gaussian noise. They are useful for visualisation. make_circles produces Gaussian data with a spherical decision boundary for binary classification, while make_moons produces two interleaving half circles.

4.3.1.3 5.4.1.2. Multilabel

make_multilabel_classification generates random samples with multiple labels, reflecting a bag of words drawn from a mixture of topics. The number of topics for each document is drawn from a Poisson distribution, and the topics themselves are drawn from a fixed random distribution. Similarly, the number of words is drawn from Poisson, with words drawn from a multinomial, where each topic defines a probability distribution over words. Simplifications with respect to true bag-of-words mixtures include:

  • Per-topic word distributions are independently drawn, where in reality all would be affected by a sparse base distribution, and would be correlated.
  • For a document generated from multiple topics, all topics are weighted equally in generating its bag of words.
  • Documents without labels have words drawn at random, rather than from a base distribution.

../_images/sphx_glr_plot_random_multilabel_dataset_0011.png

4.3.1.4 5.4.1.3. Biclustering

  • make_biclusters(shape, n_clusters[, noise, …]): Generate an array with constant block diagonal structure for biclustering.
  • make_checkerboard(shape, n_clusters[, …]): Generate an array with block checkerboard structure for biclustering.

4.3.1.5 5.4.2. Generators for regression

make_regression produces regression targets as an optionally-sparse random linear combination of random features, with noise. Its informative features may be uncorrelated, or low rank (few features account for most of the variance).

Other regression generators generate functions deterministically from randomized features.

make_sparse_uncorrelated produces a target as a linear combination of four features with fixed coefficients.

Others encode explicitly non-linear relations:

make_friedman1 is related by polynomial and sine transforms; make_friedman2 includes feature multiplication and reciprocation; and make_friedman3 is similar with an arctan transformation on the target.

4.3.1.6 5.4.3. Generators for manifold learning

  • make_s_curve([n_samples, noise, random_state]): Generate an S curve dataset.
  • make_swiss_roll([n_samples, noise, random_state]): Generate a swiss roll dataset.

4.3.1.7 5.4.4. Generators for decomposition

  • make_low_rank_matrix([n_samples, …]): Generate a mostly low rank matrix with bell-shaped singular values.
  • make_sparse_coded_signal(n_samples, …[, …]): Generate a signal as a sparse combination of dictionary elements.
  • make_spd_matrix(n_dim[, random_state]): Generate a random symmetric, positive-definite matrix.
  • make_sparse_spd_matrix([dim, alpha, …]): Generate a sparse symmetric positive-definite matrix.

4.4 Numpy

4.4.1 np.argsort

Returns the indices that would sort an array; for a multi-dimensional array it sorts along the last axis by default, so each row ('[]' denotes an array) is sorted independently.

Perform an indirect sort along the given axis using the algorithm specified by the kind keyword. It returns an array of indices of the same shape as a that index data along the given axis in sorted order.

import numpy as np
arr = np.array([[4,2,1,0],[5,9,8,7]])
argarr = np.argsort(arr)
argarr
array([[3, 2, 1, 0],
       [0, 3, 2, 1]])

4.5 Matplotlib

4.5.1 how to give gradually changing colors

plt.figure(figsize=(10, 5))
plt.plot(true_coefficient[coefficient_sorting], "o", label="true", c='b')
for i, alpha in enumerate([100, 10, 1, .01]):
    plt.plot(ridge_models[alpha].coef_[coefficient_sorting],
             "o",
             label="alpha = %.2f" % alpha,
             c=plt.cm.summer(i / 3.) #<- how to give a gradually changed color
    )
plt.legend(loc="best")

4.5.2 what are ticks

specify the tick positions on the axes

import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.set_xticks([0.15, 0.68, 0.97])
ax.set_yticks([0.2, 0.55, 0.76])
plt.show()

31993Qu.png