scikit-learn Notes 20: Unsupervised Learning - Nonlinear Dimensionality Reduction
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
1 Manifold Learning
1.1 PCA vs. Manifold learning
1.1.1 weakness of PCA
One weakness of PCA is that it cannot detect non-linear features. A set of algorithms known as Manifold Learning have been developed to address this deficiency. A canonical dataset used in Manifold learning is the S-curve:
from sklearn.datasets import make_s_curve
X, y = make_s_curve(n_samples=1000)

from mpl_toolkits.mplot3d import Axes3D
ax = plt.axes(projection='3d')
ax.scatter3D(X[:, 0], X[:, 1], X[:, 2], c=y)
ax.view_init(10, -60);
This is a 2-dimensional dataset embedded in three dimensions, but it is embedded in such a way that PCA cannot discover the underlying data orientation:
from sklearn.decomposition import PCA
X_pca = PCA(n_components=2).fit_transform(X)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y);
1.1.2 Manifold learning: overcoming PCA's weakness
Manifold learning algorithms, available in the sklearn.manifold submodule, are able to recover the underlying 2-dimensional manifold:
from sklearn.manifold import Isomap
iso = Isomap(n_neighbors=15, n_components=2)
X_iso = iso.fit_transform(X)
plt.scatter(X_iso[:, 0], X_iso[:, 1], c=y);
1.2 Manifold learning on the digits data
1.2.1 preview the digit image
We can apply manifold learning techniques to much higher dimensional datasets, for example the digits data that we saw before:
from sklearn.datasets import load_digits
digits = load_digits()

fig, axes = plt.subplots(2, 5, figsize=(10, 5),
                         subplot_kw={'xticks': (), 'yticks': ()})
for ax, img in zip(axes.ravel(), digits.images):
    ax.imshow(img, interpolation="none", cmap="gray")
1.2.2 visualize all digit images after PCA mapping
Idea: each 8*8 digit image (64 dimensions) --PCA--> a single 2-d point; drawing every sample as one point in the plane lets us see whether or not we can separate the different digits.
We can visualize the dataset using a linear technique, such as PCA. We saw this already provides some intuition about the data:
# build a PCA model
pca = PCA(n_components=2)
pca.fit(digits.data)
# transform the digits data onto the first two principal components
digits_pca = pca.transform(digits.data)

# set up 10 colors, one color for each digit from 0 ~ 9
colors = ["#476A2A", "#7851B8", "#BD3430", "#4A2D4E", "#875525",
          "#A83683", "#4E655E", "#853541", "#3A3120", "#535D8E"]

# set up figure size and x/y limits
plt.figure(figsize=(10, 10))
plt.xlim(digits_pca[:, 0].min(), digits_pca[:, 0].max() + 1)
plt.ylim(digits_pca[:, 1].min(), digits_pca[:, 1].max() + 1)

# use the PCA mapping of each digit image sample as the position;
# use the label of each digit image as the text to draw.
for i in range(len(digits.data)):
    # actually plot the digits as text instead of using scatter
    plt.text(digits_pca[i, 0],                # <- x coordinate given by the PCA model
             digits_pca[i, 1],                # <- y coordinate given by the PCA model
             str(digits.target[i]),           # <- the digit we want to draw
             color=colors[digits.target[i]],  # <- use the true label (0~9) as index into the colors array
             fontdict={'weight': 'bold', 'size': 9})
plt.xlabel("first principal component")
plt.ylabel("second principal component");
1.2.3 visualize all digit images after t-SNE mapping
Using a more powerful, nonlinear technique can provide much better visualizations, though. Here, we are using the t-SNE manifold learning method:
from sklearn.manifold import TSNE
tsne = TSNE(random_state=42)
# use fit_transform instead of fit, as TSNE has no transform method:
digits_tsne = tsne.fit_transform(digits.data)
# the same steps as a self-contained script
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()

# set up 10 colors, one color for each digit from 0 ~ 9
colors = ["#476A2A", "#7851B8", "#BD3430", "#4A2D4E", "#875525",
          "#A83683", "#4E655E", "#853541", "#3A3120", "#535D8E"]

# build a TSNE object
tsne = TSNE(random_state=42)
# use fit_transform instead of fit, as TSNE has no transform method:
# get the 2-d data points after the t-SNE mapping
digits_tsne = tsne.fit_transform(digits.data)

# set figure size and x/y limits
plt.figure(figsize=(10, 10))
plt.xlim(digits_tsne[:, 0].min(), digits_tsne[:, 0].max() + 1)
plt.ylim(digits_tsne[:, 1].min(), digits_tsne[:, 1].max() + 1)

# use the t-SNE mapping of each digit image sample as the position;
# use the label of each digit image as the text to draw.
for i in range(len(digits.data)):
    plt.text(digits_tsne[i, 0],               # <- x coordinate given by the t-SNE model
             digits_tsne[i, 1],               # <- y coordinate given by the t-SNE model
             str(digits.target[i]),           # <- the digit we want to draw
             color=colors[digits.target[i]],  # <- use the true label (0~9) as index into the colors array
             fontdict={'weight': 'bold', 'size': 9})
plt.show()
t-SNE has a somewhat longer runtime than other manifold learning algorithms, but the result is quite striking. Keep in mind that this algorithm is purely unsupervised and does not know about the class labels. Still, it is able to separate the classes very well (though the classes four, one, and nine have been split into multiple groups).
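To put a rough number on this separation, here is a minimal sketch of our own (not part of the original notes): cluster the embedding with KMeans and compare the clusters against the true labels with adjusted_rand_score; a score near 1 means the embedding separates the digits almost perfectly.

from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.metrics import adjusted_rand_score

digits = load_digits()
digits_tsne = TSNE(random_state=42).fit_transform(digits.data)

# cluster the 2-d embedding into 10 groups, one hoped-for group per digit
clusters = KMeans(n_clusters=10, random_state=42).fit_predict(digits_tsne)

# the adjusted Rand index compares the clusters to the true labels;
# 1.0 means perfect agreement
print("ARI on t-SNE embedding:", adjusted_rand_score(digits.target, clusters))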
1.2.4 general steps to reduce dimensionality and visualize with PCA or t-SNE
Here we just use PCA as the example; a reusable sketch follows the list below.
model
- build a PCA model
  - build the object: pca_obj = PCA(n_components=2)
  - object -> fitted model: pca = pca_obj.fit(X_train)
transform
- transform the digits with PCA
  - pca_data = pca.transform(X_train)
setup plot attr
- set up 10 colors, one color for each digit from 0 ~ 9
  - colors = ['', '', …, '']
- set the figure size and the x/y limits
  - plt.figure(figsize=(10, 10))
draw digits
- use the PCA mapping of each digit image sample as the position, and the label of each digit image as the text to draw: for i in range(len(digits.data)): plt.text(…)
  - the x coordinate given by the PCA model
    - pca_data[i, 0]
  - the y coordinate given by the PCA model
    - pca_data[i, 1]
  - the digit we want to draw
    - str(digits.target[i])
  - use the true label (0~9) as index into the colors array
    - color=colors[digits.target[i]]
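Here are those steps collected into a small helper, as a minimal sketch (the function name plot_digits_projection is our own, not from scikit-learn). Because it only assumes a 2-d array of positions, the same helper can be reused for t-SNE or Isomap output:

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

def plot_digits_projection(proj, labels, colors):
    """Draw each 2-d projected sample as its class label, colored by class."""
    plt.figure(figsize=(10, 10))
    plt.xlim(proj[:, 0].min(), proj[:, 0].max() + 1)
    plt.ylim(proj[:, 1].min(), proj[:, 1].max() + 1)
    for i in range(len(proj)):
        plt.text(proj[i, 0], proj[i, 1], str(labels[i]),
                 color=colors[labels[i]],
                 fontdict={'weight': 'bold', 'size': 9})

digits = load_digits()
colors = ["#476A2A", "#7851B8", "#BD3430", "#4A2D4E", "#875525",
          "#A83683", "#4E655E", "#853541", "#3A3120", "#535D8E"]
pca_data = PCA(n_components=2).fit_transform(digits.data)
plot_digits_projection(pca_data, digits.target, colors)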
2 Exercise
EXERCISE:
- Compare the results of applying Isomap to the digits dataset to the results of PCA and t-SNE. Which result do you think looks best?
- Given how well t-SNE separated the classes, one might be tempted to use this processing for classification. Try training a K-nearest-neighbor classifier on the digits data transformed with t-SNE, and compare to the accuracy obtained on the dataset without any transformation (a starting-point sketch follows below).
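A possible starting point for the second exercise, as a minimal sketch rather than the canonical solution. Note that because TSNE has no transform method, the sketch embeds the full dataset before splitting; the embedding therefore sees the test points, though never their labels:

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()

# t-SNE cannot project unseen points, so embed everything up front
digits_tsne = TSNE(random_state=42).fit_transform(digits.data)

for name, features in [("raw 64-d pixels", digits.data),
                       ("t-SNE 2-d embedding", digits_tsne)]:
    X_train, X_test, y_train, y_test = train_test_split(
        features, digits.target, random_state=0)
    knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
    print(name, "accuracy:", knn.score(X_test, y_test))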
3 Misc tools
3.1 scikit-learn
3.1.1 imports by now
- from sklearn.datasets import make_blobs
- from sklearn.datasets import make_moons
- from sklearn.datasets import make_circles
- from sklearn.datasets import make_s_curve *
- from mpl_toolkits.mplot3d import Axes3D *
- from sklearn.datasets import make_regression
- from sklearn.datasets import load_iris
- from sklearn.datasets import load_digits
- from sklearn.datasets import load_breast_cancer
- from sklearn.model_selection import train_test_split
- from sklearn.model_selection import cross_val_score
- from sklearn.model_selection import KFold
- from sklearn.model_selection import StratifiedKFold
- from sklearn.model_selection import ShuffleSplit
- from sklearn.model_selection import GridSearchCV
- from sklearn.model_selection import learning_curve
- from sklearn.feature_extraction import DictVectorizer
- from sklearn.feature_extraction.text import CountVectorizer
- from sklearn.feature_extraction.text import TfidfVectorizer
- from sklearn.feature_selection import SelectPercentile
- from sklearn.feature_selection import f_classif
- from sklearn.feature_selection import f_regression
- from sklearn.feature_selection import chi2
- from sklearn.feature_selection import SelectFromModel
- from sklearn.feature_selection import RFE
- from sklearn.linear_model import LogisticRegression
- from sklearn.linear_model import LinearRegression
- from sklearn.linear_model import Ridge
- from sklearn.linear_model import Lasso
- from sklearn.linear_model import ElasticNet
- from sklearn.neighbors import KNeighborsClassifier
- from sklearn.neighbors import KNeighborsRegressor
- from sklearn.preprocessing import StandardScaler
- from sklearn.metrics import confusion_matrix, accuracy_score
- from sklearn.metrics import adjusted_rand_score
- from sklearn.metrics.scorer import SCORERS
- from sklearn.metrics import r2_score
- from sklearn.cluster import KMeans
- from sklearn.cluster import MeanShift
- from sklearn.cluster import DBSCAN # <<< this algorithm has related sources in LIHONGYI's lecture-12
- from sklearn.cluster import AffinityPropagation
- from sklearn.cluster import SpectralClustering
- from sklearn.cluster import Ward
- from sklearn.cluster import AgglomerativeClustering
- from scipy.cluster.hierarchy import linkage
- from scipy.cluster.hierarchy import dendrogram
- from sklearn.metrics import classification_report
- from sklearn.preprocessing import Imputer
- from sklearn.dummy import DummyClassifier
- from sklearn.pipeline import make_pipeline
- from sklearn.svm import LinearSVC
- from sklearn.svm import SVC
- from sklearn.tree import DecisionTreeRegressor
- from sklearn.ensemble import RandomForestClassifier
- from sklearn.ensemble import GradientBoostingRegressor
- from sklearn.decomposition import PCA *
- from sklearn.manifold import TSNE *
- from sklearn.manifold import Isomap *
3.2 Matplotlib
3.2.1 modules by now
from mpl_toolkits.mplot3d import Axes3D *
4 code snippet
4.1 how to draw one digit per subplot
from sklearn.datasets import load_digits
digits = load_digits()

fig, axes = plt.subplots(2, 5, figsize=(10, 5),
                         subplot_kw={'xticks': (), 'yticks': ()})
for ax, img in zip(axes.ravel(), digits.images):
    ax.imshow(img, interpolation="none", cmap="gray")