scikit Notes 19: Unsupervised Learning - Hierarchical and Density-Based Clustering Methods
Table of Contents
- 1. Unsupervised learning: Hierarchical and density-based clustering algorithms
- 2. Hierarchical Clustering
- 3. Density-based Clustering - DBSCAN
- 4. Exercise
- 5. Misc tools
%matplotlib inline
import numpy as np
from matplotlib import pyplot as plt
1 Unsupervised learning: Hierarchical and density-based clustering algorithms
1.1 Two clustering methods that address the weaknesses of K-Means
In a previous notebook, "08 Unsupervised Learning - Clustering.ipynb", we introduced one of the essential and widely used clustering algorithms, K-means. One of the advantages of K-means is that it is extremely easy to implement, and it is also computationally very efficient compared to other clustering algorithms.
However, we've seen that one of the weaknesses of K-Means is that it only works well if the data can be grouped into a globular or spherical shape. Also, we have to assign the number of clusters, k, a priori – this can be a problem if we have no prior knowledge about how many clusters we expect to find.
Summarizing, K-means has four shortcomings:
- Incorrect number of clusters (can be mitigated by the elbow method; see the sketch after this list)
- Anisotropically distributed data
- Clusters with different variances
- Unevenly sized blobs
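As a quick aside (a minimal sketch, not from the original notebook), the elbow method mentioned in the first bullet fits K-means for a range of k values and looks for the "kink" in the inertia curve; the blob parameters below are made up purely for illustration:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with 3 well-separated blobs (parameters chosen only for illustration)
X_blobs, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Fit K-means for k = 1..8 and record the inertia (within-cluster sum of squared distances)
inertias = [KMeans(n_clusters=k, random_state=0).fit(X_blobs).inertia_ for k in range(1, 9)]

plt.plot(range(1, 9), inertias, marker='o')
plt.xlabel('number of clusters k')
plt.ylabel('inertia')
plt.title('Elbow method');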
In this notebook, we will take a look at 2 alternative approaches to clustering,
- hierarchical clustering
- density-based clustering.
2 Hierarchical Clustering
One nice feature of hierarchical clustering is that we can visualize the results as a dendrogram, a hierarchical tree. Using this visualization, we can then decide how "deep" we want to cluster the dataset by setting a depth threshold. In other words, hierarchical clustering overcomes one weakness of K-means: we don't need to make a decision about the number of clusters upfront.
2.1 Agglomerative and divisive hierarchical clustering
Furthermore, we can distinguish between 2 main approaches to hierarchical clustering:
- Divisive clustering
- Agglomerative clustering.
In agglomerative clustering, every sample starts out as its own cluster, and we iteratively merge the most similar pairs of clusters – a bottom-up approach to building the clustering dendrogram.
In divisive clustering, however, we start with the whole dataset as one cluster, and we iteratively split it into smaller subclusters – a top-down approach.
In this notebook, we will use agglomerative clustering.
2.2 Agglomerative hierarchical clustering
2.2.1 How to measure similarity between two samples
Now, the next question is how we measure the similarity between samples. One approach is the familiar Euclidean distance metric that we already used in the K-Means algorithm. As a refresher, the distance between two m-dimensional vectors \(\mathbf{p}\) and \(\mathbf{q}\) can be computed as:
\begin{align} \mathrm{d}(\mathbf{q},\mathbf{p}) & = \sqrt{(q_1-p_1)^2 + (q_2-p_2)^2 + \cdots + (q_m-p_m)^2} \\[8pt] & = \sqrt{\sum_{j=1}^m (q_j-p_j)^2}.\end{align}

However, that's the distance between 2 samples. Now, how do we compute the similarity between subclusters of samples in order to decide which clusters to merge when constructing the dendrogram? I.e., our goal is to iteratively merge the most similar pairs of clusters until only one big cluster remains. There are many different approaches to this, for example single and complete linkage.
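As a quick numerical check of the formula (not part of the original notebook; p and q are arbitrary example vectors):

p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 3.0])

# Explicit formula: square root of the sum of squared coordinate differences
d_manual = np.sqrt(np.sum((q - p) ** 2))

# Equivalent NumPy shortcut
d_norm = np.linalg.norm(q - p)

print(d_manual, d_norm)  # both print 5.0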
2.2.2 How to measure similarity between two clusters
Single and complete linkage
In single linkage, we compute the distance between the two most similar samples across a pair of clusters (based on the Euclidean distance, for example), i.e., one sample from one cluster and one sample from the other. We then merge the two clusters whose most similar members are closest into one new, bigger cluster.
In complete linkage, we instead compare the two most dissimilar members of each pair of clusters, and we merge the two clusters whose most dissimilar members are closest to each other.
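To make the two linkage criteria concrete, here is a small sketch (toy clusters with made-up values, not from the original notebook) that computes both linkage distances from the full matrix of pairwise distances:

from scipy.spatial.distance import cdist

# Two small toy clusters in 2D (values chosen only for illustration)
cluster_a = np.array([[0.0, 0.0], [0.0, 1.0]])
cluster_b = np.array([[2.0, 0.0], [5.0, 5.0]])

# All pairwise Euclidean distances between members of the two clusters
pairwise = cdist(cluster_a, cluster_b, metric='euclidean')

print('single linkage distance:  ', pairwise.min())   # closest pair across the clusters
print('complete linkage distance:', pairwise.max())   # most dissimilar pair across the clusters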
2.2.3 Applying agglomerative clustering to a labeled dataset
Why use a labeled dataset? Because it makes it convenient to check the clustering result against the known labels.
To see the agglomerative, hierarchical clustering approach in action, let us load the familiar Iris dataset – pretending we don't know the true class labels and want to find out how many different flower species it consists of:
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data[:, [2, 3]]  # <- use only the 3rd and 4th features (easy to visualize)
y = iris.target
n_samples, n_features = X.shape

plt.scatter(X[:, 0], X[:, 1], c=y);
2.2.3.1 Draw a dendrogram with scipy.cluster.hierarchy.dendrogram after computing the linkage with scipy.cluster.hierarchy.linkage
First, we start with some exploratory clustering, visualizing the clustering dendrogram using SciPy's linkage and dendrogram functions:
from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import dendrogram

clusters = linkage(X,
                   metric='euclidean',  # <- metric used to evaluate similarity between samples
                   method='complete')   # <- linkage criterion used to combine two clusters

dendr = dendrogram(clusters)
plt.ylabel('Euclidean Distance');
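Once the linkage matrix is computed, we can also "cut" the dendrogram at a distance threshold instead of fixing the number of clusters upfront, which is exactly the depth-threshold idea mentioned earlier. A minimal sketch using SciPy's fcluster; the threshold value 3.0 is an assumption picked by eye from the dendrogram:

from scipy.cluster.hierarchy import fcluster

# Assign a flat cluster label to every sample by cutting the tree at distance 3.0
flat_labels = fcluster(clusters, t=3.0, criterion='distance')
print('number of clusters at this threshold:', len(np.unique(flat_labels)))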
2.2.3.2 Build an agglomerative model with sklearn.cluster.AgglomerativeClustering
Next, let's use the AgglomerativeClustering estimator from scikit-learn and divide the dataset into 3 clusters. Can you guess which 3 clusters from the dendrogram it will reproduce?
from sklearn.cluster import AgglomerativeClustering

ac = AgglomerativeClustering(n_clusters=3,
                             affinity='euclidean',  # <- metric used to compute similarity
                             linkage='complete')    # <- linkage criterion used to combine two clusters
prediction = ac.fit_predict(X)
print('Cluster labels: %s\n' % prediction)

plt.scatter(X[:, 0], X[:, 1], c=prediction);
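Since we kept the true Iris species labels y around (this is why a labeled dataset is convenient), we can quantify how well the 3 clusters match the species using the adjusted Rand index. A minimal check, not part of the original notebook; no particular score is claimed here:

from sklearn.metrics import adjusted_rand_score

# 1.0 means the clustering matches the true species labels perfectly, ~0.0 is chance level
print('Adjusted Rand index vs. true labels:', adjusted_rand_score(y, prediction))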
3 Density-based Clustering - DBSCAN
3.0.1 What is DBSCAN
Another useful approach to clustering is Density-based Spatial Clustering of Applications with Noise (DBSCAN).
In essence, we can think of DBSCAN as an algorithm that divides the dataset into subgroups based on dense regions of points.
3.0.2 Three kinds of points
In DBSCAN, we distinguish between 3 different kinds of points:
- Core points: A core point is a point that has at least a minimum number of other points (MinPts) in its radius epsilon.
- Border points: A border point is a point that is not a core point, since it doesn't have enough MinPts in its neighborhood, but lies within the radius epsilon of a core point.
- Noise points: All other points that are neither core points nor border points.
In other words, we can classify each point using two criteria:
- whether it has at least MinPts neighbors within radius epsilon
- whether it lies within the radius epsilon of a core point
A nice feature of DBSCAN is that we don't have to specify the number of clusters upfront. However, it requires setting two additional hyperparameters:
- the value for MinPts
- the radius epsilon
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=400, noise=0.1, random_state=1)
plt.scatter(X[:, 0], X[:, 1])
plt.show()
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.2,         # <- radius epsilon
            min_samples=10,  # <- MinPts
            metric='euclidean')
prediction = db.fit_predict(X)
print("Predicted labels:\n", prediction)

plt.scatter(X[:, 0], X[:, 1], c=prediction);
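To connect the fitted model back to the three kinds of points described above, here is a minimal sketch (not in the original notebook): DBSCAN assigns the label -1 to noise points and exposes the indices of the core samples via the fitted estimator's core_sample_indices_ attribute:

n_noise = np.sum(prediction == -1)      # noise points are labeled -1
n_core = len(db.core_sample_indices_)   # samples that satisfy the MinPts criterion
n_border = len(X) - n_core - n_noise    # remaining points are border points

print('core points:  ', n_core)
print('border points:', n_border)
print('noise points: ', n_noise)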
4 Exercise
EXERCISE: Using the following toy dataset, two concentric circles, experiment with the three clustering algorithms that we have used so far: KMeans, AgglomerativeClustering, and DBSCAN. Which clustering algorithm reproduces or discovers the hidden structure (pretending we don't know y) best? Can you explain why this particular algorithm is a good choice while the other two "fail"?
from sklearn.datasets import make_circles

X, y = make_circles(n_samples=1500, factor=.4, noise=.05)
plt.scatter(X[:, 0], X[:, 1], c=y);
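One possible way to start the exercise, a sketch rather than a full solution; the hyperparameter values below (n_clusters=2, eps=0.2, min_samples=10) are assumptions that you may need to tune:

from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

# Candidate models; only DBSCAN infers the number of clusters from the density structure
models = {
    'KMeans': KMeans(n_clusters=2, random_state=1),
    'AgglomerativeClustering': AgglomerativeClustering(n_clusters=2),
    'DBSCAN': DBSCAN(eps=0.2, min_samples=10),
}

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, (name, model) in zip(axes, models.items()):
    labels = model.fit_predict(X)
    ax.scatter(X[:, 0], X[:, 1], c=labels)
    ax.set_title(name)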
5 Misc tools
5.1 scikit-learn
5.1.1 ML models and tools used so far
- from sklearn.datasets import make_blobs
- from sklearn.datasets import make_moons *
- from sklearn.datasets import make_circles *
- from sklearn.datasets import make_regression
- from sklearn.datasets import load_iris
- from sklearn.datasets import load_digits
- from sklearn.datasets import load_breast_cancer
- from sklearn.model_selection import train_test_split
- from sklearn.model_selection import cross_val_score
- from sklearn.model_selection import KFold
- from sklearn.model_selection import StratifiedKFold
- from sklearn.model_selection import ShuffleSplit
- from sklearn.model_selection import GridSearchCV
- from sklearn.model_selection import learning_curve
- from sklearn.feature_extraction import DictVectorizer
- from sklearn.feature_extraction.text import CountVectorizer
- from sklearn.feature_extraction.text import TfidfVectorizer
- from sklearn.feature_selection import SelectPercentile
- from sklearn.feature_selection import f_classif
- from sklearn.feature_selection import f_regression
- from sklearn.feature_selection import chi2
- from sklearn.feature_selection import SelectFromModel
- from sklearn.feature_selection import RFE
- from sklearn.linear_model import LogisticRegression
- from sklearn.linear_model import LinearRegression
- from sklearn.linear_model import Ridge
- from sklearn.linear_model import Lasso
- from sklearn.linear_model import ElasticNet
- from sklearn.neighbors import KNeighborsClassifier
- from sklearn.neighbors import KNeighborsRegressor
- from sklearn.preprocessing import StandardScaler
- from sklearn.decomposition import PCA
- from sklearn.metrics import confusion_matrix, accuracy_score
- from sklearn.metrics import adjusted_rand_score
- from sklearn.metrics.scorer import SCORERS
- from sklearn.metrics import r2_score
- from sklearn.cluster import KMeans
- from sklearn.cluster import MeanShift
- from sklearn.cluster import AffinityPropagation
- from sklearn.cluster import SpectralClustering
- from sklearn.cluster import Ward
- from sklearn.cluster import DBSCAN * # <<< this algorithm has related sources in LIHONGYI's lecture-12
- from sklearn.cluster import AgglomerativeClustering *
- from scipy.cluster.hierarchy import linkage *
- from scipy.cluster.hierarchy import dendrogram *
- from sklearn.metrics import confusion_matrix
- from sklearn.metrics import accuracy_score
- from sklearn.metrics import classification_report
- from sklearn.preprocessing import Imputer
- from sklearn.dummy import DummyClassifier
- from sklearn.pipeline import make_pipeline
- from sklearn.svm import LinearSVC
- from sklearn.svm import SVC
- from sklearn.tree import DecisionTreeRegressor
- from sklearn.ensemble import RandomForestClassifier
- from sklearn.ensemble import GradientBoostingRegressor