scikit-learn Notes 21: Unsupervised Learning – Outlier Detection
Table of Contents
- 1. outlier detection
- 1.1. Types of outlier detection setups
- 1.2. Anomaly detection with density estimation
- 1.2.1. Generating the data set
- 1.2.2. find the maximum likelihood KernelDensity model fit for train_data
- 1.2.3. compute the log-likelihood of each point in train_data
- 1.2.4. find the 95% cut point of all points' log-likelihood
- 1.2.5. create data points by meshgrid
- 1.2.6. draw the contour covering the points whose log-likelihood is in the top 95%
- 1.3. now with One-Class SVM
- 1.3.1. drawbacks of density based estimation
- 1.3.2. one-class svm enter
- 1.3.3. what is outlier in SVM view
- 1.3.4. how to set the parameter 'nu'
- 1.3.5. output detection result
- 1.3.6. [Q] why are the contour calls so different for the same problem
- 1.3.7. what is decision_function(), predict() in svm
- 1.3.8. draw the contour and outliers
- 1.3.9. how to get the outliers
- 1.3.10. how to get the support vectors
- 1.3.11. support vectors vs. outliers
- 1.3.12. EXERCISE
- 1.4. now with Isolation Forest
- 1.4.1. what is isolation forest and why does it work
- 1.4.2. build isolation forest model
- 1.4.3. EXERCISE
- 1.4.4. apply isolation forest on digits data set
- 1.4.4.1. The digits data set consists of 8 x 8 images of digits.
- 1.4.4.2. preview the digits image by imshow
- 1.4.4.3. flatten images before using as training data
- 1.4.4.4. focus on digit '5' images
- 1.4.4.5. find the 5% outliers: build model
- 1.4.4.6. find the 5% outliers: compute the abnormality by decision_function
- 1.4.4.7. find the 5% outliers: find the strongest inliers by argsort
- 1.4.4.8. find the 5% outliers: find the strongest outliers
- 1.4.5. EXERCISE
- 2. Misc tools
- 3. code snippet
- 4. scikit learn guide
- 4.1. 2.8. Density Estimation
- 4.1.1. 2.8.1. Density Estimation: Histograms
- 4.1.2. 2.8.2. Kernel Density Estimation
- 4.1.2.1. mathematical definition
- 4.1.2.2. what is a bandwidth in KDE
- 4.1.2.3. various kinds of Kernels
- 4.1.2.4. distance metrics of KDE
- 4.1.2.5. common applications of KDE: species density in South America
- 4.1.2.6. common applications of KDE: generate digits based on given digits
- 4.1.2.7. Examples
- 4.2. Outlier detection with several methods
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
1 outlier detection
Anomaly detection is a machine learning task that consists in spotting so-called outliers.
“An outlier is an observation in a data set which appears to be inconsistent with the remainder of that set of data.” Johnson 1992
“An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.” Hawkins 1980
1.1 Types of outlier detection setups
- Supervised AD
- Labels available for both normal data and outlier
- Similar to rare class mining / imbalanced classification
- Semi-supervised AD (Novelty Detection)
- Only normal data available to train
- The algorithm learns on normal data only
- Unsupervised AD (Outlier Detection)
- no labels, training set = normal + abnormal data
- Assumption: anomalies are very rare
Let's first get familiar with different unsupervised anomaly detection approaches and algorithms. In order to visualise the output of the different algorithms we consider a toy data set consisting in a two-dimensional Gaussian mixture.
1.2 Anomaly detection with density estimation
1.2.1 Generating the data set
from sklearn.datasets import make_blobs

X, y = make_blobs(n_features=2, centers=3, n_samples=500, random_state=42)
X.shape
(500, 2)
plt.figure()
plt.scatter(X[:, 0], X[:, 1])
plt.show()
1.2.2 find the maximum likelihood KernelDensity model fit for train_data
With the Gaussian kernel, each kernel is a normal pdf (probability density function). By calling fit() we find the maximum-likelihood kernel density function for the dataset 'X'.
from sklearn.neighbors.kde import KernelDensity

# Estimate density with a Gaussian kernel density estimator
kde = KernelDensity(kernel='gaussian')
kde = kde.fit(X)
kde
KernelDensity(algorithm='auto', atol=0, bandwidth=1.0, breadth_first=True, kernel='gaussian', leaf_size=40, metric='euclidean', metric_params=None, rtol=0)
1.2.3 compute the log-likelihood of each point in train_data
kde_X = kde.score_samples(X)
print(kde_X.shape)
# kde_X contains the log-likelihood of every data point in 'X' under this
# KDE (i.e. the estimated pdf). The smaller the value, the rarer the sample.
# https://www.youtube.com/watch?v=ddqny3aZNPY
1.2.4 find the 95% cut point of all points' log-likelihood
The decision function (value function) of this KDE is the log-likelihood of a point, which we get with score_samples(point). That value says something about the occurrence probability of the point under the KDE (the estimated pdf), so we simply keep the points whose score_samples(x) value falls in the largest 95%.
from scipy.stats.mstats import mquantiles

alpha_set = 0.95
# mquantiles computes empirical quantiles: here, the cut point below which
# (1 - alpha_set) = 5% of the log-likelihood values fall
tau_kde = mquantiles(kde_X, 1. - alpha_set)
print(tau_kde)
1.2.5 create data points by meshgrid
n_samples, n_features = X.shape
X_range = np.zeros((n_features, 2))   # shape (2, 2)

# min value of each column (feature), assigned to the first column of X_range:
# [[feature_1_min, _]
#  [feature_2_min, _]]
X_range[:, 0] = np.min(X, axis=0) - 1.

# max value of each column (feature), assigned to the second column of X_range:
# [[feature_1_min, feature_1_max]
#  [feature_2_min, feature_2_max]]
X_range[:, 1] = np.max(X, axis=0) + 1.

h = 0.1  # step size of the mesh

# [min, max] of each feature sets the meshgrid range
x_min, x_max = X_range[0]
y_min, y_max = X_range[1]

# meshgrid produces something like the full cross product of the two arrays:
# meshgrid([1, 2], [3, 4]) => xx = [[1, 2],    yy = [[3, 3],
#                                   [1, 2]]         [4, 4]]
# picking the same position in xx and yy gives every (x, y) pair:
# (1, 3), (2, 3), (1, 4), (2, 4)
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# flatten xx and yy with ravel() to get all x-values and y-values, then
# stack them column-wise (along the last axis), which is what np.c_ does
grid = np.c_[xx.ravel(), yy.ravel()]
1.2.6 draw the contour covering the points whose log-likelihood is in the top 95%
Z_kde = kde.score_samples(grid)   # the decision-function value of each grid point,
                                  # here the log-likelihood under the KDE
print(Z_kde)
Z_kde = Z_kde.reshape(xx.shape)

plt.figure()
c_0 = plt.contour(xx, yy, Z_kde, levels=tau_kde, colors='red', linewidths=3)
plt.clabel(c_0, inline=1, fontsize=15, fmt={tau_kde[0]: str(alpha_set)})
plt.scatter(X[:, 0], X[:, 1])
plt.legend()
plt.show()
1.3 now with One-Class SVM
1.3.1 drawbacks of density based estimation
The problem with density-based estimation is that it tends to become inefficient as the dimensionality of the data increases. This is the so-called curse of dimensionality, which affects density estimation algorithms particularly badly. The one-class SVM algorithm can be used in such cases.
1.3.2 one-class svm enter
from sklearn.svm import OneClassSVM
1.3.3 what is outlier in SVM view
three kinds of outliers from near to far:
- support vector(on the fat margin)
- in the fat margin
- on the wrong side
All these outliers share one property: \(\theta^T \cdot \textbf{x} + b \leq 1\); the points on the wrong side additionally satisfy svm_model.predict(x) == -1.
(ASCII sketch omitted: four small panels illustrating, from near to far, points on the margin, points inside the fat margin, and points on the wrong side of the separating hyperplane.)
near: SVs can also be seen as outliers, because they must lie on the edge of a group, otherwise they'll not be support vectors.
farther: on the right side in the fat margin
farthest: on the wrong side(inside or outside the margin)
If we insist on classifying every point correctly, then because of outliers on the wrong side we may be unable to construct any hyperplane that separates the data.
The blue point circled in black is an outlier: it has strayed from the half-space it should belong to. If we simply ignored it, the original separating hyperplane would still be fine; but because of this outlier the hyperplane gets squeezed and tilted (the black dashed line in the figure) and the margin shrinks accordingly. Worse, if this outlier moved a bit further to the upper right, we would no longer be able to construct a separating hyperplane at all.
1.3.4 how to set the parameter 'nu'
nu = 0.05  # upper bound on the fraction of outliers

ocsvm = OneClassSVM(kernel='rbf', gamma=0.05, nu=nu)
ocsvm.fit(X)
OneClassSVM(cache_size=200, coef0=0.0, degree=3, gamma=0.05, kernel='rbf', max_iter=-1, nu=0.05, random_state=None, shrinking=True, tol=0.001, verbose=False)
1.3.5 output detection result
# Because this is a one-class SVM, the prediction is simply "this class" or "not this class":
#   1: this class (inlier)
#  -1: NOT this class (outlier)
X_outliers = X[ocsvm.predict(X) == -1]
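A quick sanity check that 'nu' really acts as an upper bound on the fraction of outliers (a minimal sketch reusing the ocsvm model and X fitted above):

# the fraction of training points predicted as outliers should be close to,
# and roughly bounded above by, nu = 0.05
print(len(X_outliers), len(X), len(X_outliers) / float(len(X)))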
1.3.6 [Q] why are the contour calls so different for the same problem
c_0 = plt.contour(xx, yy, Z_ocsvm, levels=[0], colors='red', linewidths=3)
c_0 = plt.contour(xx, yy, Z_kde, levels=tau_kde, colors='red', linewidths=3)
Note that in order to draw a z-value on top of x and y we need a function of x and y. This function is:
- ocsvm_model.decision_function(point) for the one-class SVM
- kde_model.score_samples(point) for the KDE
Both play the same role: they supply the z-values (the newly created axis) that contour() draws.
for one-class-SVM:
*we encode the region of interest (the 95% of points that should fall on the positive side of the separating hyperplane) in the SVM model itself, so the contour level (a decision_function value) is simply 0*
we give the SVM model the parameter 'nu' = 0.05:

nu = 0.05  # upper bound on the fraction of outliers
ocsvm = OneClassSVM(kernel='rbf', gamma=0.05, nu=nu)  # set up nu of the SVM

c_0 = plt.contour(xx, yy, Z_ocsvm, levels=[0], colors='red', linewidths=3)
The 'levels' parameter of this SVM contour (a decision_function value):
- 0: the point lies exactly on the separating hyperplane
- < 0: the point is on the wrong side of the separating hyperplane, i.e. it does NOT belong to this class
- > 0: the point is on the right side of the separating hyperplane, i.e. it belongs to this class
The value we assign to the levels parameter of contour() is a decision_function value. Because we use a one-class SVM, all points with a negative decision-function value are predicted as outliers, and because we set the model parameter 'nu' = 0.05, the model only guarantees a decision-function value above 0 for roughly the 95% of points closest to the bulk of the data.
for KDE:
*KDE has no parameter that encodes a region of interest, but the z-value (the log-likelihood of a sample under the KDE, computed by score_samples(point)) itself says something about the occurrence probability, so we order the z-values, keep the largest 95%, and use the 5% cut point of that ordered list as the contour level*
we put the fraction of (largest) z-values we want to keep into 'alpha_set':

alpha_set = 0.95
tau_kde = mquantiles(kde_X, 1. - alpha_set)

we then use the computed cut point tau_kde as the levels:

c_0 = plt.contour(xx, yy, Z_kde, levels=tau_kde, colors='red', linewidths=3)
1.3.7 what is decision_function(), predict() in svm
decision_function:
\(\theta^T \cdot \textbf{X} + b\)
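predict() is essentially the sign of decision_function() (a minimal sketch, assuming the fitted ocsvm and X from above; points lying exactly on the boundary are an edge case):

import numpy as np

scores = ocsvm.decision_function(X).ravel()   # theta^T . x + b, the signed distance
labels = ocsvm.predict(X)                     # +1 inside the learned region, -1 outside
print(np.all((scores > 0) == (labels == 1)))  # expected True, boundary ties aside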
1.3.8 draw the contour and outliers
Z_ocsvm = ocsvm.decision_function(grid)   # signed distance to the separating hyperplane
Z_ocsvm = Z_ocsvm.reshape(xx.shape)

plt.figure()
c_0 = plt.contour(xx, yy, Z_ocsvm, levels=[0], colors='red', linewidths=3)
# note that we pass the contour object to clabel
plt.clabel(c_0, inline=1, fontsize=15, fmt={0: str(alpha_set)})   # draw the label '0.95'
plt.scatter(X[:, 0], X[:, 1])
plt.scatter(X_outliers[:, 0], X_outliers[:, 1], color='red')
plt.legend()
plt.show()
1.3.9 how to get the outliers
- create a one-class SVM object
- fit the model, specifying the threshold fraction 'nu'
- collect the points labeled -1: X[svm_model.predict(X) == -1]
1.3.10 how to get the support vectors
The so-called support vectors of the one-class SVM form the outliers
X_SV = X[ocsvm.support_]   # support_ returns the indices of the support vectors,
                           # so we get them by indexing into the dataset
print(ocsvm.decision_function(X_SV))
print(ocsvm.decision_function(X_outliers))
print(X_outliers.shape)
print(X_SV.shape)

n_SV = len(X_SV)
n_outliers = len(X_outliers)
print('{0:.2f} <= {1:.2f} <= {2:.2f}?'.format(1./n_samples*n_outliers, nu,
                                              1./n_samples*n_SV))
Only the support vectors are involved in the decision function of the One-Class SVM.
- Plot the level sets of the One-Class SVM decision function as we did for the true density.
Emphasize the Support vectors.
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))
axes[0].contourf(xx, yy, Z_ocsvm, 10, cmap=plt.cm.Blues_r)
axes[1].contourf(xx, yy, Z_ocsvm, 10, cmap=plt.cm.Blues_r)
axes[0].contour(xx, yy, Z_ocsvm, levels=[0], colors='red', linewidths=3)
axes[0].scatter(X[:, 0], X[:, 1], s=1.)
axes[1].scatter(X[:, 0], X[:, 1], s=1.)
axes[0].scatter(X_SV[:, 0], X_SV[:, 1], color='orange')   # plot the SVs
axes[1].plot(X_outliers[:, 0], X_outliers[:, 1], 'or')    # plot the outliers
plt.show()
1.3.11 support vectors vs. outliers
- support vectors: the point just lie on the fat margin
- outliers: we call a point an outlier only when both of the following hold (see the check below)
  - its true label is + / - but it falls on the predicted - / + side
  - its decision-function value lies outside the 95% cut point of all data points <<< this note uses this criterion as the evidence for judging whether a point is an outlier
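To see the difference concretely, we can compare the two index sets directly (a sketch reusing the fitted ocsvm and X from above):

import numpy as np

sv_idx = set(ocsvm.support_)                        # indices of the support vectors
out_idx = set(np.where(ocsvm.predict(X) == -1)[0])  # indices of the predicted outliers
print('support vectors:', len(sv_idx))
print('predicted outliers:', len(out_idx))
print('predicted outliers that are also support vectors:', len(sv_idx & out_idx))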
1.3.12 EXERCISE
- Change the gamma parameter and see its influence on the smoothness of the decision function (a possible starting point is sketched below).
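A possible starting point for this exercise (a sketch reusing X, xx, yy and grid from above; the gamma values are arbitrary):

from sklearn.svm import OneClassSVM

plt.figure(figsize=(12, 4))
for k, gamma in enumerate([0.01, 0.05, 0.5]):
    clf = OneClassSVM(kernel='rbf', gamma=gamma, nu=0.05).fit(X)
    Z = clf.decision_function(grid).reshape(xx.shape)
    ax = plt.subplot(1, 3, k + 1)
    ax.contour(xx, yy, Z, levels=[0], colors='red', linewidths=2)  # decision boundary
    ax.scatter(X[:, 0], X[:, 1], s=1.)
    ax.set_title('gamma = {0}'.format(gamma))
plt.show()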
1.4 now with Isolation Forest
1.4.1 what is isolation forest and why does it work
Isolation Forest is a tree-based anomaly detection algorithm. It builds a number of random trees, and the rationale is that an isolated sample should end up alone in a leaf after very few random splits. Isolation Forest derives a score of abnormality from the depth of the tree at which samples end up.
1.4.2 build isolation forest model
from sklearn.ensemble import IsolationForest
iforest = IsolationForest(n_estimators=300, contamination=0.10)
iforest = iforest.fit(X)

Z_iforest = iforest.decision_function(grid)
Z_iforest = Z_iforest.reshape(xx.shape)

plt.figure()
c_0 = plt.contour(xx, yy, Z_iforest, levels=[iforest.threshold_],
                  colors='red', linewidths=3)
plt.clabel(c_0, inline=1, fontsize=15, fmt={iforest.threshold_: str(alpha_set)})
plt.scatter(X[:, 0], X[:, 1], s=1.)
plt.legend()
plt.show()
1.4.3 EXERCISE
EXERCISE: Illustrate graphically the influence of the number of trees on the smoothness of the decision function (one way to start is sketched below).
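One way to approach this exercise (a sketch reusing X, xx, yy and grid from above; the tree counts are arbitrary):

from sklearn.ensemble import IsolationForest

plt.figure(figsize=(12, 4))
for k, n_trees in enumerate([5, 50, 500]):
    clf = IsolationForest(n_estimators=n_trees, contamination=0.10).fit(X)
    Z = clf.decision_function(grid).reshape(xx.shape)
    ax = plt.subplot(1, 3, k + 1)
    ax.contourf(xx, yy, Z, 10, cmap=plt.cm.Blues_r)   # more trees -> smoother surface
    ax.scatter(X[:, 0], X[:, 1], s=1.)
    ax.set_title('n_estimators = {0}'.format(n_trees))
plt.show()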
1.4.4 apply isolation forest on digits data set
We will now apply the IsolationForest algorithm to spot digits written in an unconventional way.
from sklearn.datasets import load_digits

digits = load_digits()
1.4.4.1 The digits data set consists of 8 x 8 images of digits.
images = digits.images
labels = digits.target
images.shape
(1797, 8, 8)
1.4.4.2 preview the digits image by imshow
i = 102
plt.figure(figsize=(2, 2))
plt.title('{0}'.format(labels[i]))
plt.axis('off')
plt.imshow(images[i], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()
1.4.4.3 flatten images before using as training data
To use the images as a training set we need to flatten the images.
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))
data.shape
(1797, 64)
X = data
y = digits.target
X.shape
(1797, 64)
1.4.4.4 focus on digit '5' images
Let's focus on digit 5.
X_5 = X[y == 5]
X_5.shape
(182, 64)
fig, axes = plt.subplots(1, 5, figsize=(10, 4))
for ax, x in zip(axes, X_5[:5]):
    # reshape to an 8x8 matrix; imshow maps each matrix element
    # directly to a pixel at the same location
    img = x.reshape(8, 8)
    ax.imshow(img, cmap=plt.cm.gray_r, interpolation='nearest')
    ax.axis('off')
1.4.4.5 find the 5% outliers: build model
- Let's use IsolationForest to find the top 5% most abnormal images, and plot them!
from sklearn.ensemble import IsolationForest

iforest = IsolationForest(contamination=0.05)
iforest = iforest.fit(X_5)
1.4.4.6 find the 5% outliers: compute the abnormality by decision_function
Compute the level of "abnormality" with iforest.decision_function. The lower, the more abnormal.
iforest_X = iforest.decision_function(X_5)
plt.hist(iforest_X);
1.4.4.7 find the 5% outliers: find the strongest inliers by argsort
Let's plot the strongest inliers
# the lower the score, the more abnormal the sample;
# the most normal samples sit at the tail of argsort
X_strong_inliers = X_5[np.argsort(iforest_X)[-10:]]

fig, axes = plt.subplots(2, 5, figsize=(10, 5))
for i, ax in zip(range(len(X_strong_inliers)), axes.ravel()):
    ax.imshow(X_strong_inliers[i].reshape((8, 8)),
              cmap=plt.cm.gray_r, interpolation='nearest')
    ax.axis('off')
1.4.4.8 find the 5% outliers: find the strongest outliers
Let's plot the strongest outliers
fig, axes = plt.subplots(2, 5, figsize=(10, 5))
X_outliers = X_5[iforest.predict(X_5) == -1]
for i, ax in zip(range(len(X_outliers)), axes.ravel()):
    ax.imshow(X_outliers[i].reshape((8, 8)),
              cmap=plt.cm.gray_r, interpolation='nearest')
    ax.axis('off')
1.4.5 EXERCISE
EXERCISE: Rerun the same analysis with all the other digits (a sketch follows below).
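A sketch of how this exercise could be set up (it simply repeats the digit-5 analysis for every class, reusing X and y from the digits data above):

from sklearn.ensemble import IsolationForest

for digit in range(10):
    X_d = X[y == digit]
    iforest_d = IsolationForest(contamination=0.05).fit(X_d)
    X_out = X_d[iforest_d.predict(X_d) == -1]   # the ~5% flagged as outliers
    fig, axes = plt.subplots(1, 5, figsize=(10, 2))
    for ax, x in zip(axes.ravel(), X_out[:5]):
        ax.imshow(x.reshape((8, 8)), cmap=plt.cm.gray_r, interpolation='nearest')
        ax.axis('off')
    fig.suptitle('some predicted outliers for digit {0}'.format(digit))
plt.show()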
2 Misc tools
2.1 scikit-learn
2.1.1 ML models by now
- from sklearn.datasets import make_blobs
- from sklearn.datasets import make_moons
- from sklearn.datasets import make_circles
- from sklearn.datasets import make_s_curve
- from mpl_toolkits.mplot3d import Axes3D
- from sklearn.datasets import make_regression
- from sklearn.datasets import load_iris
- from sklearn.datasets import load_digits
- from sklearn.datasets import load_breast_cancer
- from sklearn.model_selection import train_test_split
- from sklearn.model_selection import cross_val_score
- from sklearn.model_selection import KFold
- from sklearn.model_selection import StratifiedKFold
- from sklearn.model_selection import ShuffleSplit
- from sklearn.model_selection import GridSearchCV
- from sklearn.model_selection import learning_curve
- from sklearn.feature_extraction import DictVectorizer
- from sklearn.feature_extraction.text import CountVectorizer
- from sklearn.feature_extraction.text import TfidfVectorizer
- from sklearn.feature_selection import SelectPercentile
- from sklearn.feature_selection import f_classif
- from sklearn.feature_selection import f_regression
- from sklearn.feature_selection import chi2
- from sklearn.feature_selection import SelectFromModel
- from sklearn.feature_selection import RFE
- from sklearn.linear_model import LogisticRegression
- from sklearn.linear_model import LinearRegression
- from sklearn.linear_model import Ridge
- from sklearn.linear_model import Lasso
- from sklearn.linear_model import ElasticNet
- from sklearn.neighbors import KNeighborsClassifier
- from sklearn.neighbors import KNeighborsRegressor
- from sklearn.neighbors.kde import KernelDensity *
- from sklearn.preprocessing import StandardScaler
- from sklearn.metrics import confusion_matrix, accuracy_score
- from sklearn.metrics import adjusted_rand_score
- from sklearn.metrics.scorer import SCORERS
- from sklearn.metrics import r2_score
- from sklearn.cluster import KMeans
- from sklearn.cluster import MeanShift
- from sklearn.cluster import DBSCAN # <<< this algorithm has related sources in LIHONGYI's lecture-12
- from sklearn.cluster import AffinityPropagation
- from sklearn.cluster import SpectralClustering
- from sklearn.cluster import Ward
- from sklearn.cluster import AgglomerativeClustering
- from scipy.cluster.hierarchy import linkage
- from scipy.cluster.hierarchy import dendrogram
- from scipy.stats.mstats import mquantiles
- from sklearn.metrics import confusion_matrix
- from sklearn.metrics import accuracy_score
- from sklearn.metrics import adjusted_rand_score
- from sklearn.metrics import classification_report
- from sklearn.preprocessing import Imputer
- from sklearn.dummy import DummyClassifier
- from sklearn.pipeline import make_pipeline
- from sklearn.svm import LinearSVC
- from sklearn.svm import SVC
- from sklearn.svm import OneClassSVM *
- from sklearn.tree import DecisionTreeRegressor
- from sklearn.ensemble import RandomForestClassifier
- from sklearn.ensemble import GradientBoostingRegressor
- from sklearn.ensemble import IsolationForest
- from sklearn.decomposition import PCA
- from sklearn.manifold import TSNE
- from sklearn.manifold import Isomap
2.1.2 OneClassSVM
Unsupervised Outlier Detection.
Estimate the support of a high-dimensional distribution.
The implementation is based on libsvm.
OneClassSVM(kernel='rbf',
            degree=3,
            gamma='auto',      # Kernel coefficient for 'rbf', 'poly' and 'sigmoid'.
                               # If gamma is 'auto' then 1/n_features will be used instead.
            coef0=0.0,         # Independent term in the kernel function. It is only
                               # significant in 'poly' and 'sigmoid'.
            tol=0.001,
            nu=0.5,            # An upper bound on the fraction of training errors and
                               # a lower bound of the fraction of support vectors.
            shrinking=True,    # Whether to use the shrinking heuristic.
            cache_size=200,    # Specify the size of the kernel cache (in MB).
            verbose=False,
            max_iter=-1,
            random_state=None)
attributes:
support_ : array-like, shape = [n_SV].
Indices of support vectors.
support_vectors_ : array-like, shape = [nSV, n_features].
Support vectors.
dual_coef_ : array, shape = [1, n_SV].
Coefficients of the support vectors in the decision function.
coef_ : array, shape = [1, n_features]
Weights assigned to the features (coefficients in the primal problem). This is only available in the case of a linear kernel. coef_ is readonly property derived from dual_coef_ and support_vectors_
intercept_ : array, shape = [1,]
Constant in the decision function.
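For instance, these attributes can be inspected on the model fitted earlier in this note (a sketch; since ocsvm used the RBF kernel, coef_ is not available):

print(ocsvm.support_.shape)          # indices of the support vectors
print(ocsvm.support_vectors_.shape)  # the support vectors themselves
print(ocsvm.dual_coef_.shape)        # their coefficients in the decision function
print(ocsvm.intercept_)              # the constant term in the decision function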
2.2 Numpy
2.2.1 np.c_
- last axis
- upgrade to at least 2-D
- 1's post-pend
Translates slice objects to concatenation along the second axis. This is shorthand for np.r_['-1,2,0', index expression], which is useful because of its common occurrence. In particular, arrays will be stacked along their last axis after being upgraded to at least 2-D, with 1's post-pended to the shape: (3,) -> (3, 1) (column vectors made out of 1-D arrays).
import numpy as np

# (3,) is post-pended with 1 to become (3, 1):
# [1, 2, 3] becomes [[1],   and [4, 5, 6] becomes [[4],
#                    [2],                          [5],
#                    [3]]                          [6]]
# then the two column vectors are stacked along the last axis
np.c_[np.array([1, 2, 3]), np.array([4, 5, 6])]
array([[1, 4], [2, 5], [3, 6]])
# (1, 3) is already 2-D, so it is not upgraded;
# the scalars 0 are treated as shape (1, 1);
# then everything is stacked along the last axis
np.c_[np.array([[1, 2, 3]]), 0, 0, np.array([[4, 5, 6]])]
array([[1, 2, 3, 0, 0, 4, 5, 6]])
2.2.2 np.r_
- first axis
- inputs are concatenated along their first axis (1-D arrays are simply concatenated)
Stacks inputs in units of the first axis, i.e. rows in the common 2-D case.
np.r_[np.array([ [1,2,3] ]), np.array([ [4,5,6] ])]
array([[1, 2, 3], [4, 5, 6]])
2.3 Scipy
2.3.1 scipy.stats.mstats.mquantiles
mquantiles computes empirical quantiles (cut points) of the data passed in a, at the probabilities given by prob.
mquantiles(a,                       # <- array-like, input data
           prob=[0.25, 0.5, 0.75],  # <- list of quantiles to compute
           alphap=0.4,              # <- plotting position parameter
           betap=0.4,               # <- plotting position parameter
           axis=None,               # <- axis along which to perform trimming
           limit=())
Computes empirical quantiles for a data array.
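A tiny example of the call used earlier in this note (the exact values depend on the alphap/betap plotting positions, so they are only approximate):

import numpy as np
from scipy.stats.mstats import mquantiles

a = np.arange(1, 101)                          # 1, 2, ..., 100
print(mquantiles(a, prob=[0.05]))              # roughly the value below which 5% of the data fall
print(mquantiles(a, prob=[0.25, 0.5, 0.75]))   # the default quartile cut points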
2.4 Statistics
2.4.1 TODO quantiles in statistics
Not 1/4, but cut points.
https://www.wikiwand.com/en/Quantile
In statistics and probability quantiles are cut points dividing the range of a probability distribution into contiguous intervals with equal probabilities, or dividing the observations in a sample in the same way.
With 3 quantiles (cut points) Q1, Q2, Q3, you get 4 equal-probability regions, each with 25% probability.
With 2 quantiles you get 3 equal-probability regions, each with 33.3% probability; with 1 quantile you get 2 equal-probability regions, each with 50% probability.
2.5 Matplotlib
2.5.1 module by now
from mpl_toolkits.mplot3d import Axes3D *
2.5.2 plt.contour
Note that what we assign to levels is a value of Z_ocsvm, Z_kde, or Z_iforest, i.e. a value on the scale returned by decision_function (or score_samples) on the grid:
c_0 = plt.contour(xx, yy, Z_ocsvm, levels=[0], colors='red', linewidths=3)
c_0 = plt.contour(xx, yy, Z_kde, levels=tau_kde, colors='red', linewidths=3)
2.5.3 plt.contourf
contour() draws contour lines:

Z_ocsvm = ocsvm.decision_function(grid)   # signed distance to the separating hyperplane
Z_ocsvm = Z_ocsvm.reshape(xx.shape)

plt.figure()
c_0 = plt.contour(xx, yy, Z_ocsvm, levels=[0], colors='red', linewidths=3)
# note that we pass the contour object to clabel
plt.clabel(c_0, inline=1, fontsize=15, fmt={0: str(alpha_set)})   # draw the label '0.95'
plt.scatter(X[:, 0], X[:, 1])
plt.scatter(X_outliers[:, 0], X_outliers[:, 1], color='red')
plt.legend()
plt.show()
contourf() draws filled contours:

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))
axes[0].contourf(xx, yy, Z_ocsvm, 10, cmap=plt.cm.Blues_r)
axes[1].contourf(xx, yy, Z_ocsvm, 10, cmap=plt.cm.Blues_r)
axes[0].scatter(X[:, 0], X[:, 1], s=1.)
axes[1].scatter(X[:, 0], X[:, 1], s=1.)
axes[0].scatter(X_SV[:, 0], X_SV[:, 1], color='orange')   # plot the SVs
axes[1].plot(X_outliers[:, 0], X_outliers[:, 1], 'or')    # plot the outliers
plt.show()
3 code snippet
3.1 how to draw one digit in one subplot
from sklearn.datasets import load_digits

digits = load_digits()
fig, axes = plt.subplots(2, 5, figsize=(10, 5),
                         subplot_kw={'xticks': (), 'yticks': ()})
for ax, img in zip(axes.ravel(), digits.images):
    ax.imshow(img, interpolation="none", cmap="gray")
3.2 how to generate 2-d points with 3 cluster
from sklearn.datasets import make_blobs

X, y = make_blobs(n_features=2, centers=3, n_samples=500, random_state=42)
4 scikit learn guide
4.1 2.8. Density Estimation
https://www.youtube.com/watch?v=gPWsDh59zdo
A histogram puts weight 1 on the observations that fall in the same bin as x, and 0 otherwise;
kernel density estimation instead puts a continuous weight that decreases the further an observation is from the point x.
KDE has the same intuition of being an average: we divide by the bandwidth, which plays roughly the same role for the kernel function as the bin width does for a histogram.
The kernel function weighs each observation according to how far it is from the point x at which we evaluate f(x), i.e. it depends on \((x_i - x)\), and we sum over all observations.
One common kernel function is the Gaussian density, the pdf of the normal distribution.
A lower bandwidth overfits: the estimate is unsmooth and high-variance, with many peaks, so the estimated pdf moves up and down too much.
A larger bandwidth underfits: the estimate is smooth but biased, with only ONE peak.
A reasonable bandwidth gives an estimate similar to the true distribution, with two peaks.
Density estimation walks the line between unsupervised learning, feature engineering, and data modeling. Some of the most popular and useful density estimation techniques are mixture models such as Gaussian Mixtures (sklearn.mixture.GaussianMixture), and neighbor-based approaches such as the kernel density estimate (sklearn.neighbors.KernelDensity). Gaussian Mixtures are discussed more fully in the context of clustering, because the technique is also useful as an unsupervised clustering scheme.
Density estimation is a very simple concept, and most people are already familiar with one common density estimation technique: the histogram.
4.1.1 2.8.1. Density Estimation: Histograms
A histogram is a simple visualization of data where bins are defined, and the number of data points within each bin is tallied. An example of a histogram can be seen in the upper-left panel of the following figure:
A major problem with histograms, however, is that the choice of binning can have a disproportionate effect on the resulting visualization.
Consider the upper-right panel of the above figure. It shows a histogram over the same data, with the bins shifted right. The results of the two visualizations look entirely different, and might lead to different interpretations of the data.
Intuitively, one can also think of a histogram as a stack of blocks, one block per point. By stacking the blocks in the appropriate grid space, we recover the histogram.
But what if, instead of stacking the blocks on a regular grid, we center each block on the point it represents, and sum the total height at each location? This idea leads to the lower-left visualization. It is perhaps not as clean as a histogram, but the fact that the data drive the block locations mean that it is a much better representation of the underlying data.
This visualization is an example of a kernel density estimation, in this case with a top-hat kernel (i.e. a square block at each point). We can recover a smoother distribution by using a smoother kernel. The bottom-right plot shows a Gaussian kernel density estimate, in which each point contributes a Gaussian curve to the total. The result is a smooth density estimate which is derived from the data, and functions as a powerful non-parametric model of the distribution of points.
4.1.2 2.8.2. Kernel Density Estimation
Kernel density estimation in scikit-learn is implemented in the sklearn.neighbors.KernelDensity estimator, which uses the Ball Tree or KD Tree for efficient queries (see Nearest Neighbors for a discussion of these).
Though the above example uses a 1D data set for simplicity, kernel density estimation can be performed in any number of dimensions, though in practice the curse of dimensionality causes its performance to degrade in high dimensions.
In the following figure, 100 points are drawn from a bimodal distribution, and the kernel density estimates are shown for three choices of kernels:
It’s clear how the kernel shape affects the smoothness of the resulting distribution. The scikit-learn kernel density estimator can be used as follows:
>>> from sklearn.neighbors.kde import KernelDensity
>>> import numpy as np
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> kde = KernelDensity(kernel='gaussian', bandwidth=0.2).fit(X)
>>> kde.score_samples(X)
array([-0.41075698, -0.41075698, -0.41076071, -0.41075698, -0.41075698,
       -0.41076071])
4.1.2.1 mathematical definition
Here we have used kernel='gaussian', as seen above. Mathematically, a kernel is a positive function \(K(x; h)\) which is controlled by the bandwidth parameter \(h\). Given this kernel form, the density estimate at a point \(y\) within a group of points \(x_i;\ i = 1 \cdots N\) is given by:
\(\rho_K(y) = \sum_{i=1}^{N} K((y - x_i) / h)\)
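To connect this formula to score_samples, here is a hand-rolled Gaussian KDE compared against KernelDensity (a minimal sketch; it assumes the Gaussian kernel with the Euclidean metric, for which scikit-learn normalizes the density so that it integrates to 1):

import numpy as np
from sklearn.neighbors import KernelDensity

def gaussian_kde_logpdf(Y, X, h):
    # log of  1/(N * (2*pi*h^2)^(d/2)) * sum_i exp(-||y - x_i||^2 / (2*h^2))
    n, d = X.shape
    sq_dists = ((Y[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    log_norm = np.log(n) + 0.5 * d * np.log(2 * np.pi * h ** 2)
    return np.logaddexp.reduce(-sq_dists / (2 * h ** 2), axis=1) - log_norm

rng = np.random.RandomState(0)
X_demo = rng.randn(200, 2)
h = 0.5
kde = KernelDensity(kernel='gaussian', bandwidth=h).fit(X_demo)
print(np.allclose(gaussian_kde_logpdf(X_demo, X_demo, h),
                  kde.score_samples(X_demo)))   # expect True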
4.1.2.2 what is a bandwidth in KDE
The bandwidth here acts as a smoothing parameter, controlling the tradeoff between bias and variance in the result.
- large bandwidth leads to a very smooth (i.e. high-bias) density distribution.
- small bandwidth leads to an unsmooth (i.e. high-variance) density distribution.
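A small illustration of this tradeoff (a sketch on 1-D synthetic data; the bandwidth values are arbitrary):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity

rng = np.random.RandomState(0)
x = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(2, 0.5, 100)])[:, None]
x_plot = np.linspace(-6, 6, 500)[:, None]

plt.figure()
for bw in [0.05, 0.5, 3.0]:   # too small, reasonable, too large
    kde = KernelDensity(kernel='gaussian', bandwidth=bw).fit(x)
    plt.plot(x_plot[:, 0], np.exp(kde.score_samples(x_plot)),
             label='bandwidth = {0}'.format(bw))
plt.legend()
plt.show()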
sklearn.neighbors.KernelDensity implements several common kernel forms, which are shown in the following figure:
4.1.2.3 various kinds of Kernels
The form of these kernels is as follows:
Gaussian kernel (kernel = 'gaussian')
\(K(x; h) \propto \exp(- \frac{x^2}{2h^2} )\)
Tophat kernel (kernel = 'tophat')
\(K(x; h) \propto 1 \text{ if } x < h\)
Epanechnikov kernel (kernel = 'epanechnikov')
\(K(x; h) \propto 1 - \frac{x^2}{h^2}\)
Exponential kernel (kernel = 'exponential')
\(K(x; h) \propto \exp(-x/h)\)
Linear kernel (kernel = 'linear')
\(K(x; h) \propto 1 - x/h \text{ if } x < h\)
Cosine kernel (kernel = 'cosine')
\(K(x; h) \propto \cos(\frac{\pi x}{2h}) \text{ if } x < h\)
4.1.2.4 distance metrics of KDE
The kernel density estimator can be used with any of the valid distance
metrics (see sklearn.neighbors.DistanceMetric
for a list of available
metrics), though the results are properly normalized only for the Euclidean
metric. One particularly useful metric is the Haversine distance which measures
the angular distance between points on a sphere.
4.1.2.5 common applications of KDE: species density in South America
Here is an example of using a kernel density estimate for a visualization of geospatial data, in this case the distribution of observations of two different species on the South American continent:
4.1.2.6 common applications of KDE: generate digits based on given digits
One other useful application of kernel density estimation is to learn a non-parametric generative model of a dataset in order to efficiently draw new samples from this generative model. Here is an example of using this process to create a new set of hand-written digits, using a Gaussian kernel learned on a PCA projection of the data:
The “new” data consists of linear combinations of the input data, with weights probabilistically drawn given the KDE model.
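A minimal sketch of the idea (not the full gallery example; the number of PCA components and the bandwidth are placeholder values, whereas the real example tunes the bandwidth with a grid search):

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.neighbors import KernelDensity

digits = load_digits()
pca = PCA(n_components=15, whiten=False)
data_proj = pca.fit_transform(digits.data)

kde = KernelDensity(kernel='gaussian', bandwidth=3.0).fit(data_proj)
new_proj = kde.sample(44, random_state=0)       # draw new points in the PCA space
new_digits = pca.inverse_transform(new_proj)    # back-project to 8 x 8 = 64 pixel vectors
print(new_digits.shape)                         # (44, 64)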
4.1.2.7 Examples
Simple 1D Kernel Density Estimation: computation of simple kernel density estimates in one dimension.
Kernel Density Estimation: an example of using Kernel Density estimation to learn a generative model of the hand-written digits data, and drawing new samples from this model.
Kernel Density Estimate of Species Distributions: an example of Kernel Density estimation using the Haversine distance metric to visualize geospatial data
4.2 Outlier detection with several methods
4.2.1 4 general methods to do outliers detection
When the amount of contamination is known, this example illustrates four different ways of performing Novelty and Outlier Detection:
- based on a robust estimator of covariance, which is assuming that the data are Gaussian distributed and performs better than the One-Class SVM in that case.
- using the One-Class SVM and its ability to capture the shape of the data set, hence performing better when the data is strongly non-Gaussian, i.e. with two well-separated clusters;
- using the Isolation Forest algorithm, which is based on random forests and hence more adapted to large-dimensional settings, even if it performs quite well in the examples below.
- using the Local Outlier Factor to measure the local deviation of a given data point with respect to its neighbors by comparing their local density.
svm.OneClassSVM(nu=0.95 * outliers_fraction + 0.05, kernel="rbf", gamma=0.1)
EllipticEnvelope(contamination=outliers_fraction)
IsolationForest(max_samples=n_samples, contamination=outliers_fraction, random_state=rng)
LocalOutlierFactor(n_neighbors=35, contamination=outliers_fraction)
4.2.2 performance illustration of 4 general methods
The ground truth about inliers and outliers is given by the points colors while the orange-filled area indicates which points are reported as inliers by each method.
Here, we assume that we know the fraction of outliers in the datasets. Thus rather than using the ‘predict’ method of the objects, we set the threshold on the decision_function to separate out the corresponding fraction.
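The thresholding idea in isolation (a sketch; here clf stands for any of the fitted models with a decision_function, and outliers_fraction is the known contamination):

import numpy as np
from scipy import stats

scores = clf.decision_function(X).ravel()
# cut at the score below which `outliers_fraction` of the training points fall
threshold = stats.scoreatpercentile(scores, 100 * outliers_fraction)
y_pred = np.where(scores > threshold, 1, -1)   # used instead of clf.predict(X)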
4.2.3 code snippet
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import matplotlib.font_manager

from sklearn import svm
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

print(__doc__)

rng = np.random.RandomState(42)

# Example settings
n_samples = 200
outliers_fraction = 0.25
clusters_separation = [0, 1, 2]

# define two outlier detection tools to be compared
classifiers = {
    "One-Class SVM": svm.OneClassSVM(nu=0.95 * outliers_fraction + 0.05,
                                     kernel="rbf", gamma=0.1),
    "Robust covariance": EllipticEnvelope(contamination=outliers_fraction),
    "Isolation Forest": IsolationForest(max_samples=n_samples,
                                        contamination=outliers_fraction,
                                        random_state=rng),
    "Local Outlier Factor": LocalOutlierFactor(
        n_neighbors=35,
        contamination=outliers_fraction)}

# Compare given classifiers under given settings
xx, yy = np.meshgrid(np.linspace(-7, 7, 100), np.linspace(-7, 7, 100))
n_inliers = int((1. - outliers_fraction) * n_samples)
n_outliers = int(outliers_fraction * n_samples)
ground_truth = np.ones(n_samples, dtype=int)
ground_truth[-n_outliers:] = -1

# Fit the problem with varying cluster separation
for i, offset in enumerate(clusters_separation):
    np.random.seed(42)
    # Data generation
    X1 = 0.3 * np.random.randn(n_inliers // 2, 2) - offset
    X2 = 0.3 * np.random.randn(n_inliers // 2, 2) + offset
    X = np.r_[X1, X2]
    # Add outliers
    X = np.r_[X, np.random.uniform(low=-6, high=6, size=(n_outliers, 2))]

    # Fit the model
    plt.figure(figsize=(9, 7))
    for i, (clf_name, clf) in enumerate(classifiers.items()):
        # fit the data and tag outliers
        if clf_name == "Local Outlier Factor":
            y_pred = clf.fit_predict(X)
            scores_pred = clf.negative_outlier_factor_
        else:
            clf.fit(X)
            scores_pred = clf.decision_function(X)
            y_pred = clf.predict(X)
        threshold = stats.scoreatpercentile(scores_pred,
                                            100 * outliers_fraction)
        n_errors = (y_pred != ground_truth).sum()
        # plot the levels lines and the points
        if clf_name == "Local Outlier Factor":
            # decision_function is private for LOF
            Z = clf._decision_function(np.c_[xx.ravel(), yy.ravel()])
        else:
            Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)
        subplot = plt.subplot(2, 2, i + 1)
        subplot.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 7),
                         cmap=plt.cm.Blues_r)
        a = subplot.contour(xx, yy, Z, levels=[threshold],
                            linewidths=2, colors='red')
        subplot.contourf(xx, yy, Z, levels=[threshold, Z.max()],
                         colors='orange')
        b = subplot.scatter(X[:-n_outliers, 0], X[:-n_outliers, 1], c='white',
                            s=20, edgecolor='k')
        c = subplot.scatter(X[-n_outliers:, 0], X[-n_outliers:, 1], c='black',
                            s=20, edgecolor='k')
        subplot.axis('tight')
        subplot.legend(
            [a.collections[0], b, c],
            ['learned decision function', 'true inliers', 'true outliers'],
            prop=matplotlib.font_manager.FontProperties(size=10),
            loc='lower right')
        subplot.set_xlabel("%d. %s (errors: %d)" % (i + 1, clf_name, n_errors))
        subplot.set_xlim((-7, 7))
        subplot.set_ylim((-7, 7))
    plt.subplots_adjust(0.04, 0.1, 0.96, 0.94, 0.1, 0.26)
    plt.suptitle("Outlier detection")
plt.show()