
lec 04 Soft-SVM and Features

Table of Contents

lec 04 Soft-SVM and Features

Reference: Tsinghua-Datamining, chap. 5 SVM, page 15

SOFT-MARGIN SUPPORT VECTOR MACHINES (SVMs)

=================================

A soft-margin SVM fixes two problems of the hard-margin SVM:

  • Hard-margin SVMs fail if the data are not linearly separable.
  • Hard-margin SVMs are sensitive to outliers.

screenshot_2017-05-03_15-06-55.png [Show example where one outlier moves the decision boundary a lot.]

Idea: Allow some samples to violate the margin, with slack_variables. Modified constraint for sample i:

screenshot_2017-05-03_15-07-13.png

y_i (X_i . w + alpha) >= 1 - xi_i

[Observe that the only difference between these constraints and the hard-margin constraints we saw last lecture is the extra slack term xi_i.] [We also impose new constraints, that the slack variables are never negative.]

screenshot_2017-05-03_15-07-28.png

xi_i >= 0

[This inequality ensures that all samples that don't violate the margin are treated the same; they all have xi_i = 0. Sample i has nonzero xi_i if and only if it violates the margin.]

screenshot_2017-05-03_15-07-53.png [Show figure of margin where some samples have slack. For each violating point, the slack distance is xi*_i = xi_i / |w|.]

To prevent abuse of slack, we add a loss_term to the objective function.

Optimization problem:

screenshot_2017-05-03_15-08-25.png

----------------------------------------------------------------------
|                                                  n                 |
| Find w, alpha, and xi_i that minimize |w|^2 + C sum xi_i           |
|                                                 i=1                |
| subject to y_i (X_i . w + alpha) >= 1 - xi_i   for all i in [1, n] |
|            xi_i >= 0                           for all i in [1, n] |
----------------------------------------------------------------------

...a quadratic program in d + n + 1 dimensions and 2n constraints. [It's a quadratic program because its objective function is quadratic and its constraints are linear inequalities.]
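A minimal sketch of this quadratic program, assuming the cvxpy library is available; the function name soft_margin_svm, the n-by-d sample matrix X, and the +/-1 label vector y are illustrative, not from the notes.

import cvxpy as cp

def soft_margin_svm(X, y, C=1.0):
    n, d = X.shape
    w = cp.Variable(d)        # normal vector of the decision boundary
    alpha = cp.Variable()     # bias term, alpha in the notes
    xi = cp.Variable(n)       # one slack variable per sample
    objective = cp.Minimize(cp.sum_squares(w) + C * cp.sum(xi))
    constraints = [cp.multiply(y, X @ w + alpha) >= 1 - xi,   # margin constraints
                   xi >= 0]                                   # slack is never negative
    cp.Problem(objective, constraints).solve()
    return w.value, alpha.value, xi.value

Note that the variables number d + n + 1 and the constraints 2n, matching the count above.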

C > 0 is a scalar regularization_hyperparameter that trades off:

          |------------------------|------------------------------------------|
          | small C                | big C                                    |
----------|------------------------|------------------------------------------|
 desire   | maximize margin 1/|w|  | keep most slack variables zero or small  |
----------|------------------------|------------------------------------------|
 danger   | underfitting           | overfitting                              |
          | (misclassifies much    | (awesome training, awful test)           |
          |  training data)        |                                          |
----------|------------------------|------------------------------------------|
 outliers | less sensitive         | very sensitive                           |
----------|------------------------|------------------------------------------|
 boundary | more "flat"            | more sinuous                             |
-------------------------------------------------------------------------------

[The last row only applies to nonlinear decision boundaries, which we'll discuss next. Obviously, a linear decision boundary can't be sinuous.]

Use validation to choose C.
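A hedged sketch of that validation step, using scikit-learn's LinearSVC and cross_val_score; the candidate grid of C values and the 5-fold split are illustrative choices, not from the notes.

from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def choose_C(X, y, candidates=(0.01, 0.1, 1.0, 10.0, 100.0)):
    # Score each candidate C by average cross-validation accuracy.
    scores = {C: cross_val_score(LinearSVC(C=C), X, y, cv=5).mean()
              for C in candidates}
    return max(scores, key=scores.get)   # the C with the best validation accuracy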

screenshot_2017-05-03_15-09-00.png [Show examples of how the slab varies with C. Smallest C upper left; largest C lower right.] [One way to think about slack is to pretend that slack is money we can spend to buy permission for a sample to violate the margin. The further a sample penetrates the margin, the bigger the fine you have to pay. We want to make the margin as big as possible, but we also want to spend as little money as possible. If the regularization parameter C is small, it means we're willing to spend lots of money on violations so we can get a bigger margin. If C is big, it means we're cheap and we want to prevent violations, even though we'll get a narrower margin. If C is infinite, we're back to a hard-margin SVM.]

FEATURES

======

Q: How to do nonlinear decision boundaries?

A: Make nonlinear features that lift samples into a higher-dimensional space. High-d linear classifier -> low-d nonlinear classifier.

[Features work with all classifiers, including perceptrons, hard-margin SVMs, and soft-margin SVMs.]

Example 1: The parabolic_lifting_map


screenshot_2017-05-03_15-10-37.png

Phi(x): R^d -> R^{d+1}

         [   x   ]
Phi(x) = [       ]   <--- lifts x onto the paraboloid x_{d+1} = |x|^2
         [ |x|^2 ]

[We've added a new feature, |x|^2. Even though the new feature is just a function of other input features, it gives our linear classifier more power.]

Find a linear classifier in Phi-space. It induces a sphere classifier in x-space.

screenshot_2017-05-03_15-10-59.png

        ^    X  X
        |         X
        |  X  O     X
        | X   O O  X                [Draw paraboloid, lifted samples, and
        |X   O  OO                   plane decision boundary in 3D here.]
        |  X O O  X X
        |XX X    X
        |     X   X
        O----------->
[Draw circle decision boundary]
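A small sketch of this lifting in code, under the assumption that scikit-learn's LinearSVC is available; the synthetic circular data are illustrative.

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where(np.sum(X**2, axis=1) < 1.0, 1, -1)    # class +1 inside the unit circle

# Parabolic lifting map: Phi(x) = (x, |x|^2).
Phi = np.hstack([X, np.sum(X**2, axis=1, keepdims=True)])

clf = LinearSVC(C=10.0).fit(Phi, y)
print("training accuracy in Phi-space:", clf.score(Phi, y))   # should be ~1.0: linearly separable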

Vocabulary: ellipsoid, sphere, hyperboloid, paraboloid.

Theorem:  Phi(X_1), ..., Phi(X_n) are linearly separable iff X_1, ..., X_n are
          separable by a hypersphere.
          (Possibly a _degenerate_ hypersphere = hyperplane.)

screenshot_2017-05-03_11-29-29.png

From this figure you can see that a hyperplane is a special (degenerate) sphere; in the same way, a line can be understood as a special circle.

Proof: Consider hypersphere in R^d w/center c & radius rho.

screenshot_2017-05-03_15-11-45.png

Points inside:  |x - c|^2 < rho^2
                |x|^2 - 2c . x + |c|^2 < rho^2

                           [   x   ]
                [-2c^T  1] [       ] < rho^2 - |c|^2
                           [ |x|^2 ]

        normal vector ^        ^ Phi(x)

Hence points inside sphere -> same side of hyperplane in Phi-space. (Reverse implication works too.)

[Although the math above doesn't expose it, hyperplane separators are a special case of hypersphere separators, so hypersphere classifiers can do everything linear classifiers can do and more. If you take a sphere and increase its radius to infinity while making it pass through some point, in the limit you get a plane; so you can think of a plane as a degenerate sphere. With the parabolic lifting map, a hyperplane in x-space corresponds to a hyperplane in Phi-space that is parallel to the x_{d+1}-axis.]
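A quick numeric check of the proof (an illustrative sketch, not part of the notes): a point x lies inside the sphere exactly when the Phi-space hyperplane with normal (-2c, 1) puts Phi(x) below the threshold rho^2 - |c|^2.

import numpy as np

rng = np.random.default_rng(1)
c, rho = rng.normal(size=3), 1.5
normal = np.append(-2 * c, 1.0)                    # normal vector of the Phi-space hyperplane

for _ in range(1000):
    x = rng.normal(size=3)
    phi = np.append(x, x @ x)                      # Phi(x) = (x, |x|^2)
    inside_sphere = np.sum((x - c) ** 2) < rho ** 2
    below_plane = normal @ phi < rho ** 2 - c @ c
    assert inside_sphere == below_plane            # the two conditions always agree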

Example 2: Axis-aligned ellipsoid/hyperboloid decision boundaries


[Draw examples of axis-aligned ellipses & hyperbola.]

In 3D, these have the formula

screenshot_2017-05-03_15-12-25.png A x_1^2 + B x_2^2 + C x_3^2 + D x_1 + E x_2 + F x_3 = -G

[Here, the capital letters are scalars, not matrices.]

screenshot_2017-05-03_15-12-54.png

Phi(x): R^d -> R^{2d}

Phi(x) = [ x_1^2  ...  x_d^2  x_1  ...  x_d ]^T

[We've turned d input features into 2d features for our linear classifier. If the samples are separable by an axis-aligned ellipsoid or hyperboloid, per the formula above, then the samples lifted to Phi-space are separable by a hyperplane whose normal vector is (A, B, C, D, E, F).]
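A minimal sketch of this 2d-dimensional lifting; the helper name lift_axis_aligned is illustrative.

import numpy as np

def lift_axis_aligned(X):
    """Map each row x to [x_1^2, ..., x_d^2, x_1, ..., x_d]."""
    return np.hstack([X ** 2, X])

# e.g. X of shape (n, 3) becomes shape (n, 6); a linear classifier on the lifted
# features has a weight vector (A, B, C, D, E, F) matching the formula above.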

Example 3: Ellipsoid/hyperboloid


[Draw example of non-axis-aligned ellipse.]

General formula: [for an ellipsoid or hyperboloid]

screenshot_2017-05-03_15-13-24.png

A x_1^2 + B x_2^2 + C x_3^2 + D x_1 x_2 + E x_2 x_3 + F x_3 x_1 +
G x_1 + H x_2 + I x_3 = -J

Phi(x): R^d -> R^{(d^2+3d)/2}

[The isosurface defined by this equation is called a quadric. In the special case of 2D, it's also known as a conic_section.]

[You'll notice that there is a quadratic blowup in the number of features, because every pair of input features creates a new feature in Phi-space. If the dimension is large, these feature vectors are getting huge, and that's going to impose a serious computational cost. But it might be worth it to find good classifiers for data that aren't linearly separable.]
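A sketch of the degree-2 lifting, with the illustrative helper name lift_quadric; it produces the (d^2 + 3d)/2 features: all squares, all pairwise products, and the original coordinates.

import numpy as np
from itertools import combinations

def lift_quadric(x):
    x = np.asarray(x, dtype=float)
    squares = x ** 2                                                      # d features
    cross = np.array([x[i] * x[j]
                      for i, j in combinations(range(len(x)), 2)])        # d(d-1)/2 features
    return np.concatenate([squares, cross, x])                           # plus d linear terms

# For d = 3 this yields 3 + 3 + 3 = 9 features, matching (d^2 + 3d)/2 = 9.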

Example 4: Predictor fn is degree-p polynomial


screenshot_2017-05-03_15-14-01.png E.g. a cubic in R^2:

Phi(x) = [ x_1^3  x_1^2 x_2  x_1 x_2^2  x_2^3  x_1^2  x_1 x_2  x_2^2  x_1  x_2 ]^T

Phi(x): R^d -> R^{O(d^p)}

[Now we're really blowing up the number of features! If you have, say, 100 features per sample and you want to use degree-4 predictor functions, then each lifted feature vector has a length on the order of 100 million, and your learning algorithm will take approximately forever to run.] [However, later in the semester we will learn an extremely clever trick that allows us to work with these huge feature vectors very quickly, without ever computing them. It's called "kernelization" or "the kernel trick". So even though it appears now that working with degree-4 polynomials is computationally infeasible, it can actually be done very quickly.]
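A hedged sketch of the degree-p lifting using scikit-learn's PolynomialFeatures (a standard scikit-learn API); the degree and the tiny input are illustrative.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0]])                    # one sample in R^2
lift = PolynomialFeatures(degree=3, include_bias=False)   # monomials of degree 1..3, no constant
print(lift.fit_transform(X).shape)            # (1, 9): the nine cubic features listed above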

screenshot_2017-05-03_15-14-30.png [Show SVMs with degree 1/2/5 predictor functions. Observe that the margin tends to get wider as the degree increases.]

[Increasing the degree like this accomplishes two things.

  • First, the data might become linearly separable when you lift them to a high enough degree, even if the original data are not linearly separable.
  • Second, raising the degree can increase the margin, so you might get a more robust separator.

However, if you raise the degree too high, you will overfit the data.]

screenshot_2017-05-03_15-15-19.png [Show training vs. test error for degree 1/2/5 predictor functions. In this example, a degree-2 predictor gives the smallest test error.]

[Sometimes you should search for the ideal degree--not too small, not too big. It's a balancing act between underfitting and overfitting. The degree is an example of a hyperparameter that can be optimized by validation.]

[If you're using both polynomial features and a soft-margin SVM, now you have two hyperparameters: the degree and C. Generally, the optimal C will be different for every polynomial degree, so when you change degree, you have to run validation again to find the best C for that degree.]
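A sketch of tuning both hyperparameters together with scikit-learn's Pipeline and GridSearchCV (standard scikit-learn APIs); the grids of degrees and C values are illustrative choices.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([("lift", PolynomialFeatures(include_bias=False)),
                 ("svm", LinearSVC())])
grid = {"lift__degree": [1, 2, 3, 4, 5],
        "svm__C": [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(pipe, grid, cv=5)       # validates every (degree, C) pair
# search.fit(X_train, y_train); search.best_params_ then holds the chosen degree and C.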

[So far I've talked only about polynomial features. But features can get much more interesting than polynomials, and they can be tailored to fit a specific problem. Let's consider a type of feature you might use if you wanted to implement, say, a handwriting recognition algorithm.]

Example 5: Edge detection


Edge_detector: algorithm for approximating grayscale/color gradients in an image, e.g.

  • tap filter
  • Sobel filter
  • oriented Gaussian derivative filter

[Images are discrete, not continuous fields, so approximation is necessary.]

[See "Image Derivatives" on Wikipedia.]

screenshot_2017-05-03_15-16-19.png Collect line orientations in local histograms (each with 12 orientation bins per region); use the histograms as features (instead of raw pixels).

[Show picture of image histograms.]
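A rough sketch of one such histogram feature, assuming a grayscale image patch given as a 2D numpy array; the Sobel filters and the 12 bins follow the description above, but Maji & Malik's actual scheme differs in detail (spatial pyramids, normalization, etc.).

import numpy as np
from scipy.signal import convolve2d

def orientation_histogram(patch, bins=12):
    sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    gx = convolve2d(patch, sobel_x,   mode="same", boundary="symm")   # horizontal gradient
    gy = convolve2d(patch, sobel_x.T, mode="same", boundary="symm")   # vertical gradient
    angles = np.arctan2(gy, gx)                          # edge orientation in (-pi, pi]
    magnitude = np.hypot(gx, gy)                         # edge strength as the histogram weight
    hist, _ = np.histogram(angles, bins=bins, range=(-np.pi, np.pi), weights=magnitude)
    return hist / (hist.sum() + 1e-12)                   # normalized 12-bin feature vector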

Paper: Maji & Malik, 2009.

[If you want to, optionally, use these features in Homework 1 and try to win the Kaggle competition, this paper is a good online resource.]

[When they use a linear SVM on the raw pixels, Maji & Malik get an error rate of 15.38% on the test set. When they use a linear SVM on the histogram features, the error rate goes down to 2.64%.]

[Many applications can be improved by designing application-specific features. There's no limit but your own creativity and ability to discern the structure hidden in your application.]

Summary of lec 4

  • First, a linear equation gives us a hyperplane.
  • Next, lifting the samples (adding a fictitious dimension) turns every problem into one where the hyperplane passes through the origin.
  • Next, duality turns every problem into an optimization over w-space.
  • Next, the y_i trick gives the constraint function of the linearly separable problem.
  • Next, summing the error over all samples gives gradient descent (GD).
  • Next, a single sample's error plus a convex loss function gives stochastic gradient descent (SGD).
  • Next, hard-margin SVMs (SVMh) fix the biased boundaries that GD/SGD can return.
  • Next, hyperparameters (C for soft-margin SVMs, k for k-nearest neighbors, degree p for polynomial features) control under/overfitting.
  • Next, the ±1 and y_i trick gives the constraint function of the SVMh problem.
  • Next, introducing the slack variable ξ (xi) gives the soft-margin SVM (SVMs), which handles the outliers that break SVMh.
  • Next, ξ and C give the constraint function and objective of the SVMs problem.
  • Next, lifting features to a higher dimension turns a linearly inseparable problem into a higher-degree, linearly separable one.
  • Finally, probabilistic classification handles samples with identical coordinates but different classes.

Date: 2017-06-22 五

Author: yiddi

Created: 2018-08-12 日 14:57
