lec 13 RIDGE REGRESSION (aka Tikhonov regularization)
==============
(1) + (A) + l_2 penalized mean loss (d).
  ----------------------------------------------------
  | Find w that minimizes |Xw - y|^2 + lambda |w'|^2 |  = J(w)
  ----------------------------------------------------
where w' is w with the component alpha replaced by 0. X has the fictitious dimension, but we DON'T penalize alpha.
Adds a penalty term to encourage small |w'|--called shrinkage. Why?
- Guarantees positive definite normal eq'ns; always a unique solution.
  [The unregularized normal equations are only positive semidefinite when the
   samples lie on a common hyperplane in feature space, e.g. when d > n.]
  [Show figure comparing +ve semidef bowl w/+ve definite bowl (ridgequad.png)]
  [That was the original motivation, but the next one has become more
   important...]
- Reduces overfitting by reducing variance. Why?
  Imagine: 275238 x_1^2 - 543845 x_1 x_2 + 385832 x_2^2 is the best fit for
  well-spaced samples all with |y| < 1. A small change in x => a big change in y!
  [Given that all the y values in the data are small, it's a sure sign of
   overfitting if tiny changes in x cause huge changes in y.]
  So we penalize large weights.
  [Show plot of isocontours for the two terms of J (ridgeterms.pdf)]
  [In this plot, beta is the least-squares solution. The red ellipses are the
   isocontours of |Xw - y|^2. The isocontours of |w'| are circles centered at
   the origin (blue). The solution lies where a red isocontour just touches a
   blue isocontour tangentially. As lambda increases, the solution will occur
   at a more outer red isocontour and a more inner blue isocontour.]
Setting grad J = 0 gives the normal eq'ns

   (X^T X + lambda I') w = X^T y

where I' is the identity matrix w/bottom right entry set to zero.

Algorithm:  Solve for w.  Return h(z) = w^T z.  (See the code sketch below.)

Increasing lambda => more _regularization_; smaller |w'|.

Given our data model y = Xv + e, where e is noise, the variance of ridge
regression is Var(z^T (X^T X + lambda I')^{-1} X^T e).
As lambda -> infinity, variance -> 0, but bias increases.
[Show graph of bias^2 & variance as lambda increases (ridgebiasvar.pdf)] [So, as usual for the bias-variance trade-off, the test error as a function of lambda is a U-shaped curve, and we find the bottom by validation.]
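[Here is a minimal NumPy sketch of the algorithm above. It is not part of the
 notes' pseudocode; the names ridge_fit and predict are mine, and X is assumed
 to carry the fictitious all-ones feature in its last column.]

    import numpy as np

    def ridge_fit(X, y, lam):
        """Solve (X^T X + lambda I') w = X^T y, where I' is the identity with
        its bottom-right entry zeroed so the bias alpha isn't penalized."""
        d = X.shape[1]
        I_prime = np.eye(d)
        I_prime[-1, -1] = 0.0    # don't penalize alpha (last component of w)
        return np.linalg.solve(X.T @ X + lam * I_prime, X.T @ y)

    def predict(w, Z):
        """h(z) = w^T z for each test point; rows of Z include the ones column."""
        return Z @ w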
lambda is a hyperparameter; tune by (cross-)validation.
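[A plain hold-out validation sweep over lambda, as a rough sketch; it reuses
 the hypothetical ridge_fit/predict helpers above. Proper k-fold
 cross-validation would average this procedure over several splits.]

    import numpy as np

    def tune_lambda(X, y, lambdas, val_fraction=0.2, seed=0):
        """Return the lambda with the lowest squared error on a held-out split."""
        rng = np.random.default_rng(seed)
        idx = rng.permutation(X.shape[0])
        n_val = int(val_fraction * len(idx))
        val, train = idx[:n_val], idx[n_val:]
        best_lam, best_err = None, np.inf
        for lam in lambdas:
            w = ridge_fit(X[train], y[train], lam)
            err = np.mean((predict(w, X[val]) - y[val]) ** 2)
            if err < best_err:
                best_lam, best_err = lam, err
        return best_lam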
Ideally, features should be "normalized" to have the same variance. Alternative: use an asymmetric penalty by replacing I' w/another diagonal matrix.
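[One way to do that normalization, as a sketch; it assumes the last column of
 X is the fictitious ones feature and leaves it alone.]

    import numpy as np

    def standardize_features(X):
        """Scale each real feature column to unit variance (skip the ones column)."""
        X = X.astype(float)          # work on a float copy
        std = X[:, :-1].std(axis=0)
        std[std == 0] = 1.0          # guard against constant features
        X[:, :-1] /= std
        return X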
Bayesian justification for ridge reg.
Assign a prior probability on w': a Gaussian centered at 0.

Posterior prob. ~ likelihood of w x prior P(w')      <= Gaussian PDFs

Maximize the log posterior:

   ln posterior = ln likelihood + ln P(w')
                = - const |Xw - y|^2 - const |w'|^2 - const

This method (using the likelihood, but maximizing the posterior) is called
_maximum_a_posteriori_ (MAP).
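[For completeness, here is that computation with the Gaussian densities
 written out in LaTeX; the noise variance sigma^2 and prior variance sigma'^2
 aren't named above, so the identification of lambda below is an added step.]

    \begin{align*}
    \ln\bigl(\text{likelihood} \times P(w')\bigr)
      &= \sum_{i=1}^n \ln e^{-(X_i \cdot w - y_i)^2 / (2\sigma^2)}
         + \ln e^{-\|w'\|^2 / (2\sigma'^2)} + \text{const} \\
      &= -\frac{1}{2\sigma^2}\,\|Xw - y\|^2
         - \frac{1}{2\sigma'^2}\,\|w'\|^2 + \text{const}.
    \end{align*}

So maximizing the posterior is the same as minimizing
|Xw - y|^2 + lambda |w'|^2 with lambda = sigma^2 / sigma'^2.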
lec 13.2 KERNELS
=====
Recall: with d input features, degree-p polynomials blow up to O(d^p) features.
[When d is large, this gets computationally intractable really fast.
As I said in Lecture 4, if you have 100 features per sample and you want to
use degree-4 predictor functions, then each lifted feature vector has a length
on the order of 100 million.]
Today we use magic to work with those features without ever computing them!
Observation: In many learning algs,
- the weights can be written as a linear combo of input samples, &
- we can use inner products of Phi(x)'s only => don't need to compute Phi(x)!
Suppose w = X^T a = sum_{i=1}^n a_i X_i   for some a in R^n.
Substitute this identity into the alg. and optimize the n _dual_weights_ a (aka _dual_parameters_) instead of the d+1 _primal_weights_ w.
Kernel Ridge Regression
Center X and y so their means are zero, i.e. X_i <- X_i - mu_X, y_i <- y_i - mu_y. This lets us replace I' with I in the normal equations:
   (X^T X + lambda I) w = X^T y
[When we center X and y, the expected value of the intercept w_{d+1} = alpha is zero. The actual value won't usually be exactly zero, but it will be close enough that we won't do much harm by penalizing the intercept.]
=> w = (1/lambda) (X^T y - X^T X w) = X^T a   where a = (1/lambda) (y - X w)

This shows that w is a linear combo of the samples:  w = X^T a.

To compute a:

   lambda a = y - X X^T a   =>   a = (X X^T + lambda I)^{-1} y

a is the _dual_solution_; it solves the _dual_form_ of ridge regression:

  ------------------------------------------------------------
  | Find a that minimizes |X X^T a - y|^2 + lambda |X^T a|^2 |
  ------------------------------------------------------------
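[A quick numeric sanity check of this equivalence on centered data with no
 ones column (so I' = I); it is a sketch with made-up data, not part of the
 notes.]

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, lam = 20, 5, 0.1
    X = rng.standard_normal((n, d))
    y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
    X = X - X.mean(axis=0)           # center so I' can be replaced by I
    y = y - y.mean()

    w_primal    = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)  # d-by-d system
    a_dual      = np.linalg.solve(X @ X.T + lam * np.eye(n), y)        # n-by-n system
    w_from_dual = X.T @ a_dual                                         # w = X^T a

    print(np.allclose(w_primal, w_from_dual))                          # True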
Regression fn is
   h(z) = w^T z = a^T X z = sum_{i=1}^n a_i (X_i^T z)   <= weighted sum of inner products

Let k(x, z) = x^T z be the _kernel_fn_.
[Later, we'll replace x and z with Phi(x) and Phi(z), and that's where the
 magic will happen.]

Let K = X X^T be the n-by-n _kernel_matrix_.   Note K_ij = k(X_i, X_j).

K is singular if n > d. [And sometimes otherwise.] In that case, there is no
solution if lambda = 0. [Then the penalty term is necessary.]
Summary of kernel ridge reg.:
   K_ij = k(X_i, X_j)   for all i, j          <= O(n^2 d) time
   Solve (K + lambda I) a = y for a           <= O(n^3) time
   for each test pt z:
      h(z) = sum_{i=1}^n a_i k(X_i, z)        <= O(n d) time

Does not use X directly!  Only k.  [This will become important soon.]
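[A direct NumPy transcription of this summary, as a sketch; the function names
 are mine, the data are assumed centered as above, and the linear kernel
 k(x, z) = x^T z is just a placeholder until the kernel trick below.]

    import numpy as np

    def kernel_ridge_fit(X, y, lam, k):
        """Build K_ij = k(X_i, X_j) and solve (K + lambda I) a = y for the
        dual weights a.  O(n^2 d) to build K, O(n^3) to solve."""
        n = X.shape[0]
        K = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])
        return np.linalg.solve(K + lam * np.eye(n), y)

    def kernel_ridge_predict(X, a, z, k):
        """h(z) = sum_i a_i k(X_i, z); O(n d) per test point."""
        return sum(a[i] * k(X[i], z) for i in range(X.shape[0]))

    linear_kernel = lambda x, z: x @ z      # k(x, z) = x^T z

[Note that only k touches the features, so swapping in the polynomial kernel
 of the next section is a one-line change, e.g. k = lambda x, z: (x @ z + 1) ** p.]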
Dual:    solve an n-by-n linear system
Primal:  solve a d-by-d linear system
[So we prefer the dual form when d > n. If we add new features like polynomial terms, the primal d goes up, but the d in the kernel running times doesn't.]
The Kernel Trick (aka kernelization)
[Here's the magic part. We will see that we can compute a polynomial kernel that involves many monomial terms without actually computing those terms.]
The _polynomial_kernel_ of degree p is  k(x, z) = (x^T z + 1)^p

Theorem:  (x^T z + 1)^p = Phi(x)^T Phi(z)
          where Phi(x) contains every monomial in x of degree 0...p.

Example for d = 2, p = 2:

   (x^T z + 1)^2 = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 z_1 x_2 z_2
                   + 2 x_1 z_1 + 2 x_2 z_2 + 1
                 = [x_1^2  x_2^2  sqrt(2) x_1 x_2  sqrt(2) x_1  sqrt(2) x_2  1]
                   [z_1^2  z_2^2  sqrt(2) z_1 z_2  sqrt(2) z_1  sqrt(2) z_2  1]^T
                 = Phi(x)^T Phi(z)      [This is how we're defining Phi.]
[Notice the factors of sqrt(2). If you try a higher polynomial degree p, you'll see a wider variety of these constants. We have no control of the constants used in Phi(x), but they don't matter very much, because the primal weights w will scale themselves to compensate. Even though we aren't directly computing the primal weights...they still implicitly exist.]
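[A quick numeric check of the d = 2, p = 2 example, as a sketch; phi below is
 the Phi defined above, and the test points are made up.]

    import numpy as np

    def phi(x):
        """Explicit degree-2 feature map for d = 2 (the Phi above)."""
        x1, x2 = x
        r2 = np.sqrt(2.0)
        return np.array([x1**2, x2**2, r2*x1*x2, r2*x1, r2*x2, 1.0])

    x = np.array([0.3, -1.2])
    z = np.array([2.0, 0.5])
    print(np.isclose((x @ z + 1.0) ** 2, phi(x) @ phi(z)))   # True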
Key win:  compute Phi(x)^T Phi(z) in O(d) time instead of O(d^p), even though
Phi(x) has length O(d^p).

Kernel ridge regression replaces X_i with Phi(X_i):

   Let k(x, z) = Phi(x)^T Phi(z)
[I think what we've done here is pretty mind-blowing: we can now do polynomial regression with an exponentially long, high-order polynomial in less time than it would take even to compute the final polynomial. The running time is sublinear, actually much smaller than linear, in the size of the Phi vectors.]