lec 11.1 WEIGHTED LEAST-SQUARES REGRESSION
===============================
linear regression fn (1) + squared loss fn (A) + cost fn (c).
[The idea of weighted least-squares is that some samples might be more trusted than others, or there might be certain samples you want to fit particularly well. So you assign those more trusted samples a higher weight. If you suspect some samples of being outliers, you can assign them a lower weight.]
Assign each sample a weight omega_i; collect them in an n x n diagonal matrix Omega.
Greater omega_i -> work harder to minimize |ŷ_i - y_i|.     Recall:  ŷ = Xw

  Find w that minimizes  (Xw - y)^T Omega (Xw - y)  =  sum_{i=1}^n omega_i ((Xw)_i - y_i)^2

Solve for w in the normal equations:  X^T Omega X w  =  X^T Omega y

[Once again, you can interpret it as a projection:]

Note:  Omega^{1/2} ŷ  is the orthogonal projection of  Omega^{1/2} y
       onto the subspace spanned by the columns of  Omega^{1/2} X.
[If you stretch the n-dimensional space by applying the linear transformation Omega^1/2, it's an orthogonal projection in that stretched space.]
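[Here is a minimal numpy sketch of the normal-equations solve above; the
function name and the toy data are illustrative, not from the lecture.]

import numpy as np

def weighted_least_squares(X, y, omega):
    """Solve the normal equations  X^T Omega X w = X^T Omega y,
    where Omega = diag(omega), without an explicit matrix inverse."""
    Omega = np.diag(omega)
    return np.linalg.solve(X.T @ Omega @ X, X.T @ Omega @ y)

# toy usage: downweight the last sample, a suspected outlier
X = np.array([[1.0, 1.0],
              [2.0, 1.0],
              [3.0, 1.0],
              [4.0, 1.0]])                 # second column = fictitious dimension
y = np.array([1.1, 1.9, 3.2, 10.0])
w = weighted_least_squares(X, y, np.array([1.0, 1.0, 1.0, 0.1]))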
lec 11.2 NEWTON'S METHOD
=============
Iterative optimization method for smooth fn J(w).
Often much faster than gradient descent. [We'll use for logistic regression.]
Idea: You're at point v. Approximate J(w) near v by quadratic fn. Jump to its unique critical pt. Repeat until bored.
[Show method in 1D (newton1.pdf--newton3.pdf); in 2D (newton2D.png).]
Taylor series about v:

  grad J(w) = grad J(v) + (grad^2 J(v)) (w - v) + O(|w - v|^2)

where grad^2 J(v) is the _Hessian_matrix_ of J(w) at v.

Find critical pt w by setting grad J(w) = 0:

  w = v - (grad^2 J(v))^{-1} grad J(v)
[This is an iterative update rule you can repeat until it converges to a solution. As usual, we don't really want to compute a matrix inverse.]
Newton's method:
  pick starting point w
  repeat until convergence
    e <- solution to linear system  (grad^2 J(w)) e = -grad J(w)
    w <- w + e
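[A small numpy sketch of this update rule; the function name and the quadratic
test objective are mine, not from the lecture.]

import numpy as np

def newton(grad, hess, w, tol=1e-8, max_iter=100):
    """Newton's method: solve (grad^2 J(w)) e = -grad J(w) and step
    w <- w + e, stopping when the steps become tiny."""
    for _ in range(max_iter):
        e = np.linalg.solve(hess(w), -grad(w))   # linear solve, not a matrix inverse
        w = w + e
        if np.linalg.norm(e) < tol:
            break
    return w

# usage on J(w) = 0.5 w^T A w - b^T w; J is quadratic, so one step suffices
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])
w_min = newton(lambda w: A @ w - b, lambda w: A, np.zeros(2))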
Warning: Doesn't know difference between minima, maxima, saddle points. Starting point must be "close enough" to desired solution.
[If the objective function J is actually quadratic, Newton's method needs only
one step to find the correct answer.  The closer J is to quadratic, the faster
Newton's method tends to converge.]

[Newton's method is superior to blind gradient descent for some optimization
problems for several reasons.  First, it tries to find the right step length to
reach the minimum, rather than just walking an arbitrary distance downhill.
Second, rather than follow the direction of steepest descent, it tries to
optimize all directions at once.]

[Nevertheless, it has some major disadvantages.  The biggest one is that
computing the Hessian can be quite expensive, and it has to be recomputed every
iteration.  It can work well for low-dimensional feature spaces, but you would
never use it for a neural net, because there are too many weights.  Newton's
method also doesn't work for most nonsmooth functions.  It particularly fails
for the perceptron risk function, whose Hessian is zero wherever the Hessian is
defined.]
lec 11.3 LOGISTIC REGRESSION (continued)
==================
Recall:  s'(gamma) = s(gamma) (1 - s(gamma));     s_i = s(X_i . w)

  grad J(w) = sum_{i=1}^n (y_i - s_i) X_i = X^T (y - s)

        where s is the n-vector with components s_i

[Now we work out the Hessian too, so we can use Newton's method.]

  grad^2 J(w) = - sum_{i=1}^n s_i (1 - s_i) X_i X_i^T = - X^T Omega X

        where Omega is the diagonal matrix with entries s_i (1 - s_i)

  Omega is +ve definite for all w, so X^T Omega X is +ve semidefinite, so
  -J(w) is convex.  [Therefore, Newton's method finds an optimal point if it
  converges at all.]

Newton's method:
  Solve for e in normal equations:  (X^T Omega X) e = X^T (y - s)
  w <- w + e
  repeat until "convergence"        [Remember: Omega and s are fns of w]
[Notice that this looks a lot like weighted least squares, but the weight matrix
Omega and the right-hand-side vector y - s change every iteration.]

An example of _iteratively_reweighted_least_squares_.

Omega prioritizes samples with s_i near 0.5; tunes out samples near 0/1.

[In other words, samples near the decision boundary have the biggest effect on
the iterations.  Meanwhile, the iterations move the decision boundary; in turn,
that movement may change which samples have the most influence.  In the end,
only the samples near the decision boundary make a big contribution to the
logistic fit.]
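[A sketch of this iteratively reweighted least squares loop in numpy, assuming
0/1 labels y and a design matrix X that already includes the fictitious
dimension; the function names and the fixed iteration count are mine.]

import numpy as np

def sigmoid(gamma):
    return 1.0 / (1.0 + np.exp(-gamma))

def logistic_newton(X, y, iterations=20):
    """Fit logistic regression by Newton's method: each iteration solves
    (X^T Omega X) e = X^T (y - s) with Omega = diag(s_i (1 - s_i)),
    then updates w <- w + e.  Omega and s are recomputed every iteration."""
    w = np.zeros(X.shape[1])
    for _ in range(iterations):
        s = sigmoid(X @ w)
        omega = s * (1.0 - s)
        A = X.T @ (omega[:, None] * X)       # X^T Omega X without forming diag(omega)
        e = np.linalg.solve(A, X.T @ (y - s))
        w = w + e
    return w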
Idea: If n very large, save time by using a random subsample of the samples per iteration. Increase sample size as you go.
[The principle is that the first iteration isn't going to take you all the way to the optimal point, so why waste time looking at all the samples? Whereas the last iteration should be the most accurate one.]
LDA vs. logistic regression
Advantages of LDA:
- For well-separated classes, LDA stable; log. reg. surprisingly unstable
- > 2 classes easy & elegant, log. reg. needs modifying ("softmax regression")
- Slightly more accurate when classes nearly normal, especially if n is small
Advantages of log. reg.:
- More emphasis on decision boundary [though not as much as SVMs] [Show figure of log.reg. vs. LDA for tight boundary (logregvsLDAauni.pdf)]
- Hence less sensitive to outliers
- Easy & elegant treatment of "partial" class membership; LDA is all-or-nothing
- More robust on some non-Gaussian distributions (e.g. large skew)
lec 11.4 ROC CURVES (for test sets)
========
[Show ROC curve (ROC.pdf).]
[This is a _ROC_curve_. That stands for _receiver_operating_characteristics_,
which is an awful name but we're stuck with it for historical reasons.
It is made by running a classifier on the test set or validation set.
It shows the rate of false positives vs. true positives for a range of
settings.
We assume there is a knob we can turn to trade off these two types of error;
in this case, that knob is the probability threshold for Gaussian discriminant
analysis or logistic regression.
However, neither axis is the probability threshold.]
x-axis: "false positive rate = % of -ve classified as +ve"
y-axis: "true positive rate = % of +ve classified as +ve aka sensitivity"
"false negative rate": vertical distance from curve to top
"_specificity_": horizontal distance from curve to right
[You generate this curve by trying every probability threshold; for each
threshold, measure the false positive & true positive rates and plot a point.]
upper right corner: "always classify +ve (Pr >= 0)"
lower left corner: "always classify -ve (Pr > 1)"
diagonal: "random classifiers"
[A rough measure of a classifier's effectiveness is the area under the curve.
For a classifier that is always correct and never assigns a probability
between 0 and 1, the area under the curve is one.]
[IMPORTANT: In practice, the trade-off between false negatives and false positives is usually negotiated by choosing a point on this plot, based on real test data, and NOT by taking the choice of threshold that's best in theory.]
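[A sketch of how such a curve could be generated in numpy by sweeping every
probability threshold on a test set; the function name and toy numbers are
illustrative, not from the lecture.]

import numpy as np

def roc_points(probs, labels):
    """For every threshold, record the false positive rate (% of -ve
    classified +ve) and true positive rate (% of +ve classified +ve)."""
    thresholds = np.concatenate(([np.inf], np.sort(probs)[::-1], [-np.inf]))
    fpr, tpr = [], []
    for t in thresholds:
        predicted_pos = probs >= t
        fpr.append(np.mean(predicted_pos[labels == 0]))
        tpr.append(np.mean(predicted_pos[labels == 1]))
    return np.array(fpr), np.array(tpr)

# usage: probs are posterior estimates on a test set; labels are the true classes
probs  = np.array([0.9, 0.8, 0.6, 0.55, 0.4, 0.3, 0.2])
labels = np.array([1,   1,   0,   1,    0,   1,   0  ])
fpr, tpr = roc_points(probs, labels)
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)   # area under the curve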
lec 11.5 LEAST-SQUARES POLYNOMIAL REGRESSION
=================================
Replace each X_i with a feature vector Phi(X_i) containing all terms of degree 1...p

  e.g.  Phi(X_i) = [ X_i1^2   X_i1 X_i2   X_i2^2   X_i1   X_i2   1 ]^T
[Notice that we've added the fictitious dimension "1" here, so we don't need to add it again. This basis covers all polynomials quadratic in X_i1, X_i2.]
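[A small numpy sketch of this quadratic lift for two input features; the
function name is mine.]

import numpy as np

def quadratic_features(X):
    """Map each sample X_i = (x1, x2) to
    Phi(X_i) = [x1^2, x1*x2, x2^2, x1, x2, 1],
    the degree-2 basis with the fictitious dimension already included."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1**2, x1 * x2, x2**2, x1, x2, np.ones(len(X))])

# usage: lift the design matrix, then run plain linear or logistic regression on Phi
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
Phi = quadratic_features(X)    # shape (2, 6)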
Can also use non-polynomial features (e.g. edge detectors). Otherwise just like linear or logistic regression.
Log. reg. + quadratic features = same logistic posteriors as QDA.
Very easy to overfit! [Show overfits (overunder.png; UScensusquartic.png)]