
lec 10 REGRESSION aka Fitting Curves to Data

Classification:  given sample x, predict class (often binary)
Regression:      given sample x, predict a numerical value

[Classification gives a discrete prediction, whereas regression gives us a quantitative prediction, usually on a continuous scale.] [We've already seen an example of regression in Gaussian discriminant analysis. QDA and LDA don't just give us a classifier; they also give us the probability that a particular class label is correct. So QDA and LDA implicitly do regression on probability values.]

  • Choose form of regression fn h(x; p) with parameters p (h = hypothesis)
    • like predictor fn in classification [e.g. linear, quadratic, logistic in x]
  • Choose a cost fn (objective fn) to optimize
    • usually based on a loss fn; e.g. risk fn = expected loss
Some regression fns:
(1)  linear:     h(x; w, alpha) = w^T x + alpha
(2)  polynomial
     [equivalent to linear regression with added polynomial features]
(3)  logistic:   h(x; w, alpha) = s(w^T x + alpha)
                                             1
     recall:     logistic fn s(gamma) = -----------
                                             -gamma
                                        1 + e
     [This choice is interesting.  You'll recall that LDA produces a posterior
      probability fn with this equation.  So this equation seems to be
      a natural form for modeling certain probabilities.  If we want to model
      probabilities, sometimes we use LDA; but alternatively, we could skip
      fitting Gaussians to samples, and instead just directly try to fit
      a logistic function to a set of probabilities.]
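
[A small numerical sketch of the logistic function, added for illustration: a NumPy
 implementation of s(gamma) that avoids overflow for large |gamma|.]

  import numpy as np

  def logistic(gamma):
      """s(gamma) = 1 / (1 + e^(-gamma)), evaluated stably for large |gamma|."""
      gamma = np.asarray(gamma, dtype=float)
      out = np.empty_like(gamma)
      pos = gamma >= 0
      out[pos] = 1.0 / (1.0 + np.exp(-gamma[pos]))    # exponent <= 0: no overflow
      expg = np.exp(gamma[~pos])                      # exponent <  0: no overflow
      out[~pos] = expg / (1.0 + expg)
      return out

  # s(0) = 0.5; s(gamma) -> 1 as gamma -> +inf and -> 0 as gamma -> -inf
  print(logistic(np.array([-100.0, 0.0, 100.0])))     # [~0.0, 0.5, ~1.0]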

Some loss fns:  let z be prediction h(x); y be true value
(A)  L(z, y) = (z - y)^2                        _squared_error_
(B)  L(z, y) = |z - y|                          _absolute_error_
(C)  L(z, y) = - y ln z - (1 - y) ln (1 - z)    _logistic_loss_:  y in [0, 1],
                                                                  z in (0, 1)
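
[A minimal NumPy sketch of these three loss fns, added for illustration; z and y
 may be scalars or arrays of the same shape.]

  import numpy as np

  def squared_error(z, y):          # (A)
      return (z - y) ** 2

  def absolute_error(z, y):         # (B)
      return np.abs(z - y)

  def logistic_loss(z, y):          # (C); requires z in (0, 1), y in [0, 1]
      return -y * np.log(z) - (1 - y) * np.log(1 - z)
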
Some cost fns to minimize:
            1  n
(a)  J(h) = - sum L(h(X ), y )                  _mean_loss_
            n i=1      i    i                   [you can leave out the "1/n"]
             n
(b)  J(h) = max L(h(X ), y )                    _maximum_loss_
            i=1      i    i
             n
(c)  J(h) = sum omega_i L(h(X ), y )            _weighted_sum_ [some samples
            i=1              i    i             more important than others]
            1  n                           2
(d)  J(h) = - sum L(h(X ), y ) + lambda |w|     l  _penalized_/_regularized_
            n i=1      i    i                    2
            1  n
(e)  J(h) = - sum L(h(X ), y ) + lambda |w|     l  _penalized_/_regularized_
            n i=1      i    i              l_1   1
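
[A sketch of cost fns (a), (c), and (d) built on top of an arbitrary loss fn L and
 regression fn h; the function names and the weights argument are mine, chosen
 purely for illustration.]

  import numpy as np

  def mean_loss(L, h, X, y):                      # (a)
      return np.mean([L(h(Xi), yi) for Xi, yi in zip(X, y)])

  def weighted_sum(L, h, X, y, weights):          # (c): omega_i per sample
      return sum(w_i * L(h(Xi), yi) for w_i, Xi, yi in zip(weights, X, y))

  def l2_penalized(L, h, X, y, w, lam):           # (d): mean loss + lambda |w|^2
      return mean_loss(L, h, X, y) + lam * np.dot(w, w)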

Least-squares linear regr.:  (1) + (A) + (a) \
Weighted least-squ. linear:  (1) + (A) + (c) | quadratic; minimize w/calculus
Ridge regression:            (1) + (A) + (d) /
Lasso:                       (1) + (A) + (e) \ minimize w/gradient descent
Logistic reg.:               (3) + (C) + (a) /
Least absolute deviations:   (1) + (B) + (a) \ minimize w/linear programming
Chebyshev criterion:         (1) + (B) + (b) /

[I have given you several choices of regression fn form, several choices of loss fn, and several choices of objective fn. These are interchangeable parts where you can snap one part out and replace it with a different one. But the optimization algorithm and its speed depend crucially on which parts you pick. Let's see some examples.]

lec 10.2 LEAST-SQUARES LINEAR REGRESSION

linear regression fn (1) + squared loss fn (A) + cost fn (a).

-----------------------------------------------------------
|                               n                       2 |
| Find w, alpha that minimizes sum (X  . w + alpha - y )  |
|                              i=1   i                i   |
-----------------------------------------------------------

[Show linear regression example (linregress.pdf)]

Convention:  X is n-by-d _design_matrix_ of samples
             y is n-vector of dependent scalars.
  -                                             -                 -     -
  | X_11   X_12   ...   |  X_1j  |   ...   X_1d |                 | y_1 |
  | X_21   X_22         |  X_2j  |         X_2d |                 | y_2 |
  |  ...                |        |              |                 |     |
  | --------------------+--------+------------- |           T     | ... |
  | X_i1   X_i2         |  X_ij  |         X_id | < sample X      |     |
  | -----------------------------+------------- |           i     |     |
  |  ...                |        |              |                 |     |
  | X_n1   X_n2   ...   |  X_nj  |         X_nd |                 | y_n |
  -                                             -                 -     -
                           ^                                        ^
                           feature column X                         y
                                           *j
Usually n > d.

Recall fictitious dimension trick [from Lecture 3]: replace x . w + alpha with

                    -       -
                    |  w_1  |
  [ x_1  x_2  1 ] . |  w_2  |
                    | alpha |
                    -       -

Now X is an n-by-(d+1) matrix;   w is a (d+1)-vector.
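
[A one-line version of the trick in NumPy (my illustration): append a column of
 ones to the design matrix so that alpha becomes the last component of w.]

  import numpy as np

  X = np.array([[1.0, 2.0],
                [3.0, 4.0],
                [5.0, 6.0]])                        # n = 3 samples, d = 2 features
  X1 = np.hstack([X, np.ones((X.shape[0], 1))])     # now n-by-(d+1)
  # h(x; w) = x . w[:d] + w[d], written as the single product X1 @ w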

  -----------------------------------
  |                               2 |
  | Find w that minimizes |Xw - y|  | = RSS(w), for _residual_sum_of_squares_
  -----------------------------------

[We know X and y, but w is unknown; we want to solve for w.]

Optimize by calculus:

                   T T       T      T
minimize RSS(w) = w X Xw - 2y Xw + y y

                    T       T
    grad RSS    = 2X Xw - 2X y = 0

                    T      T
              ==>  X Xw = X y                   <== the _normal_equations_
                   \_/\_____/                       [w unknown; X & y known]
        (d+1)-by-(d+1)  (d+1)-vectors
    T
If X X is singular, problem is underconstrained [because the samples all lie on
a common hyperplane.  Notice that X^T X is always positive semidefinite.]

                                     T  -1  T
We use a linear solver to find w = (X X)   X y         [never actually invert!]
                                   \________/
                                   X^+, the _pseudoinverse_ of X, (d+1)-by-n

[X is usually not square, so it can't have an inverse. However, every X has a pseudoinverse X^+, and if X has rank d+1, then X^+ is a "left inverse".]
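
[A sketch of this step in NumPy, assuming X already carries the fictitious all-ones
 column: solve the normal equations with a linear solver, or use np.linalg.lstsq,
 which is better conditioned.]

  import numpy as np

  def least_squares(X, y):
      # Solve X^T X w = X^T y with a linear solver (never form the inverse).
      return np.linalg.solve(X.T @ X, X.T @ y)

  def least_squares_lstsq(X, y):
      # Equivalent but more numerically robust; also copes with singular X^T X.
      return np.linalg.lstsq(X, y, rcond=None)[0]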

           +      T  -1  T
Observe:  X X = (X X)   X X = I   <= (d+1)-by-(d+1)   [which explains its name]

                                        ^                  ^          +
Observe:  the predicted values of y are y  = h(x )   ==>   y = Xw = XX y = Hy
                                         i      i
                      +
          where H = XX  is called the _hat_matrix_ because it puts the hat on y
                        n-by-n [and if n > d+1, it is not the identity matrix!]
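
[A runnable sketch (mine, not from the notes) of the hat matrix in NumPy:
 H = X X^+ built with np.linalg.pinv, plus a check that Hy equals the fitted
 values Xw.]

  import numpy as np

  rng = np.random.default_rng(0)
  X = np.hstack([rng.normal(size=(5, 2)), np.ones((5, 1))])   # n = 5, d + 1 = 3
  y = rng.normal(size=5)

  w = np.linalg.solve(X.T @ X, X.T @ y)
  H = X @ np.linalg.pinv(X)                 # n-by-n hat matrix X X^+
  print(np.allclose(H @ y, X @ w))          # True: H puts the hat on y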

Interpretation as a projection:

  ^
- y = Xw in R^n is a linear combination of columns of X (one per feature)
- For fixed X, varying w, Xw is subspace of R^n spanned by columns

               O y
       ________|___________   Figure in n-dimensional space (1 dim/sample)
      /       _|          /   NOT d-dimensional feature space (row space of X).
     /       | | ^       /
    /        ' O y = Xw /  <== subspace spanned by X's columns
   /                   /       (at most d+1 dimensions)
  ---------------------

              ^                  ^
- Minimizing |y - y| finds point y nearest y on subspace
  ==>  projects y onto subspace
       [the vertical line is the direction of projection and the error vector]
                                                              T
- Error is smallest when line is perpendicular to subspace:  X (Xw - y) = 0
                                                      ==> the normal equations!
- Hat matrix H does the projecting.  [Also sometimes called projection matrix.]
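
[Another small check of the projection picture, again my own sketch: the residual
 Xw - y is perpendicular to every column of X (the normal equations), and applying
 H twice is the same as applying it once, as expected for a projection.]

  import numpy as np

  rng = np.random.default_rng(1)
  X = np.hstack([rng.normal(size=(6, 2)), np.ones((6, 1))])
  y = rng.normal(size=6)

  w = np.linalg.solve(X.T @ X, X.T @ y)
  H = X @ np.linalg.pinv(X)

  print(np.allclose(X.T @ (X @ w - y), 0))  # residual is orthogonal to X's columns
  print(np.allclose(H @ H, H))              # projecting twice changes nothing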

Advantages:

  • Easy to compute; just solve a linear system.
  • Unique, stable solution. [Except when the problem is underconstrained.]

Disadvantages:

  • Very sensitive to outliers, because error is squared!
  • Fails if X^T X is singular.

[Least-squares linear regression was apparently first posed and solved in 1801 by the great mathematician Carl Friedrich Gauss, who used least squares to predict the trajectory of the planetoid Ceres. A paper he wrote on the topic is regarded as the birth of modern linear algebra.]

lec 10.3 LOGISTIC REGRESSION (David Cox, 1958)

Logistic regression fn (3) + logistic loss fn (C) + cost fn (a).
Fits "probabilities" in range (0, 1).

Usually used for classification. The input y_i's can be probabilities, but in most applications they're all 0 or 1.

QDA, LDA:             generative models
logistic regression:  discriminative model

With X and w including the fictitious dimension; alpha is w's last component...

-------------------------------------------------------------
| Find w that maximizes                                     |
|                                                           |
|      n  /                                               \ |
| J = sum | y  ln s(X  . w) + (1 - y ) ln (1 - s(X  . w)) | |
|     i=1 \  i       i              i             i       / |
-------------------------------------------------------------

[Note that we are maximizing, not minimizing, this function because I've flipped the sign of the logistic loss (C).] [Show log loss plots (logloss0.pdf, logloss.7.pdf)]

-J(w) is convex! [J is "concave".] Solve by gradient ascent.

[To do gradient ascent, we'll need to compute some derivatives.]

                                              -gamma
                   d          1              e
  s'(gamma)  =  -------  -----------  =  --------------
                d gamma       -gamma           -gamma 2
                         1 + e           (1 + e      )

             =  s(gamma) (1 - s(gamma))          [Show s' plot (dlogistic.pdf)]
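
[A quick finite-difference sanity check of s'(gamma) = s(gamma)(1 - s(gamma));
 just my own sketch, not part of the notes.]

  import numpy as np

  s = lambda g: 1.0 / (1.0 + np.exp(-g))
  gamma, eps = 0.7, 1e-6
  numeric  = (s(gamma + eps) - s(gamma - eps)) / (2 * eps)   # central difference
  analytic = s(gamma) * (1 - s(gamma))
  print(abs(numeric - analytic) < 1e-8)                      # True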

Let s  = s(X  . w)
     i      i
                    / y_i            1 - y_i          \
    grad  J  =  sum | --- grad s_i - ------- grad s_i |
        w           \ s_i            1 - s_i          /

                    / y_i   1 - y_i \
             =  sum | --- - ------- | s  (1 - s ) X
                    \ s_i   1 - s_i /  i       i   i

             =  sum (y  - s ) X
                      i    i   i
                                         n
Gradient ascent rule:  w <- w + epsilon sum (y  - s(X  . w)) X
                                        i=1   i      i        i

Stochastic gradient ascent:  w <- w + epsilon (y  - s(X  . w)) X
                                                i      i        i

Works best if we shuffle samples in random order, process one by one. For very large n, sometimes converges before we visit all samples!
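
[A sketch of both update rules in NumPy; the step size epsilon, the epoch and
 iteration counts, and the per-epoch shuffle are my own illustrative choices.
 X is assumed to include the fictitious all-ones column, and w starts at zero,
 as noted below.]

  import numpy as np

  def s(gamma):
      return 1.0 / (1.0 + np.exp(-gamma))

  def logistic_sgd(X, y, epsilon=0.1, epochs=50, seed=0):
      n, d = X.shape
      w = np.zeros(d)                              # starting from w = 0
      rng = np.random.default_rng(seed)
      for _ in range(epochs):
          for i in rng.permutation(n):             # shuffle samples each epoch
              w += epsilon * (y[i] - s(X[i] @ w)) * X[i]
      return w

  def logistic_batch(X, y, epsilon=0.1, iters=500):
      w = np.zeros(X.shape[1])
      for _ in range(iters):
          w += epsilon * X.T @ (y - s(X @ w))      # batch gradient ascent rule
      return w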

[This looks a lot like the perceptron learning rule. The only difference is that the "- s_i" part is new.]

Starting from w = 0 works well in practice.

[Show mwascom's logistic regression plot (problogistic.png) from http://stackoverflow.com/questions/28256058/plotting-decision-boundary-of-logistic-regression]

Date: 2017-06-22 Fri

Author: yiddi

Created: 2018-08-12 Sun 14:57

Validate