# ILLINOIS CS 446 - 091917.2 (8 pages)



- Pages: 8
- School: University of Illinois - Urbana
- Course: CS 446 - Machine Learning


CS 446: Machine Learning, Fall 2017
Lecture 8: Variable Selection
Lecturer: Sanmi Koyejo. Scribe: Sarah Christensen. September 19th, 2017.

## Ridge Regression

### Recap of MAP Estimation

Maximum a posteriori (MAP) estimation takes in a set of observations $D_n = \{(x_i, y_i)\}_{i=1}^n$ and seeks a parameter $\theta_{MAP}$ that maximizes the posterior distribution. An important assumption to recognize is that the model parameters here are treated as random variables drawn from a distribution:

$$\theta_{MAP} = \operatorname*{argmax}_\theta \; P(\theta \mid D_n)$$

Next, we can rewrite the above equation using Bayes' theorem:

$$\theta_{MAP} = \operatorname*{argmax}_\theta \; \frac{P(D_n \mid \theta)\, P(\theta)}{P(D_n)}$$

Since the denominator does not depend on $\theta$, we can ignore that term and take the log to get the log-likelihood form:

$$\theta_{MAP} = \operatorname*{argmax}_\theta \; \log P(D_n \mid \theta) + \log P(\theta)$$

Notice that this is similar to the maximum likelihood estimate for $\theta$, but has an additional term that incorporates a prior distribution over $\theta$.

### Ridge Regression

Now we introduce a regularized least squares regression method called ridge regression, where MAP estimation with a Gaussian prior is used to estimate the weight vector. More specifically, it is a linear regression where $y_i \sim \mathcal{N}(w^\top x_i, \sigma^2)$ and $w \sim \mathcal{N}(0, \tau^2 I)$. We have shown previously that

$$w_{MAP} = \operatorname*{argmin}_w \; \|y - Xw\|_2^2 + \lambda \|w\|_2^2, \qquad \text{where } \lambda = \frac{\sigma^2}{\tau^2} \tag{1}$$

We have also previously shown that a closed-form solution to this minimization problem exists. To aid with visualization, Equation (1) can be rewritten with Lagrange multipliers:

$$w_{MAP} = \operatorname*{argmin}_w \; \|y - Xw\|_2^2 \quad \text{subject to } \|w\|_2^2 \le \mu, \text{ for some } \mu \ge 0$$

*Figure 1: A graphical interpretation of ridge regression in two dimensions. The MAP estimator $w_{MAP}$ can be found at the intersection of the contour plot and the $\ell_2$ ball. This figure was adapted from Singh and Poczos (2014). (Figure not reproduced in this preview.)*

Notice that Equation (1) looks similar to ordinary least squares (OLS), but there is an extra term that shifts the correlation matrix. OLS can suffer from overfitting, and small changes in the observed data can sometimes lead to big changes in the estimated parameters. Ridge regression is an instance of shrinkage, or regularization, which tries to address this
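The closed-form minimizer of the ridge objective is $w = (X^\top X + \lambda I)^{-1} X^\top y$. As a minimal numerical sketch on synthetic data (the data, seed, and variable names here are illustrative, not from the lecture):

```python
import numpy as np

# Synthetic data for illustration only (not from the lecture).
rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

lam = 1.0  # regularization strength; lambda = sigma^2 / tau^2 in the MAP view

# Closed-form ridge solution: w = (X^T X + lambda I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Ordinary least squares for comparison (lambda = 0)
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Shrinkage: the ridge solution's l2 norm is no larger than the OLS solution's
print(np.linalg.norm(w_ridge) <= np.linalg.norm(w_ols))  # True
```

Solving the linear system with `np.linalg.solve` is preferred over forming the explicit matrix inverse, for numerical stability.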
issue. The additional term in the ridge regression objective is an $\ell_2$ regularizer that penalizes all $w_i$ (i.e., makes the $w_i$'s smaller in magnitude), but penalizes larger $w_i$'s more. This dampening reduces the effect a single feature can have on the results. In practice, this regularization can improve performance.

Note that the mean squared error (MSE) is equal to the squared bias plus the variance. Regularization reduces variance at the expense of an increase in bias. An increase in bias can still be acceptable if the mean squared error is ultimately reduced overall.

## Variable Selection

### Introduction

Variable selection, also known as feature selection, is the process of selecting the set of relevant variables to be used in a model such as a linear regression. Some variables that are available may not be relevant; moreover, including
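One motivation for variable selection is that ridge regression shrinks coefficients but does not set any of them exactly to zero, so irrelevant features are kept in the model. A small sketch, again on illustrative synthetic data where only the first two of six features matter:

```python
import numpy as np

# Synthetic data for illustration: only the first 2 of 6 features are relevant.
rng = np.random.default_rng(1)
n, d = 100, 6
X = rng.normal(size=(n, d))
w_true = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0])
y = X @ w_true + 0.1 * rng.normal(size=n)

lam = 5.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Ridge shrinks the irrelevant coefficients toward zero but leaves them
# nonzero; variable selection methods instead drop such features entirely.
print(np.all(w_ridge != 0.0))  # True: all six coefficients remain nonzero
```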
