# ILLINOIS CS 446 - 091917.2 (8 pages)

CS446 Machine Learning, Fall 2017 — Lecture 8: Variable Selection. Lecturer: Sanmi Koyejo. Scribe: Sarah Christensen. Sept. 19th, 2017.

## Ridge Regression

### Recap of MAP Estimation

Maximum a posteriori (MAP) estimation takes in a set of observations $D_n = \{(x_i, y_i)\}_{i=1}^n$ and seeks a parameter $\theta_{MAP}$ that maximizes the posterior distribution. An important assumption to recognize is that the model parameters $\theta$ are here treated as random variables drawn from a distribution:

$$\theta_{MAP} = \underset{\theta}{\operatorname{argmax}}\; P(\theta \mid D_n)$$

Next, we can rewrite the above equation using Bayes' Theorem:

$$\theta_{MAP} = \underset{\theta}{\operatorname{argmax}}\; \frac{P(D_n \mid \theta)\, P(\theta)}{P(D_n)}$$

Since the denominator does not depend on $\theta$, we can ignore this term and take the log to get the log-likelihood form:

$$\theta_{MAP} = \underset{\theta}{\operatorname{argmax}}\; \log P(D_n \mid \theta) + \log P(\theta)$$

Notice that this is similar to the maximum likelihood estimate for $\theta$, but it has an additional term that incorporates a prior distribution over $\theta$.

### Ridge Regression

Now we introduce a regularized least squares regression method called Ridge regression, where MAP estimation is used with a Gaussian prior to estimate the weight vector. More specifically, it is a linear regression where $y_i \sim N(w^T x_i, \sigma^2)$ and $w \sim N(0, \tau^2 I)$. We have shown previously that

$$w_{MAP} = \underset{w}{\operatorname{argmin}}\; \|y - Xw\|_2^2 + \lambda \|w\|_2^2 \tag{1}$$

where $\lambda = \sigma^2 / \tau^2$. We have also previously shown that a closed-form solution to this minimization problem exists. To aid with visualization, Equation (1) can be rewritten with Lagrange multipliers:

$$w_{MAP} = \underset{w}{\operatorname{argmin}}\; \|y - Xw\|_2^2 \quad \text{subject to } \|w\|_2^2 \le c, \text{ for some } c \ge 0$$

**Figure 1:** A graphical interpretation of Ridge regression in two dimensions. The MAP estimator $w_{MAP}$ can be found at the intersection of the contour plot and the $\ell_2$ ball. This figure was adapted from Singh and Poczos (2014).

Notice that Equation (1) looks similar to that of ordinary least squares (OLS), but there is an extra term that shifts the correlation matrix. OLS can suffer from a problem of overfitting, and small changes in the observed data can sometimes lead to big changes in the estimated parameters. Ridge regression is an instance of *shrinkage* or *regularization*, which tries to address this
issue. The additional term in Ridge regression is an $\ell_2$ regularizer that penalizes all $w_i$ (i.e., makes the $w_i$'s smaller in magnitude), but penalizes larger $w_i$'s more. This dampening reduces the effect a single feature can have on the results. In practice, this regularization can improve performance. Note that the mean squared error (MSE) is equal to the bias squared plus the variance. Regularization reduces variance at the expense of an increase in bias. An increase in bias can still be acceptable if, overall, the mean squared error is ultimately reduced.

## Variable Selection

### Introduction

Variable selection, also known as feature selection, is the process of selecting the set of relevant variables to be used in a model such as a linear regression. Some variables that are available may not be relevant; moreover, including irrelevant variables may give worse predictions in practice. Having a systematic approach to variable selection can allow us to interpret the remaining variables in the model as having significance, and it also has a regularization effect, since only a few variables end up being important for prediction.

One natural, systematic way to perform variable selection might be to minimize the following, which is similar to Equation (1) but with the $\ell_0$ norm:

$$w^* = \underset{w}{\operatorname{argmin}}\; \|y - Xw\|_2^2 + \lambda \|w\|_0 \tag{2}$$

Again, for visualization purposes, this can be rewritten using Lagrange multipliers:

$$w^* = \underset{w}{\operatorname{argmin}}\; \|y - Xw\|_2^2 \quad \text{subject to } \|w\|_0 \le c, \text{ for some } c \ge 0$$

**Figure 2:** A graphical interpretation in two dimensions of the $\ell_0$ norm regularizer. The optimal estimator $w^*$ can be found at the intersection of the contour plot and the $\ell_0$ ball. In this picture, since the contours intersect the $w_2$ axis first, we see that $w_1$ will be set to zero. This figure was adapted from Singh and Poczos (2014).

This approach to variable selection has taken the discrete problem of choosing variables and transformed it into a continuous optimization problem. Figure 2 motivates this idea in two dimensions by demonstrating that if the
contour plot intersects one axis first, the other variable will be assigned a weight of 0. The problem is that this kind of combinatorial optimization problem is still hard: we don't know how to find an exact answer without an exponential-size search. However, this problem is extremely popular and has been well studied, so many heuristics for solving it have been developed. We present several of these heuristics in the remainder of these lecture notes.

### Forward and Reverse Selection Algorithms

Forward selection greedily adds variables one at a time, trying to optimize prediction accuracy. Reverse selection (also sometimes referred to as backwards elimination) greedily removes variables one at a time, likewise trying to optimize prediction accuracy. Both of these methods are heuristics for solving Equation (2), and therefore do not guarantee an optimal solution; however, there are ways to find a bound on how far the solution is from optimal.

Before we introduce the algorithms, we must define a variation on the formula for the residual sum of squares (RSS) for a set of indices $v$. One way to remove the impact of a variable is to set its weight to zero, as we saw above:

$$RSS_{zeroweights}(w, v) = \|y - X w(v)\|_2^2, \quad \text{where } w(v)_i = \begin{cases} w_i & \text{if } i \in v \\ 0 & \text{if } i \notin v \end{cases}$$

Alternatively, instead of padding with 0 weights for the variables we wish to exclude, we can instead shrink $X$ and remove these variables altogether. This is what we use in practice and in the algorithms below:

$$RSS(w, v) = \|y - X_v w_v\|_2^2, \quad \text{where } X_v = [X_i]_{i \in v} \text{ and } w_v = [w_i]_{i \in v}$$

Now we are ready to give the pseudocode for the Forward Selection Algorithm. Notice that variables get added to the set $v$ one at a time, and at each pass through the while loop the new variable that minimizes the RSS is chosen (i.e., the algorithm is greedy).

**Algorithm 1: Forward Selection**

```
procedure ForwardSelection
    v ← ∅
    s ← {1, 2, ..., d}
    d′ ← number of features to be selected for the model (must be less than d)
    while |v| < d′ do
        k* ← argmin_{k ∈ s} RSS(w, v ∪ {k})
        v ← v ∪ {k*}
        s ← s \ {k*}
    return v
```

The Reverse Selection
Algorithm is very similar to the Forward Selection Algorithm, except that variables get removed from the full set of variables one at a time. Each pass through the while loop chooses to remove the variable which, when removed, minimizes the RSS.

**Algorithm 2: Reverse Selection**

```
procedure ReverseSelection
    v ← {1, 2, ..., d}
    d′ ← number of features to be selected for the model (must be less than d)
    while |v| > d′ do
        k* ← argmin_{k ∈ v} RSS(w, v \ {k})
        v ← v \ {k*}
    return v
```

Here we have discussed forward and reverse selection in the context of least squares regression.
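The two greedy loops above can be sketched in NumPy. This is a minimal illustration, not the notes' own code: the names `rss`, `forward_selection`, and `reverse_selection` are hypothetical, and the inner least-squares fit (via `np.linalg.lstsq`) stands in for refitting the model on the shrunken design matrix $X_v$.

```python
import numpy as np

def rss(X, y, v):
    """RSS using only the columns indexed by v (the 'shrink X' variant):
    refit w on X_v by least squares, then return ||y - X_v w_v||^2."""
    if not v:
        return float(y @ y)  # no features: residual is y itself
    Xv = X[:, sorted(v)]
    w, *_ = np.linalg.lstsq(Xv, y, rcond=None)
    r = y - Xv @ w
    return float(r @ r)

def forward_selection(X, y, d_prime):
    """Algorithm 1: greedily add the feature that most reduces RSS."""
    d = X.shape[1]
    v, s = set(), set(range(d))
    while len(v) < d_prime:
        k_star = min(s, key=lambda k: rss(X, y, v | {k}))
        v.add(k_star)
        s.remove(k_star)
    return v

def reverse_selection(X, y, d_prime):
    """Algorithm 2: greedily remove the feature whose removal
    least increases RSS."""
    v = set(range(X.shape[1]))
    while len(v) > d_prime:
        k_star = min(v, key=lambda k: rss(X, y, v - {k}))
        v.remove(k_star)
    return v
```

On noiseless data generated from two of five features, both heuristics recover the relevant pair, though in general (correlated features, noise) the greedy choices need not be optimal, as the notes point out.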

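As a final aside, the notes state earlier that Equation (1) admits a closed-form solution. A minimal NumPy sketch, assuming the standard ridge formula $w_{MAP} = (X^T X + \lambda I)^{-1} X^T y$ (the function name `ridge_map` is illustrative):

```python
import numpy as np

def ridge_map(X, y, lam):
    """Closed-form ridge / MAP estimate: solve (X^T X + lam I) w = X^T y.
    lam = sigma^2 / tau^2 trades data fit against the Gaussian prior."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```

With `lam = 0` this reduces to OLS; as `lam` grows, the estimate shrinks toward zero, which is exactly the dampening effect described above.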