CS 446: Machine Learning, Fall 2017

Lecture 8: Variable Selection

Lecturer: Sanmi Koyejo        Scribe: Sarah Christensen, Sept. 19th, 2017

Ridge Regression

Recap of MAP Estimation

Maximum a posteriori (MAP) estimation takes in a set of observations D_n = \{(x_i, y_i)\}_{i=1}^n and seeks a parameter \theta_{MAP} that maximizes the posterior distribution. An important assumption to recognize is that the model parameters here are treated as random variables drawn from a distribution:

    \theta_{MAP} = \arg\max_{\theta} P(\theta \mid D_n)

Next, we can rewrite the above equation using Bayes' theorem:

    \theta_{MAP} = \arg\max_{\theta} \frac{P(D_n \mid \theta) P(\theta)}{P(D_n)}

Since the denominator does not depend on \theta, we can ignore this term and take the log, leaving the log-likelihood plus the log-prior:

    \theta_{MAP} = \arg\max_{\theta} \big[ \log P(D_n \mid \theta) + \log P(\theta) \big]

Notice that this is similar to the maximum likelihood estimate for \theta, but has an additional term that incorporates a prior distribution over \theta.

Ridge Regression

Now we introduce a regularized least squares regression method called ridge regression, where MAP estimation is used with a Gaussian prior to estimate the weight vector. More specifically, it is a linear regression where y_i \sim N(w^T x_i, \sigma^2) and w \sim N(0, \lambda^2 I). We have shown previously that

    w_{MAP} = \arg\min_{w} \|y - Xw\|_2^2 + \tau \|w\|_2^2, \quad \text{where } \tau = \frac{\sigma^2}{\lambda^2}.    (1)

We have also previously shown that a closed-form solution to this minimization problem exists. To aid with visualization, Equation 1 can be rewritten with Lagrange multipliers:

    w_{MAP} = \arg\min_{w} \|y - Xw\|_2^2 \quad \text{subject to } \|w\|_2^2 \le \lambda, \text{ for some } \lambda > 0.

Figure 1: A graphical interpretation of ridge regression in two dimensions. The MAP estimator w_{MAP} can be found at the intersection of the contour plot and the l2 ball. This figure was adapted from Singh and Poczos (2014).

Notice that Equation 1 looks similar to ordinary least squares (OLS), but the extra term shifts the correlation matrix X^T X by \tau I. OLS can suffer from overfitting, and small changes in the observed data can sometimes lead to big changes in the estimated parameters. Ridge regression is an instance of shrinkage or regularization, which tries to address this issue. The additional term in ridge regression is an l2 regularizer that penalizes all w_i (i.e., makes the w_i's smaller in magnitude), but penalizes larger w_i's more. This dampening reduces the effect a single feature can have on the results.

In practice, this regularization can improve performance. Note that the mean squared error (MSE) decomposes into the bias squared plus the variance. Regularization reduces variance at the expense of an increase in bias. An increase in bias can still be acceptable if the mean squared error is ultimately reduced overall.
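The closed-form solution referred to above works out to w_{MAP} = (X^T X + \tau I)^{-1} X^T y. As a minimal sketch (not the lecture's reference implementation), the NumPy snippet below computes this estimate; the function name ridge_regression, the choice tau = 1.0, and the synthetic data are illustrative assumptions.

```python
import numpy as np


def ridge_regression(X, y, tau):
    """Closed-form ridge estimate: w = (X^T X + tau * I)^{-1} X^T y.

    A sketch of the MAP estimator in Equation 1, where tau plays the role
    of sigma^2 / lambda^2. The name and interface are illustrative.
    """
    d = X.shape[1]
    # The tau * I term shifts the correlation matrix X^T X, which also keeps
    # the system well conditioned when features are highly correlated.
    return np.linalg.solve(X.T @ X + tau * np.eye(d), X.T @ y)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 50, 5
    X = rng.normal(size=(n, d))
    w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
    y = X @ w_true + 0.1 * rng.normal(size=n)

    w_ols = np.linalg.solve(X.T @ X, X.T @ y)   # tau = 0 recovers OLS
    w_map = ridge_regression(X, y, tau=1.0)     # shrunk toward zero
    print("||w_ridge|| =", np.linalg.norm(w_map), "||w_ols|| =", np.linalg.norm(w_ols))
```

Solving the regularized normal equations with np.linalg.solve avoids forming an explicit matrix inverse, which is the usual numerical practice.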
Variable Selection

Introduction

Variable selection, also known as feature selection, is the process of selecting the set of relevant variables to be used in a model (such as a linear regression). Some variables that are available may not be relevant; moreover, including irrelevant variables may give worse predictions in practice. Having a systematic approach to variable selection allows us to interpret the remaining variables in the model as having significance, and it also has a regularizing effect, since only a few variables end up being important for prediction.

One natural, systematic way to perform variable selection is to minimize the following objective, which is similar to Equation 1 but with the l0 norm:

    \hat{w} = \arg\min_{w} \|y - Xw\|_2^2 + \tau \|w\|_0    (2)

Again, for visualization purposes, this can be rewritten using Lagrange multipliers:

    \hat{w} = \arg\min_{w} \|y - Xw\|_2^2 \quad \text{subject to } \|w\|_0 \le \lambda, \text{ for some } \lambda > 0.

Figure 2: A graphical interpretation in two dimensions of the l0 norm regularizer. The optimal estimator \hat{w} can be found at the intersection of the contour plot and the l0 ball. In this picture, since the contour plot intersects the w_2 axis first, we see that w_1 will be set to zero. This figure was adapted from Singh and Poczos (2014).

This approach folds the discrete problem of choosing variables into a single penalized regression objective. Figure 2 motivates the idea in two dimensions by demonstrating that if the contour plot intersects one axis first, the other variable is assigned a weight of 0.

The problem is that this kind of combinatorial optimization is still hard: we do not know how to find an exact answer without an exponential-size search. However, the problem is extremely popular and well studied, so many heuristics for solving it have been developed. We present several of these heuristics in the remainder of these lecture notes.

Forward and Reverse Selection Algorithms

Forward selection greedily adds variables one at a time, trying to optimize prediction accuracy. Reverse selection, also sometimes referred to as backward elimination, greedily removes variables one at a time, likewise trying to optimize prediction accuracy. Both of these methods are heuristics for solving Equation 2 and therefore do not guarantee an optimal solution; however, there are ways to bound how far the solution is from optimal. Before we introduce the algorithms, we must define a variation on the formula for the residual sum of squares (RSS) for a set of indices v. One way to remove the impact of a variable is to set its weight to zero, as we saw above:

    RSS_{\text{zero-weights}}(\bar{w}_v) = \|y - X \bar{w}_v\|_2^2, \quad \text{where } (\bar{w}_v)_i = w_i \text{ if } i \in v \text{ and } (\bar{w}_v)_i = 0 \text{ if } i \notin v.

Alternatively, instead of padding with zero weights for the variables we wish to exclude, we can shrink X and remove these variables altogether. This is what we use in practice and in the algorithms below:

    RSS(\bar{w}_v) = \|y - X_v \bar{w}_v\|_2^2, \quad \text{where } X_v = \{X_i \mid i \in v\} \text{ and } \bar{w}_v = (w_i)_{i \in v}.

Now we are ready to give the pseudocode for the Forward Selection Algorithm. Notice that variables get added to the set v one at a time, and on each pass through the while loop the new variable that minimizes the RSS is chosen (i.e., the algorithm is greedy).

Algorithm 1 Forward Selection
    v ← ∅
    s ← [d] = {1, 2, ..., d}
    λ ← number of features to be selected for the model (must be less than d)
    while |v| < λ do
        k* ← argmin_{k ∈ s} RSS(w_{v ∪ {k}})
        v ← v ∪ {k*}
        s ← s \ {k*}
    return v

The Reverse Selection Algorithm is very similar to the Forward Selection Algorithm, except that variables get removed from the full set of variables one at a time. Each pass through the while loop chooses to remove the variable which, when removed, results in the smallest RSS.
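For concreteness, here is a minimal NumPy sketch that mirrors Algorithm 1. The helper rss refits ordinary least squares on the selected columns, matching the "shrink X" version of RSS(\bar{w}_v) defined above; the function names, the use of np.linalg.lstsq for the refit, and the synthetic example are illustrative assumptions rather than the lecture's reference implementation.

```python
import numpy as np


def rss(X_v, y):
    """Residual sum of squares after refitting least squares on the columns X_v."""
    w_v, *_ = np.linalg.lstsq(X_v, y, rcond=None)
    return float(np.sum((y - X_v @ w_v) ** 2))


def forward_selection(X, y, num_features):
    """Greedy forward selection, a sketch of Algorithm 1.

    Adds one variable at a time, always choosing the column whose inclusion
    yields the smallest RSS, until num_features columns have been selected.
    """
    d = X.shape[1]
    v = []                    # v <- empty list of selected indices
    s = set(range(d))         # s <- {0, 1, ..., d-1}, remaining candidates
    while len(v) < num_features:
        # k* = argmin over k in s of RSS(w_{v union {k}})
        k_star = min(s, key=lambda k: rss(X[:, v + [k]], y))
        v.append(k_star)      # v <- v union {k*}
        s.remove(k_star)      # s <- s \ {k*}
    return v


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 6))
    y = 2.0 * X[:, 0] - 3.0 * X[:, 3] + 0.1 * rng.normal(size=100)
    # Only columns 0 and 3 carry signal, so we expect them to be selected.
    print(forward_selection(X, y, num_features=2))
```

Refitting a least-squares model on every candidate subset is the simplest way to realize the greedy step; a reverse-selection sketch would instead start from all columns and, on each pass, drop the column whose removal hurts the RSS the least.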

