CS446: Machine Learning, Fall 2017

Lecture 1: Forward/Backward Selection; Variable Selection

Lecturer: Sanmi Koyejo                     Scribe: Dipali Ranjan, Sept. 19th, 2017

Recap

1.1 MAP

• The key idea behind Maximum A Posteriori (MAP) estimation is to find the parameter estimate that maximizes the posterior for the given data D_n = {(x_i, y_i)}_{i=1}^n, i.e. we choose the value that is most probable given the observed data and our prior beliefs.

• We assume that θ is a random variable:

      θ̂_MAP = argmax_θ P(θ | D_n)

  Since we are maximizing, we do not need the full posterior, only its peak.

            = argmax_θ  P(D_n | θ) P(θ) / P(D_n)                              (1)

  Because θ does not appear in the denominator, we can rewrite (1) as

            = argmax_θ  P(D_n | θ) P(θ)                                       (2)

  Taking the log of eqn (2),

            = argmax_θ  log P(D_n | θ) + log P(θ)                             (3)

1.2 Example: Linear Regression (Ridge Regression)

• One problem with maximum likelihood estimation is that it can result in overfitting. Ridge regression ameliorates this by using MAP estimation with a Gaussian prior (Murphy, 2012). Ridge regression uses L2 regularization, penalizing the sum of squares of the entries w_i.

• Consider a linear regression model with y_i ∼ N(w^T x_i, σ^2) and prior w_i ∼ N(0, λ^2). Then

      w_MAP = argmin_w  ||y − Xw||_2^2 + τ ||w||_2^2                          (4)

  where τ = σ^2 / λ^2. In the previous lecture we saw how to solve eqn (4) and obtain a closed-form solution.

• Why do we need regularization? The regularization term τ ||w||_2^2 penalizes all the w_i, i.e. it shrinks them toward zero. This reduces the influence of any single feature with a large weight.

• Bias/variance tradeoff in regression: mean squared error = variance + bias^2. This is the bias-variance tradeoff: it can be wise to use a biased estimator as long as it reduces variance, assuming our goal is to minimize squared error. Regularization reduces variance at the cost of some bias, which in practice often improves performance.

2. L2 Regularization

• Equation (4),

      w_MAP = argmin_w  ||y − Xw||_2^2 + τ ||w||_2^2,

  can be re-written in constrained form: there exists a λ such that it is equivalent to

      min_w  ||y − Xw||_2^2    subject to    ||w||_2^2 ≤ λ                    (5)

  (We can also read this as an inductive bias: linear predictions with small weights.)

• In Figure 1, "minimize cost" refers to the unregularized problem. The solution to the constrained problem in (5) is given by the intersection of the cost contours and the circle of radius λ, shown as "minimize cost + penalty" in the figure.

  Figure 1: L2 regularization (Raschka, 2017).
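For concreteness, the closed form referred to in Section 1.2 is w_MAP = (X^T X + τI)^{-1} X^T y for the objective in eqn (4). The sketch below is not from the lecture: it assumes NumPy, and the function name ridge_map and the synthetic data are illustrative choices.

```python
import numpy as np

def ridge_map(X, y, tau):
    """MAP / ridge estimate: argmin_w ||y - Xw||_2^2 + tau * ||w||_2^2,
    computed via the closed form (X^T X + tau * I)^{-1} X^T y."""
    d = X.shape[1]
    # Solve the regularized normal equations rather than forming an explicit inverse.
    return np.linalg.solve(X.T @ X + tau * np.eye(d), X.T @ y)

# Toy usage with synthetic data (illustration only).
rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.normal(size=(n, d))
w_true = np.array([2.0, 0.0, -1.0, 0.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

print(ridge_map(X, y, tau=1.0))   # larger tau shrinks the weights more strongly
```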
3. Variable Selection

Introduction

• Variable selection means selecting which variables to include in our model (rather than some sort of selection which is itself variable). As such, it is a special case of model selection. People tend to use the phrase "variable selection" when the competing models differ on which variables should be included, but agree on the mathematical form that will be used for each variable (Shalizi, 2015).

  – Let's examine a model designed to predict the temperature on a phone. Such a model would have y = temperature, with all the sensors available to us as x = {light sensor, sound sensor, temperature sensor}.

  – We can try to use all the features in x to predict the temperature with some function f : x → y.

  – Can we do better at predicting temperature? Practically, we only need the temperature sensor. Due to noise and other effects in the data, having multiple irrelevant features might result in poor predictions.

  – Example: we can do better by performing variable selection with the L0 norm. Rewriting eqn (4) as

        argmin_w  ||y − Xw||_2^2 + τ ||w||_0                                  (6)

    This is called L0 regularization. We have converted the discrete optimization problem into a continuous one (over w ∈ R^D); however, the l0 pseudo-norm makes the objective very non-smooth, so it is still hard to optimize. We will discuss different solutions to this in the rest of the lecture (Murphy, 2012).

  – Equation (6) can be rewritten as

        min_w  ||y − Xw||_2^2    subject to    ||w||_0 ≤ λ

4. Forward/Reverse Selection

• Forward selection starts with zero features, greedily adds the feature that most improves prediction accuracy at each step, and stops once it has added the desired number of features.

• Let's define the residual sum of squares (RSS) for a given model:

      RSS(w)   = ||y − Xw||_2^2

      RSS(w_v) = ||y − X w_v||_2^2                                            (7)
               = ||y − X_v w_v||_2^2                                          (8)

  where v is a set of indices and

      (w_v)_i = w_i  if i ∈ v,    0 otherwise.

  Writing X = [X_1  X_2  ...  X_d] column-wise,

      X_v = [X_i : i ∈ v]                                                     (9)

  is the submatrix containing only the columns indexed by v.

• Example: v = {1, 3, 5}, so X_v = [X_1, X_3, X_5].

• Checking the dimensions of each element in eqn (8): w_v has size |v| × 1, X_v has size n × |v|, and y has size n × 1.

• Algorithm (forward selection): initialize V = ∅ and S = [d] = {1, 2, ..., d}.

      While |V| < λ:
          k* = argmin_{k ∈ S} RSS(w_{V ∪ {k}})
          V = V ∪ {k*}
          S = S \ {k*}

  where λ < d is the number of features we want to keep; at each step we move the selected feature from S into V. This algorithm is a greedy heuristic for the combinatorial problem described in eqn (6) (a code sketch appears at the end of these notes).

• Reverse selection: backward selection starts with all variables in the model and deletes the worst one at each step (Murphy, 2012). Initialize V = [d] and S = ∅.

      While |V| > λ:
          k* = argmin_{k ∈ V} RSS(w_{V \ {k}})
          V = V \ {k*}

• Problem with forward selection: every time we select a new variable we must solve the least squares problem again, which can be computationally expensive depending on the size of the data. It also misses cases where two variables together have a much larger effect than either does individually.

• Closest convex regularizer to ||w||_0: the L1 norm, ||w||_1 = Σ_{i=1}^d |w_i|. This can be represented geometrically as in Figure 2.

  Figure 2: Illustration of L1 regularization of a least squares problem (lasso). Based on Figure 3.12 of Hastie et al. (2001).

• The intersection of the L1 ball with the error contours around ŵ gives the vector ŵ*, yielding a new convex optimization problem:

      min_w  RSS(w) + λ ||w||_1                                               (10)

• It should be geometrically clear that as we relax the constraint κ (where ||w||_1 ≤ κ), we grow the L1 ball until it meets the objective; the corners of the ball are more likely to intersect the ellipse than one of the sides, especially in high dimensions, because the corners stick out more. The corners correspond to sparse solutions, which lie on the coordinate axes (Murphy, 2012).

• NOTE: In eqn (10), the term λ ||w||_1 is not differentiable at zero (Figure 3). For a single variable, the L1 norm is the sum
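Returning to the forward-selection algorithm of Section 4, the sketch below is one way to implement the greedy heuristic; it is an illustration rather than the lecture's code. The name forward_selection and the parameter max_features (playing the role of the budget λ) are assumptions, and each candidate feature is scored by the RSS of a least-squares refit on the columns chosen so far plus that candidate.

```python
import numpy as np

def rss(X_v, y):
    """Residual sum of squares of the least-squares fit using only the columns X_v."""
    w_v, _, _, _ = np.linalg.lstsq(X_v, y, rcond=None)
    r = y - X_v @ w_v
    return r @ r

def forward_selection(X, y, max_features):
    """Greedy forward selection: start with V = {}, repeatedly add the feature whose
    inclusion gives the smallest RSS, and stop when |V| = max_features."""
    V, S = [], list(range(X.shape[1]))
    while len(V) < max_features:
        k_star = min(S, key=lambda k: rss(X[:, V + [k]], y))
        V.append(k_star)
        S.remove(k_star)
    return V

# Toy usage with synthetic data: only features 0 and 3 are truly relevant.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=200)
print(forward_selection(X, y, max_features=2))   # likely [0, 3]
```

Backward (reverse) selection is the mirror image: start with all indices in V and repeatedly remove the feature whose deletion increases the RSS the least.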

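Because the λ||w||_1 term in eqn (10) is not differentiable at zero, one standard numerical approach, used here purely as an illustration of how the penalty produces sparsity, is proximal gradient descent (ISTA), where each step applies a soft-thresholding operator. The function names, the step size, and the synthetic data below are assumptions, not part of the notes.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1: shrink each entry toward zero by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iters=500):
    """Minimize ||y - Xw||_2^2 + lam * ||w||_1 by proximal gradient descent."""
    w = np.zeros(X.shape[1])
    # Step size 1/L, where L is the Lipschitz constant of the smooth term's gradient.
    L = 2.0 * np.linalg.norm(X, 2) ** 2
    for _ in range(n_iters):
        grad = 2.0 * X.T @ (X @ w - y)             # gradient of the squared-error term
        w = soft_threshold(w - grad / L, lam / L)  # gradient step, then shrinkage
    return w

# Toy usage with synthetic data: the L1 penalty drives some weights exactly to zero.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))
w_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0, 0.0, 1.0])
y = X @ w_true + 0.1 * rng.normal(size=100)
print(np.round(lasso_ista(X, y, lam=5.0), 3))
```

The soft-thresholding step is what sets some coordinates exactly to zero, matching the geometric picture in Figure 2 of sparse solutions at the corners of the L1 ball.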
