# ILLINOIS CS 446 - 091917.1 (9 pages)


- Pages: 9
- School: University of Illinois - Urbana
- Course: CS 446 - Machine Learning


**CS 446: Machine Learning, Fall 2017 — Forward/Backward Selection, Variable Selection**

Lecturer: Sanmi Koyejo. Scribe: Dipali Ranjan. Sept 19th, 2017.

## 1 Recap

### 1.1 MAP

The key idea behind Maximum A Posteriori (MAP) estimation is to find the parameter estimate $\theta$ that maximizes the posterior for the given data $D_n = \{(x_i, y_i)\}_{i=1}^n$, i.e., we choose the value that is most probable given the observed data and our prior beliefs. We treat $\theta$ as a random variable:

$$\theta_{MAP} = \arg\max_\theta P(\theta \mid D_n)$$

Since we only maximize, we do not need the full normalized posterior, only its peak. By Bayes' rule,

$$\theta_{MAP} = \arg\max_\theta \frac{P(D_n \mid \theta)\, P(\theta)}{P(D_n)} \tag{1}$$

Because $\theta$ does not appear in the denominator, we can rewrite (1) as

$$\theta_{MAP} = \arg\max_\theta P(D_n \mid \theta)\, P(\theta) \tag{2}$$

Taking the log of (2):

$$\theta_{MAP} = \arg\max_\theta \left[ \log P(D_n \mid \theta) + \log P(\theta) \right] \tag{3}$$

### 1.2 Example: Linear Regression (Ridge Regression)

One problem with maximum likelihood estimation is that it can result in overfitting. Ridge regression ameliorates this problem by using MAP estimation with a Gaussian prior (Murphy, 2012); it uses L2 regularization to penalize the sum of squares of the $w_i$ entries. Consider a linear regression model with $y_i \sim \mathcal{N}(w^T x_i, \sigma^2)$ and prior $w_i \sim \mathcal{N}(0, \tau^2)$:

$$w_{MAP} = \arg\min_w \|y - Xw\|_2^2 + \lambda \|w\|_2^2, \qquad \lambda = \frac{\sigma^2}{\tau^2} \tag{4}$$

In the previous lecture we saw how to solve (4) and obtain a closed-form solution.

**Why do we need regularization?** The regularization term $\lambda\|w\|_2^2$ penalizes all the $w_i$'s, i.e., it shrinks them toward zero. This reduces the effect of any single large feature. (We can also view this as an inductive bias toward linear predictions with small weights.)

**Bias–variance tradeoff in regression:**

$$\text{Mean squared error} = \text{variance} + \text{bias}^2$$

This is called the bias–variance tradeoff. It shows that it might be wise to use a biased estimator so long as it reduces our variance, assuming our goal is to minimize squared error. Regularization reduces variance while increasing bias, and in practice often improves performance.

## 2 L2 Regularization

Equation (4) can be rewritten in constrained form: there exists a $\tau$ such that the solution of (4) solves

$$\min_w \|y - Xw\|_2^2 \quad \text{such that} \quad \|w\|_2^2 \le \tau \tag{5}$$

In Figure 1, "Minimize cost" refers to the problem without regularization. The solution to (5) is given by the intersection of the cost contours and the circle determined by $\tau$, shown as "Minimize cost + penalty" in the figure.

*Figure 1: L2 regularization (Raschka, 2017).*

## 3 Variable Selection

**Introduction.** Variable selection means selecting which variables to include in our model (rather than some sort of selection which is itself variable). As such, it is a special case of model selection. People tend to use the phrase "variable selection" when the competing models differ on which variables should be included, but agree on the mathematical form that will be used for each variable (Shalizi, 2015).

Consider a model designed to predict the temperature on a phone. Such a model would have $y = $ temperature, with all the sensors available to us in $x$: light sensor, sound sensor, temperature sensor, and so on. We can try to use all the features in $x$ to predict the temperature using some function $f(x) \to y$. Can we do better? Practically, we only need the temperature sensor to predict temperature; due to noise and other effects in the data, having multiple irrelevant features might result in poor predictions.

**Example.** We can do better by performing variable selection with the L0 norm. Rewriting (4) as

$$\arg\min_w \|y - Xw\|_2^2 + \lambda \|w\|_0 \tag{6}$$

This is called L0 regularization. We have converted the discrete optimization problem into a continuous one over $w \in \mathbb{R}^D$; however, the $\ell_0$ pseudo-norm makes the objective very non-smooth, so this is still hard to optimize. We will discuss different solutions to this in the rest of the lecture (Murphy, 2012). Equation (6) can also be written in constrained form:

$$\min_w \|y - Xw\|_2^2 \quad \text{such that} \quad \|w\|_0 \le \tau$$

## 4 Forward/Reverse Selection

Forward selection starts with zero features, greedily adds the feature that improves prediction accuracy the most, and stops when it has added the desired number of features.

Let's define the residual sum of squares (RSS) for a given model:

$$RSS(w) = \|y - Xw\|_2^2 \tag{7}$$

$$RSS(w_v) = \|y - X_v w_v\|_2^2 \tag{8}$$

where $v$ is a set of indices and

$$w_{v,i} = \begin{cases} w_i & i \in v \\ 0 & \text{otherwise} \end{cases}, \qquad X = [X_1\ X_2\ \cdots\ X_i\ \cdots\ X_d], \qquad X_v = [X_i : i \in v] \tag{9}$$

Example: $v = \{1, 3, 5\}$ gives $X_v = [X_1\ X_3\ X_5]$.

Checking the dimensions of each element in (8): $w_v$ has size $|v| \times 1$, $X_v$ has size $n \times |v|$, and $y$ has size $n \times 1$.

**Algorithm (forward selection).**

- Initialize $V = \emptyset$, $S = \{1, 2, \ldots, d\}$.
- While $|V| \le \tau$ (where $\tau \le d$):
  - $k^* = \arg\min_{k \in S} RSS(w_{V \cup \{k\}})$
  - $V \leftarrow V \cup \{k^*\}$, $\quad S \leftarrow S \setminus \{k^*\}$

We move the selected features from $S$ into $V$. This algorithm is a greedy heuristic for the combinatorial problem described in (6).

**Reverse selection.** Backward selection starts with all variables in the model and then deletes the worst one at each step (Murphy, 2012):

- Initialize $V = \{1, \ldots, d\}$, $S = \emptyset$.
- While $|V| \ge \tau$:
  - $k^* = \arg\min_{k \in V} RSS(w_{V \setminus \{k\}})$
  - $V \leftarrow V \setminus \{k^*\}$

**Problem with forward selection.** Every time we select a new variable we solve the least-squares problem again, and depending on the size of the data this can be computationally expensive. It also fails to catch cases where two variables together have a bigger effect than either does individually.

## 5 Closest Convex Regularizer to $\|w\|_0$

The $\ell_1$ norm can be written as $\|w\|_1 = \sum_{i=1}^d |w_i|$. This can be represented geometrically as in Figure 2.

*Figure 2: Illustration of L1 regularization of a least-squares problem (lasso). Based on Figure 3.12 of Hastie et al. (2001).*

The intersection of the $\ell_1$ ball and the contours of the objective gives us the vector $w$, yielding a new convex optimization problem:

$$\min_w RSS(w) + \lambda \|w\|_1 \tag{10}$$

It should be geometrically clear that as we relax the constraint $\|w\|_1 \le \tau$, we grow the $\ell_1$ ball until it meets the objective; the corners of the ball are more likely to intersect the ellipse than one of the sides, especially in high dimensions, because the corners "stick out" more. The corners correspond to sparse solutions, which lie on the coordinate axes (Murphy, 2012).

**NOTE:** In (10), the term $\|w\|_1$ is not differentiable at zero (Figure 3). For a single variable, the L1 norm is just the absolute value.

*Figure 3: Loss function for L1.*

## 6 Proximal Gradient Descent

Consider a convex objective of the form

$$\min_w f(w) + h(w) \tag{11}$$

where $f(w)$, representing the loss, is convex and differentiable, and $h(w)$, representing the regularizer, is convex but not necessarily differentiable. If $h$ were also differentiable, we could solve (11) with ordinary gradient descent. We …
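The closed-form ridge solution mentioned in the recap (eqn 4) can be sketched via the normal equations. This is a minimal illustration, not from the lecture itself; the data, the choice $\lambda = 1$, and the function name are assumptions for the example.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Minimize ||y - Xw||_2^2 + lam * ||w||_2^2 by solving (X^T X + lam I) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Synthetic data (illustrative): 100 points, 5 features, small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = np.array([2.0, -1.0, 0.0, 0.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

w_hat = ridge_closed_form(X, y, lam=1.0)
```

With mild regularization and ample data, `w_hat` lands close to `w_true`, just slightly shrunk toward zero, which is exactly the bias the bias–variance discussion above trades for lower variance.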
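The greedy forward-selection loop from Section 4 can be sketched as follows. The helper names and the stopping budget `tau` are illustrative; each candidate is scored by refitting least squares on the enlarged column set, which is exactly the repeated-refit cost the lecture warns about.

```python
import numpy as np

def rss(X_v, y):
    """RSS of the least-squares fit restricted to the columns in X_v (eqn 8)."""
    w_v, *_ = np.linalg.lstsq(X_v, y, rcond=None)
    return float(np.sum((y - X_v @ w_v) ** 2))

def forward_selection(X, y, tau):
    """Greedily add the feature whose inclusion most reduces RSS, up to tau features."""
    d = X.shape[1]
    V, S = [], set(range(d))
    while len(V) < tau and S:
        # k* = argmin_{k in S} RSS(w_{V ∪ {k}})
        k_star = min(S, key=lambda k: rss(X[:, V + [k]], y))
        V.append(k_star)
        S.remove(k_star)
    return V

# Illustrative data where only features 0 and 2 matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.1 * rng.normal(size=200)
selected = forward_selection(X, y, tau=2)
```

Backward selection follows the same pattern with the sets reversed: start from all columns and repeatedly drop the one whose removal increases RSS the least.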
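The notes cut off just as proximal gradient descent is introduced. A standard instantiation for the lasso objective (eqn 10), taking $f(w) = \|y - Xw\|_2^2$ and $h(w) = \lambda\|w\|_1$, uses the soft-thresholding operator as the proximal step; this sketch (ISTA) assumes that setup and a fixed step size $1/L$, none of which is confirmed by the truncated text.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1: shrink each entry toward 0 by t, clipping at 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(X, y, lam, n_iters=500):
    """Proximal gradient descent for min_w ||y - Xw||_2^2 + lam * ||w||_1."""
    # grad f(w) = 2 X^T (Xw - y); its Lipschitz constant is 2 * sigma_max(X)^2.
    L = 2.0 * np.linalg.norm(X, ord=2) ** 2
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = 2.0 * X.T @ (X @ w - y)
        w = soft_threshold(w - grad / L, lam / L)  # gradient step on f, prox step on h
    return w

# Illustrative sparse recovery: only coordinates 0 and 3 are nonzero.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
w_true = np.array([2.0, 0.0, 0.0, -1.0, 0.0])
y = X @ w_true + 0.05 * rng.normal(size=200)
w_lasso = ista(X, y, lam=10.0)
```

Note how the non-smoothness of $\|w\|_1$ at zero, flagged in the NOTE above, is what lets the prox step set irrelevant coordinates to exactly zero rather than merely shrinking them, as L2 regularization would.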
