6.867 Machine Learning
Problem Set 4
Due Friday 11/7

Please address all questions and comments about this problem set to [email protected].

You will need to use MATLAB for some of the problems, but most of the code is provided. If you are not familiar with MATLAB, please consult http://www.ai.mit.edu/courses/6.867/matlab.html and the links therein.

Problem 1: Regularized Least-Squares Feature Selection

In this problem we consider a regularized approach to feature selection in a simple regression context. Suppose we have training inputs X = (x_1, ..., x_n)' and corresponding outputs y = (y_1, ..., y_n)', where y_i ∈ R. We train a linear predictor ŷ(x; w) = w'φ(x) based on a collection of M features φ(x) = (φ_1(x), ..., φ_M(x))'. The estimation criterion is the following regularized least-squares objective:

J(w; \lambda) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2} \bigl( y_i - w'\phi(x_i) \bigr)^2 + \lambda \|w\|_1    (1)

where ‖w‖_1 is the l_1 norm

\|w\|_1 = \sum_{k=1}^{M} |w_k|    (2)

We seek the optimal parameters ŵ = ŵ(λ) (functions of λ) that minimize the regularized objective:

\hat{w} = \arg\min_{w \in \mathbb{R}^M} J(w; \lambda)    (3)

The l_1-norm penalty leads to sparse solutions: as λ becomes large, many of the parameters ŵ are forced to zero. The corresponding features are then ignored (irrelevant) in the prediction ŷ = ŵ'φ(x).

To solve this l_1-regularized least-squares problem, we consider a simple coordinate descent approach to optimization. In this approach we adjust one parameter at a time so as to minimize the objective while keeping the remaining parameters fixed:

\hat{w}_k = \arg\min_{w_k} J(w; \lambda)    (4)

By letting k iterate over the set of all indices {1, ..., M}, successively refining single parameters, and repeating this iteration over the parameters many times, the parameter vector asymptotically converges to the global minimum of the convex objective function.

In general, the coordinate descent method requires a subroutine to perform line minimization (minimization along the chosen coordinate). Here we can solve the line minimization in closed form.

The regularization penalty is not smooth, however, so the typical approach of setting derivatives to zero needs to be broken down into cases. While this is fairly simple in our regression context, we follow a slightly more general approach, one that will be useful for solving other problems of this type.

How do we take derivatives of non-differentiable functions? The subdifferential of a convex function f(w) is defined as

\partial f(w) = \{\, s \mid f(w + \Delta) \ge f(w) + s\Delta \ \ \forall \Delta \in \mathbb{R} \,\}    (5)

In other words, the subdifferential is the set of slopes s such that the line through (w, f(w)) with slope s "supports" f, i.e. contains the graph of f in its upper half-space. This is a set-valued generalization of the ordinary derivative and reduces to the ordinary derivative, ∂f(w) = {∂f(w)/∂w}, whenever f is differentiable.

For example, the subdifferential of the absolute value function f(w) = |w| (which is not differentiable at w = 0) is

\partial f(w) = \begin{cases} \{-1\}, & w < 0 \\ [-1, +1], & w = 0 \\ \{+1\}, & w > 0 \end{cases}    (6)

We will use the following result from non-smooth analysis:

Optimality Condition: ŵ is a global minimizer of a convex function f(w) if and only if 0 ∈ ∂f(ŵ).

For example, the optimality condition 0 ∈ ∂f(w) for the absolute value function f(w) = |w| holds if and only if w = 0. Hence, w = 0 is the (unique) global minimizer of |w|.
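As a purely illustrative aside (not part of the assignment; all names below are ours), the following MATLAB fragment visualizes the supporting-line picture behind (5) and (6): it plots |w| together with a few lines through the origin whose slopes are drawn from [-1, +1]. Each such line stays on or below the graph, which is exactly the statement that these slopes belong to ∂f(0).

    % Illustrative sketch only: supporting lines of f(w) = |w| at w = 0.
    w = linspace(-2, 2, 401);
    figure; hold on;
    plot(w, abs(w), 'k', 'LineWidth', 2);      % f(w) = |w|
    for s = [-1, -0.5, 0, 0.5, 1]              % a few slopes from [-1, +1]
        plot(w, s * w, '--');                  % line through (0, f(0)) with slope s
    end
    hold off; grid on;
    xlabel('w'); ylabel('f(w)');
    title('|w| and supporting lines at w = 0');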
Below, we guide you through the analysis to solve the line minimization problem (4).

(1-1) (10pts) Show that the subdifferential of J(w; λ) with respect to parameter w_k is

\partial_{w_k} J(w; \lambda) = (a_k w_k - c_k) + \lambda\, \partial_{w_k} |w_k|    (7)

= \begin{cases} \{a_k w_k - (c_k + \lambda)\}, & w_k < 0 \\ [-c_k - \lambda,\ -c_k + \lambda], & w_k = 0 \\ \{a_k w_k - (c_k - \lambda)\}, & w_k > 0 \end{cases}    (8)

with

a_k = \frac{1}{n} \sum_i \phi_k(x_i)^2    (9)

c_k = \frac{1}{n} \sum_i \phi_k(x_i) \bigl( y_i - w_{-k}'\, \phi_{-k}(x_i) \bigr)    (10)

where w_{-k} (respectively φ_{-k}) denotes the vector of all parameters (features) except for parameter w_k (feature φ_k). (Hint: the subdifferential is given by ∂J(w; 0)/∂w_k + λ ∂_{w_k}|w_k|, since the mean squared term is differentiable.) Observe that the numbers a_k are non-negative (and typically positive) and constant with respect to the parameters. Each c_k depends on all parameters except for the parameter w_k that we wish to update. Can you interpret the coefficient c_k?

(1-2) (10pts) Solve the non-smooth optimality condition for ŵ_k such that 0 ∈ ∂_{w_k}J(ŵ_k). You may find it helpful to consider each of the following cases separately:

(a) c_k < -λ
(b) c_k ∈ [-λ, +λ]
(c) c_k > +λ

In each case, provide a hand-drawn plot of ∂_{w_k}J(w_k) versus w_k and label the zero-crossing ŵ_k. Express ŵ_k as a function of a_k, c_k and λ. Also, provide a hand-drawn plot of ŵ_k versus c_k. What role does the regularization parameter λ play in this context relative to the coefficient c_k?

We have provided you with training and test data in the file reg_least_sq.mat. Load this into MATLAB via load reg_least_sq. You should then find four structures train_small, train_med, train_large and test, each containing an n × m matrix X and an n × 1 vector y. Below, you will write code to implement and test the coordinate descent method for solving the l_1-regularized least-squares problem (skeletons provided). For simplicity, we will only consider linear regression, i.e. set φ(x) = x.

(1-3) (10pts) First, write a MATLAB subroutine based on the skeleton reg_least_sq.m to solve for ŵ given the training data (X, y), the regularization parameter λ, and an initial guess w_0 to seed the coordinate descent algorithm. Your procedure should repeatedly iterate over k ∈ {1, ..., M} and set w_k = ŵ_k (use your earlier result). All but the evaluation of c_k and ŵ_k is already provided in the skeleton reg_least_sq.m. (A rough numerical sanity check for this update is sketched after this problem.)

Test your procedure by evaluating ŵ(1) for any of the training sets. Initialize your procedure with w_0 = 0.

In practice, it is often useful to solve for ŵ(λ) for a range of λ values rather than for just one particular λ. For instance, we typically do not know what value of λ to use, but might instead have some idea of how large a training error is still acceptable or how large the penalty ‖w‖_1 can be. This requires that we try many values of λ. One could also set λ based on cross-validation or other generalization error analysis.

(1-4) (10pts) Complete the skeleton hw4_prob1.m to run your coordinate descent algorithm for a sequence of λ values [.01 : .01 : 2.0]. At λ = 0, we set ŵ(0) = (X'X)^{-1}X'y, the usual least-squares parameter estimate.

Plot each of the following versus λ (a rough plotting outline is also sketched after this problem):

(a) the training error J(ŵ(λ); 0)
(b) the regularization penalty ‖ŵ(λ)‖_1
(c) the minimized objective J(ŵ(λ); λ)
(d) the number of ...
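For (1-2) and (1-3), one possible sanity check (not required by the assignment) is to compare the closed-form update you derive against a brute-force line minimization of J(w; λ) along a single coordinate. In the MATLAB sketch below, the synthetic data, the choice of k, and the helper name update_wk are all assumptions of this sketch rather than part of the provided skeletons.

    % Sanity-check sketch: brute-force line minimization along coordinate k,
    % to be compared against the closed-form w_hat_k from (1-2).
    rng(0);
    n = 50; M = 5; lambda = 0.3; k = 2;                  % arbitrary test configuration
    X = randn(n, M); y = randn(n, 1); w = randn(M, 1);   % synthetic data, phi(x) = x
    notk = [1:k-1, k+1:M];                               % indices of the other coordinates
    a_k = mean(X(:, k).^2);                              % equation (9)
    c_k = mean(X(:, k) .* (y - X(:, notk) * w(notk)));   % equation (10)
    Jk = @(wk) mean(0.5 * (y - X * [w(1:k-1); wk; w(k+1:M)]).^2) ...
         + lambda * (sum(abs(w(notk))) + abs(wk));       % J(w; lambda) along coordinate k
    wk_grid = linspace(-5, 5, 100001);
    [~, idx] = min(arrayfun(Jk, wk_grid));
    fprintf('brute-force minimizer along coordinate %d: %.4f\n', k, wk_grid(idx));
    % fprintf('closed-form w_hat_k: %.4f\n', update_wk(a_k, c_k, lambda));  % your (1-2) result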

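For (1-4), a rough outline of plots (a)-(c) is sketched below. Every name here is a placeholder of this sketch rather than a variable from hw4_prob1.m: X and y stand in for one of the provided training sets, and W is assumed to hold ŵ(λ) in its columns, one column per value in the λ grid, as produced by your coordinate descent routine.

    % Plotting outline only: replace the random X, y, W with the provided data
    % and the solutions produced by your solver.
    lambdas = 0.01:0.01:2.0;                          % the lambda grid from (1-4)
    rng(1); n = 40; M = 6;                            % placeholder sizes
    X = randn(n, M); y = randn(n, 1);                 % placeholder data
    W = randn(M, numel(lambdas));                     % placeholder for the columns w_hat(lambda)
    Y = repmat(y, 1, numel(lambdas));                 % replicate targets across the grid
    train_err = mean(0.5 * (Y - X * W).^2, 1);        % (a) J(w_hat(lambda); 0)
    penalty   = sum(abs(W), 1);                       % (b) ||w_hat(lambda)||_1
    objective = train_err + lambdas .* penalty;       % (c) J(w_hat(lambda); lambda)
    figure;
    plot(lambdas, train_err, lambdas, penalty, lambdas, objective);
    xlabel('\lambda');
    legend('training error', 'regularization penalty', 'minimized objective');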
