6.867 Machine Learning
Problem Set 4
Due Friday 11/7

Please address all questions and comments about this problem set to [email protected].

You will need to use MATLAB for some of the problems, but most of the code is provided. If you are not familiar with MATLAB, please consult http://www.ai.mit.edu/courses/6.867/matlab.html and the links therein.

Problem 1: Regularized Least-Squares Feature Selection

In this problem we consider a regularized approach to feature selection in a simple regression context. Suppose we have training inputs X = (x_1, ..., x_n)' and corresponding outputs y = (y_1, ..., y_n)', where y_i ∈ R. We train a linear predictor ŷ(x; w) = w'φ(x) based on a collection of M features φ(x) = (φ_1(x), ..., φ_M(x))'. The estimation criterion is the following regularized least-squares objective:

J(w; \lambda) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2} \bigl( y_i - w'\phi(x_i) \bigr)^2 + \lambda \|w\|_1    (1)

where ‖w‖_1 is the l_1 norm

\|w\|_1 = \sum_{k=1}^{M} |w_k|    (2)

We seek the optimal parameters ŵ = ŵ(λ) (functions of λ) that minimize the regularized objective:

\hat{w} = \arg\min_{w \in \mathbb{R}^M} J(w; \lambda)    (3)

The l_1-norm penalty leads to sparse solutions: as λ becomes large, many of the parameters ŵ are forced to zero. The corresponding features are then ignored (irrelevant) in the prediction ŷ = ŵ'φ(x).

To solve this l_1-regularized least-squares problem, we consider a simple coordinate descent approach to optimization. In this approach we adjust one parameter at a time so as to minimize the objective while keeping the remaining parameters fixed:

\hat{w}_k = \arg\min_{w_k} J(w; \lambda)    (4)

By letting k iterate over the set of all indices {1, ..., M}, successively refining single parameters, and repeating this iteration over the parameters many times, the parameter vector asymptotically converges to the global minimum of the convex objective function.

In general, the coordinate descent method requires a subroutine to perform line minimization (minimization along the chosen coordinate). Here we can solve the line minimization in closed form.

The regularization penalty is not smooth, however, so the typical approach of setting derivatives to zero needs to be broken down into cases. While this is fairly simple in our regression context, we follow a slightly more general approach, one that will be useful for solving other problems of this type.

How do we take derivatives of non-differentiable functions? The subdifferential of a convex function f(w) is defined as

\partial f(w) = \{\, s \mid f(w + \Delta) \ge f(w) + s\Delta \ \ \forall \Delta \in \mathbb{R} \,\}    (5)

In other words, the subdifferential is the set of slopes s such that the line through (w, f(w)) with slope s "supports" f, i.e. contains the graph of f in its upper half-space. This is a set-valued generalization of the ordinary derivative and reduces to the ordinary derivative, ∂f(w) = {∂f(w)/∂w}, whenever f is differentiable.

For example, the subdifferential of the absolute value function f(w) = |w| (which is not differentiable at w = 0) is

\partial f(w) = \begin{cases} \{-1\}, & w < 0 \\ [-1, +1], & w = 0 \\ \{+1\}, & w > 0 \end{cases}    (6)

We will use the following result from non-smooth analysis:

Optimality Condition: ŵ is a global minimizer of a convex function f(w) if and only if 0 ∈ ∂f(ŵ).

For example, the optimality condition 0 ∈ ∂f(w) for the absolute value function f(w) = |w| holds if and only if w = 0. Hence, w = 0 is the (unique) global minimizer of |w|.
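As a purely illustrative aside (not part of the assignment; all names below are ours), the following MATLAB fragment visualizes the supporting-line picture behind (5) and (6): it plots |w| together with a few lines through the origin whose slopes are drawn from [-1, +1]. Each such line stays on or below the graph, which is exactly the statement that these slopes belong to ∂f(0).

    % Illustrative sketch only: supporting lines of f(w) = |w| at w = 0.
    w = linspace(-2, 2, 401);
    figure; hold on;
    plot(w, abs(w), 'k', 'LineWidth', 2);      % f(w) = |w|
    for s = [-1, -0.5, 0, 0.5, 1]              % a few slopes from [-1, +1]
        plot(w, s * w, '--');                  % line through (0, f(0)) with slope s
    end
    hold off; grid on;
    xlabel('w'); ylabel('f(w)');
    title('|w| and supporting lines at w = 0');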
Below, we guide you through the analysis to solve the line minimization problem (4).

(1-1) (10pts) Show that the subdifferential of J(w; λ) with respect to parameter w_k is

\partial_{w_k} J(w; \lambda) = (a_k w_k - c_k) + \lambda\, \partial_{w_k} |w_k|    (7)

= \begin{cases} \{a_k w_k - (c_k + \lambda)\}, & w_k < 0 \\ [-c_k - \lambda,\ -c_k + \lambda], & w_k = 0 \\ \{a_k w_k - (c_k - \lambda)\}, & w_k > 0 \end{cases}    (8)

with

a_k = \frac{1}{n} \sum_i \phi_k(x_i)^2    (9)

c_k = \frac{1}{n} \sum_i \phi_k(x_i) \bigl( y_i - w_{-k}'\, \phi_{-k}(x_i) \bigr)    (10)

where w_{-k} (respectively φ_{-k}) denotes the vector of all parameters (features) except for parameter w_k (feature φ_k). (Hint: the subdifferential is given by ∂J(w; 0)/∂w_k + λ ∂_{w_k}|w_k|, since the mean squared term is differentiable.) Observe that the numbers a_k are non-negative (and typically positive) and constant with respect to the parameters. Each c_k depends on all parameters except for the parameter w_k that we wish to update. Can you interpret the coefficient c_k?

(1-2) (10pts) Solve the non-smooth optimality condition for ŵ_k such that 0 ∈ ∂_{w_k}J(ŵ_k). You may find it helpful to consider each of the following cases separately:

(a) c_k < -λ
(b) c_k ∈ [-λ, +λ]
(c) c_k > +λ

In each case, provide a hand-drawn plot of ∂_{w_k}J(w_k) versus w_k and label the zero-crossing ŵ_k. Express ŵ_k as a function of a_k, c_k and λ. Also, provide a hand-drawn plot of ŵ_k versus c_k. What role does the regularization parameter λ play in this context relative to the coefficient c_k?

We have provided you with training and test data in the file reg_least_sq.mat. Load this into MATLAB via load reg_least_sq. You should then find four structures train_small, train_med, train_large and test, each containing an n × m matrix X and an n × 1 vector y. Below, you will write code to implement and test the coordinate descent method for solving the l_1-regularized least-squares problem (skeletons provided). For simplicity, we will only consider linear regression, i.e. set φ(x) = x.

(1-3) (10pts) First, write a MATLAB subroutine based on the skeleton reg_least_sq.m to solve for ŵ given the training data (X, y), the regularization parameter λ, and an initial guess w_0 to seed the coordinate descent algorithm. Your procedure should repeatedly iterate over k ∈ {1, ..., M} and set w_k = ŵ_k (use your earlier result). All but the evaluation of c_k and ŵ_k is already provided in the skeleton reg_least_sq.m. (A rough numerical sanity check for this update is sketched after this problem.)

Test your procedure by evaluating ŵ(1) for any of the training sets. Initialize your procedure with w_0 = 0.

In practice, it is often useful to solve for ŵ(λ) for a range of λ values rather than for just one particular λ. For instance, we typically do not know what value of λ to use, but might instead have some idea of how large a training error is still acceptable or how large the penalty ‖w‖_1 can be. This requires that we try many values of λ. One could also set λ based on cross-validation or other generalization error analysis.

(1-4) (10pts) Complete the skeleton hw4_prob1.m to run your coordinate descent algorithm for a sequence of λ values [.01 : .01 : 2.0]. At λ = 0, we set ŵ(0) = (X'X)^{-1}X'y, the usual least-squares parameter estimate.

Plot each of the following versus λ (a rough plotting outline is also sketched after this problem):

(a) the training error J(ŵ(λ); 0)
(b) the regularization penalty ‖ŵ(λ)‖_1
(c) the minimized objective J(ŵ(λ); λ)
(d) the number of ...
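For (1-2) and (1-3), one possible sanity check (not required by the assignment) is to compare the closed-form update you derive against a brute-force line minimization of J(w; λ) along a single coordinate. In the MATLAB sketch below, the synthetic data, the choice of k, and the helper name update_wk are all assumptions of this sketch rather than part of the provided skeletons.

    % Sanity-check sketch: brute-force line minimization along coordinate k,
    % to be compared against the closed-form w_hat_k from (1-2).
    rng(0);
    n = 50; M = 5; lambda = 0.3; k = 2;                  % arbitrary test configuration
    X = randn(n, M); y = randn(n, 1); w = randn(M, 1);   % synthetic data, phi(x) = x
    notk = [1:k-1, k+1:M];                               % indices of the other coordinates
    a_k = mean(X(:, k).^2);                              % equation (9)
    c_k = mean(X(:, k) .* (y - X(:, notk) * w(notk)));   % equation (10)
    Jk = @(wk) mean(0.5 * (y - X * [w(1:k-1); wk; w(k+1:M)]).^2) ...
         + lambda * (sum(abs(w(notk))) + abs(wk));       % J(w; lambda) along coordinate k
    wk_grid = linspace(-5, 5, 100001);
    [~, idx] = min(arrayfun(Jk, wk_grid));
    fprintf('brute-force minimizer along coordinate %d: %.4f\n', k, wk_grid(idx));
    % fprintf('closed-form w_hat_k: %.4f\n', update_wk(a_k, c_k, lambda));  % your (1-2) result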

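For (1-4), a rough outline of plots (a)-(c) is sketched below. Every name here is a placeholder of this sketch rather than a variable from hw4_prob1.m: X and y stand in for one of the provided training sets, and W is assumed to hold ŵ(λ) in its columns, one column per value in the λ grid, as produced by your coordinate descent routine.

    % Plotting outline only: replace the random X, y, W with the provided data
    % and the solutions produced by your solver.
    lambdas = 0.01:0.01:2.0;                          % the lambda grid from (1-4)
    rng(1); n = 40; M = 6;                            % placeholder sizes
    X = randn(n, M); y = randn(n, 1);                 % placeholder data
    W = randn(M, numel(lambdas));                     % placeholder for the columns w_hat(lambda)
    Y = repmat(y, 1, numel(lambdas));                 % replicate targets across the grid
    train_err = mean(0.5 * (Y - X * W).^2, 1);        % (a) J(w_hat(lambda); 0)
    penalty   = sum(abs(W), 1);                       % (b) ||w_hat(lambda)||_1
    objective = train_err + lambdas .* penalty;       % (c) J(w_hat(lambda); lambda)
    figure;
    plot(lambdas, train_err, lambdas, penalty, lambdas, objective);
    xlabel('\lambda');
    legend('training error', 'regularization penalty', 'minimized objective');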
