MIT 9.520 - Sparsity Based Regularization

Sparsity Based Regularization
Lorenzo Rosasco
9.520 Class 12
March 14, 2011


About this class

Goal: To introduce sparsity based regularization, with emphasis on the problem of variable selection; to discuss its connection to sparse approximation; and to describe some of the methods designed to solve such problems.


Plan

Sparsity based regularization, finite dimensional case:
- introduction
- algorithms
- theory


Sparsity Based Regularization?

- Interpretability of the model: a main goal, besides good prediction, is detecting the most discriminative information in the data.
- Data driven representation: one can take a large, redundant set of measurements and then use a data driven selection scheme.
- Compression: it is often desirable to have parsimonious models, that is, models requiring a (possibly very) small number of parameters to be described.

More generally, if the target function is sparse, enforcing sparsity of the solution can be a way to avoid overfitting.


A Useful Example: Biomarker Identification

Setup:
- n patients belonging to 2 groups (say, two different diseases)
- p measurements for each patient, quantifying the expression of p genes

Goal:
- learn a classification rule to predict occurrence of the disease for future patients
- detect which genes are responsible for the disease

High dimensional learning, the p ≫ n paradigm: typically n is on the order of tens and p on the order of thousands.


Some Notation

Let X be the n × p measurement matrix

    X = [ x_1^1  ...  x_1^p ]
        [  ...   ...   ...  ]
        [ x_n^1  ...  x_n^p ]

- n is the number of examples
- p is the number of variables
- we denote by X^j, j = 1, ..., p, the columns of X

For each patient we have a response (output) y ∈ ℝ, or y = ±1. In particular, we are given the labels for the training set, Y = (y_1, y_2, ..., y_n).


Approaches to Variable Selection

So far we still have to define what the "relevant" variables are. Different approaches are based on different ways of specifying what is relevant:

- Filter methods.
- Wrapper methods.
- Embedded methods.

We will focus on the last class of methods. (See "An Introduction to Variable and Feature Selection", Guyon and Elisseeff '03.)


Embedded Methods

The selection procedure is embedded in the training phase.

An intuition: what happens to the generalization properties of empirical risk minimization as we discard variables?
- if we keep all the variables, we probably overfit;
- if we take just a few variables, we are likely to oversmooth (in the limit we have a single-variable classifier).

We are going to discuss this class of methods in detail.


Sparse Linear Model

Suppose the output is a linear combination of the variables:

    f(x) = Σ_{i=1}^p β_i x^i = ⟨β, x⟩

Each coefficient β_i can be seen as a weight on the i-th variable.

Sparsity: we say that a function is sparse if most coefficients in the above expansion are zero.


Solving a BIG Linear System

In vector notation we can write the problem as a linear system of equations,

    Y = Xβ.

The problem is ill-posed.


Sparsity

Define the ℓ0-norm (not a real norm) as

    ‖β‖_0 = #{ i = 1, ..., p | β_i ≠ 0 }.

It is a measure of how "complex" f is and of how many variables are important.
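
To make the notation concrete, here is a minimal numpy sketch (the synthetic data and all variable names are illustrative, not from the slides): it builds a p ≫ n measurement matrix X, a sparse coefficient vector β, the responses Y = Xβ, and evaluates the ℓ0-norm just defined.

    import numpy as np

    rng = np.random.default_rng(0)

    n, p = 20, 1000                     # p >> n, as in the biomarker example
    X = rng.standard_normal((n, p))     # n x p measurement matrix

    beta_true = np.zeros(p)             # sparse target: only 5 active variables
    support = rng.choice(p, size=5, replace=False)
    beta_true[support] = rng.standard_normal(5)

    Y = X @ beta_true                   # the (ill-posed) linear system Y = X beta

    # l0 "norm": the number of nonzero coefficients
    print(np.count_nonzero(beta_true))  # -> 5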
ℓ0 Regularization

If we assume that only a few variables are meaningful, we can look for

    min_{β ∈ ℝ^p} { (1/n) Σ_{j=1}^n V(y_j, ⟨β, x_j⟩) + λ ‖β‖_0 }.

This corresponds to the problem of best subset selection, which is hard: the problem is not computationally feasible ⇒ it is as difficult as trying all possible subsets of variables. Can we find meaningful approximations?


Approximate Solutions

Two main approaches. Approximations exist for various loss functions, usually based on:

1. Convex relaxation.
2. Greedy schemes.

We mostly discuss the first class of methods (and consider the square loss).


Convex Relaxation

A natural approximation to ℓ0 regularization is given by

    (1/n) Σ_{j=1}^n V(y_j, ⟨β, x_j⟩) + λ ‖β‖_1,

where ‖β‖_1 = Σ_{i=1}^p |β_i|. If we choose the square loss,

    (1/n) Σ_{j=1}^n (y_j − ⟨β, x_j⟩)^2 = ‖Y − Xβ‖_n^2,

such a scheme is called the Basis Pursuit or Lasso algorithm.


What Is the Difference with Tikhonov Regularization?

We have seen that Tikhonov regularization is a good way to avoid overfitting. Yet the Lasso provides sparse solutions, while Tikhonov regularization doesn't. Why?


Tikhonov Regularization

We can go back to

    min_{β ∈ ℝ^p} { (1/n) Σ_{j=1}^n V(y_j, ⟨β, x_j⟩) + λ Σ_{i=1}^p |β_i|^2 }.

How about sparsity? ⇒ In general, all the β_i in the solution will be different from zero.


Constrained Minimization

Consider

    min_β Σ_{i=1}^p |β_i|   subject to   ‖Y − Xβ‖_n^2 ≤ R.


Geometry of the Problem

[Figure: the ℓ1 ball of radius R in the (β_1, β_2) plane, compared with the ℓ2 ball; the ℓ1 ball has corners on the coordinate axes, where one of the two coefficients is exactly zero.]


Back to ℓ1 Regularization

We focus on the square loss, so that we now have to solve

    min_{β ∈ ℝ^p} ‖Y − Xβ‖^2 + λ ‖β‖_1.

- Though the problem is no longer hopeless, it is nonlinear.
- The functional is convex but not strictly convex, so the solution is not unique.
- Many optimization approaches are possible (interior point methods, homotopy methods, coordinate descent, ...).
- Using convex analysis tools we can get a simple, yet powerful, iterative algorithm.


An Iterative Thresholding Algorithm

It can be proved that the following iterative algorithm converges to the solution β^λ of ℓ1 regularization as the number of iterations increases. Set β^λ_0 = 0 and let

    β^λ_t = S_λ[ β^λ_{t−1} + τ Xᵀ(Y − X β^λ_{t−1}) ],

where τ is a normalization constant ensuring that τ ‖X‖^2 ≤ 1, and the map S_λ is defined component-wise as

    S_λ(β_i) = β_i + λ/2   if β_i < −λ/2
               0           if |β_i| ≤ λ/2
               β_i − λ/2   if β_i > λ/2

(see Daubechies et al. '05).


Thresholding Function

[Figure: plot of the soft-thresholding function S_λ(β): identically zero on the interval [−λ/2, λ/2], and shifting its argument toward zero by λ/2 outside of it.]


Algorithmic Aspects

    set β^λ_0 = 0
    for t = 1 : t_max
        β^λ_t = S_λ[ β^λ_{t−1} + τ Xᵀ(Y − X β^λ_{t−1}) ]

The algorithm we just described is very easy to implement, but it can be quite slow; speed-ups are possible using 1) adaptive step sizes and 2) continuation methods. The number of iterations t …
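
The iteration above is straightforward to implement for the square loss. Below is a minimal sketch of it, assuming the update rule exactly as written on the slides; the function names and the fixed iteration budget t_max are my own, and the threshold is scaled by the step size τ, which is equivalent to the slides' normalization τ‖X‖^2 ≤ 1 after rescaling X.

    import numpy as np

    def soft_threshold(beta, lam):
        """The map S_lambda from the slides, applied component-wise:
        zero on [-lam/2, lam/2], otherwise shift toward zero by lam/2."""
        return np.sign(beta) * np.maximum(np.abs(beta) - lam / 2.0, 0.0)

    def l1_iterative_thresholding(X, Y, lam, t_max=1000):
        """Iterative thresholding for min ||Y - X beta||^2 + lam ||beta||_1:
        beta_t = S_lam[ beta_{t-1} + tau X^T (Y - X beta_{t-1}) ]."""
        p = X.shape[1]
        tau = 1.0 / np.linalg.norm(X, 2) ** 2   # so that tau ||X||^2 = 1
        beta = np.zeros(p)
        for _ in range(t_max):
            beta = soft_threshold(beta + tau * X.T @ (Y - X @ beta), tau * lam)
        return beta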

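To see the contrast with Tikhonov regularization drawn earlier, a quick usage example continuing the synthetic data from the first sketch (the value of λ is arbitrary, chosen only for illustration): the ℓ1 iteration returns exact zeros, while the closed-form ridge solution is dense.

    # Continuing with X, Y, n, p from the first sketch.
    beta_l1 = l1_iterative_thresholding(X, Y, lam=0.5, t_max=2000)

    # Tikhonov (ridge): closed form of min (1/n)||Y - X b||^2 + lam ||b||_2^2,
    # i.e. solve (X^T X + n lam I) b = X^T Y.
    lam = 0.5
    beta_l2 = np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ Y)

    print(np.count_nonzero(beta_l1))                  # expected: few nonzeros
    print(np.count_nonzero(np.abs(beta_l2) > 1e-12))  # essentially all p nonzero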