MIT 9.520 - Regularized Least Squares


Regularized Least Squares
Charlie Frogner
MIT, 2011
(Slides mostly stolen from Ryan Rifkin (Google).)

Summary
- In RLS, the Tikhonov minimization problem boils down to solving a linear system (and this is good).
- We can compute the solution for each of a bunch of λ's by using the eigendecomposition of the kernel matrix.
- We can compute the leave-one-out error over the whole training set about as cheaply as solving the minimization problem once.
- The linear kernel allows us to do all of this when n ≫ d.

Basics: Data
- Training set: S = {(x_1, y_1), ..., (x_n, y_n)}.
- Inputs: X = {x_1, ..., x_n}.
- Labels: Y = {y_1, ..., y_n}.

Basics: RKHS, Kernel
- RKHS H with a positive semidefinite kernel function K:
  - linear: K(x_i, x_j) = x_i^T x_j
  - polynomial: K(x_i, x_j) = (x_i^T x_j + 1)^d
  - gaussian: K(x_i, x_j) = exp(-||x_i - x_j||^2 / σ^2)
- Define the kernel matrix K to satisfy K_ij = K(x_i, x_j).
- The kernel function with one argument fixed is K_x = K(x, ·).
- Given an arbitrary input x_*, K_{x_*} is a vector whose ith entry is K(x_i, x_*). (So the training set X is assumed.)

The RLS Setup
Goal: find the function f ∈ H that minimizes the weighted sum of the square loss and the RKHS norm

    argmin_{f ∈ H} (1/2) Σ_{i=1}^n (f(x_i) - y_i)^2 + (λ/2) ||f||_H^2.   (1)

- This loss function makes sense for regression. We can also use it for binary classification, where it is less immediately intuitive but works great.
- Also called "ridge regression."

Applying the Representer Theorem
Claim: we can rewrite (1) as

    argmin_{c ∈ R^n} (1/2) ||Y - Kc||_2^2 + (λ/2) ||f||_H^2.

Proof: the representer theorem guarantees that the solution to (1) can be written as

    f(·) = Σ_{j=1}^n c_j K_{x_j}(·)

for some c ∈ R^n. So Kc gives a vector whose ith element is f(x_i):

    f(x_i) = Σ_{j=1}^n c_j K_{x_i}(x_j) = Σ_{j=1}^n c_j K_ij = (K_{i,·}) c.

Applying the Representer Theorem, Part II
Claim: ||f||_H^2 = c^T K c.

Proof: f(·) = Σ_{j=1}^n c_j K_{x_j}(·), so

    ||f||_H^2 = <f, f>_H
              = < Σ_{i=1}^n c_i K_{x_i}, Σ_{j=1}^n c_j K_{x_j} >_H
              = Σ_{i=1}^n Σ_{j=1}^n c_i c_j <K_{x_i}, K_{x_j}>_H
              = Σ_{i=1}^n Σ_{j=1}^n c_i c_j K(x_i, x_j) = c^T K c.

The RLS Solution
Putting it all together, the RLS problem is:

    argmin_{c ∈ R^n} (1/2) ||Y - Kc||_2^2 + (λ/2) c^T K c.

This is convex in c (why?), so we can find its minimum by setting the gradient w.r.t. c to 0:

    -K(Y - Kc) + λKc = 0
    (K + λI)c = Y
    c = (K + λI)^{-1} Y.

We find c by solving a system of linear equations.

The RLS Solution, Comments
- The solution exists and is unique (for λ > 0).
- Define G(λ) = K + λI. (Often λ is clear from context and we write G.)
- The prediction at a new test input x_* is:

    f(x_*) = Σ_{j=1}^n c_j K_{x_j}(x_*) = K_{x_*} c = K_{x_*} G^{-1} Y.

- The use of G^{-1} (or other inverses) is formal only. We do not recommend taking matrix inverses.

Solving RLS, Parameters Fixed
Situation: all hyperparameters fixed.
- We just need to solve a single linear system (K + λI)c = Y.
- The matrix K + λI is symmetric positive definite, so the appropriate algorithm is Cholesky factorization.
- In Matlab, the "slash" operator seems to be using Cholesky, so you can just write c = (K+l*I)\Y, but to be safe (or in Octave), I suggest R = chol(K+l*I); c = (R\(R'\Y));
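
To make the slide's Octave snippet concrete, here is a self-contained sketch of the fixed-λ case; the toy data, the Gaussian kernel, and the particular values of σ and λ are assumptions for illustration, not part of the slides:

    % Sketch: RLS with all hyperparameters fixed.
    % Toy data, Gaussian kernel, sigma, and lambda are illustrative assumptions.
    n = 200; d = 5;
    X = randn(n, d);                      % toy inputs
    Y = sin(X(:,1)) + 0.1*randn(n, 1);    % toy labels
    sigma = 1; lambda = 1e-2;

    % Gaussian kernel matrix: K_ij = exp(-||x_i - x_j||^2 / sigma^2)
    sq = sum(X.^2, 2);
    K = exp(-(sq + sq' - 2*(X*X')) / sigma^2);

    % Solve (K + lambda*I) c = Y via Cholesky, as on the slide
    R = chol(K + lambda*eye(n));          % K + lambda*I = R' * R, R upper triangular
    c = R \ (R' \ Y);

    % Predict at new points x_*: f(x_*) = K_{x_*} c
    Xtest = randn(10, d);
    Ktest = exp(-(sum(Xtest.^2, 2) + sq' - 2*(Xtest*X')) / sigma^2);
    ftest = Ktest * c;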
Solving RLS, Varying λ
Situation: we don't know what λ to use; all other hyperparameters fixed.
- Is there a more efficient method than solving c(λ) = (K + λI)^{-1} Y anew for each λ?
- Form the eigendecomposition K = QΛQ^T, where Λ is diagonal with Λ_ii ≥ 0 and QQ^T = I. Then

    G = K + λI = QΛQ^T + λI = Q(Λ + λI)Q^T,

  which implies that G^{-1} = Q(Λ + λI)^{-1} Q^T.

Solving RLS, Varying λ, Cont'd
- It takes O(n^3) time to solve one (dense) linear system, or to compute the eigendecomposition (the constant is maybe 4x worse).
- Given Q and Λ, we can find c(λ) in O(n^2) time:

    c(λ) = Q(Λ + λI)^{-1} Q^T Y,

  noting that (Λ + λI) is diagonal.
- Finding c(λ) for many λ's is (essentially) free!

Validation
- We showed how to find c(λ) quickly as we vary λ. But how do we decide if a given λ is "good"?
- Simplest idea: use the training set error. Problem: this invariably overfits. Don't do this!
- Other methods are possible, but today we consider validation.
- Validation means checking our function's behavior on points other than the training set.

Types of Validation
- If we have a huge amount of data, we could hold back some percentage of our data (30% is typical), and use this development set to choose hyperparameters.
- More common is k-fold cross-validation, which means a couple of different things:
  - Divide your data into k equal sets S_1, ..., S_k. For each i, train on the other k − 1 sets and test on the ith set.
  - A total of k times, randomly split your data into a training and test set.
- The limit of (the first kind of) k-fold validation is leave-one-out cross-validation.

Leave-One-Out Cross-Validation
- For each data point x_i, build a classifier using the remaining n − 1 data points, and measure the error at x_i.
- Empirically, this seems to be the method of choice when n is small.
- Problem: we have to build n different predictors, on data sets of size n − 1.
- We will now proceed to show that for RLS, obtaining the LOO error is (essentially) free!

Leave-One-Out CV: Notation
- Define S^i to be the data set with the ith point removed:

    S^i = {(x_1, y_1), ..., (x_{i−1}, y_{i−1}), *poof*, (x_{i+1}, y_{i+1}), ..., (x_n, y_n)}.

- The ith leave-one-out value is f_{S^i}(x_i).
- The ith leave-one-out error is y_i − f_{S^i}(x_i).
- Define L_V and L_E to be the vectors of leave-one-out values and errors over the training set.
- ||L_E||_2^2 is considered a good empirical …
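
As a baseline for the LOO discussion above, here is a rough Octave sketch of the brute-force approach: n separate RLS fits on data sets of size n − 1, collecting the vector L_E of leave-one-out errors. The toy data, kernel, and λ are illustrative assumptions, and this is the expensive method the slides set out to improve on, not the "free" one:

    % Sketch: brute-force leave-one-out error for RLS -- n separate solves,
    % each on n-1 points. Toy data, Gaussian kernel, and lambda are assumptions.
    n = 100; d = 5; sigma = 1; lambda = 1e-2;
    X = randn(n, d);
    Y = sin(X(:,1)) + 0.1*randn(n, 1);

    sq = sum(X.^2, 2);
    K = exp(-(sq + sq' - 2*(X*X')) / sigma^2);

    LE = zeros(n, 1);                       % vector of leave-one-out errors
    for i = 1:n
        keep = [1:i-1, i+1:n];              % S^i: drop the ith point
        Ki = K(keep, keep);
        Ri = chol(Ki + lambda*eye(n-1));
        ci = Ri \ (Ri' \ Y(keep));
        fi = K(i, keep) * ci;               % f_{S^i}(x_i), the ith LOO value
        LE(i) = Y(i) - fi;                  % the ith LOO error
    end
    loo_score = norm(LE)^2;                 % ||L_E||_2^2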


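Finally, a sketch that ties the "Solving RLS, Varying λ" and "Types of Validation" slides together: one O(n^3) eigendecomposition of the training kernel, then an O(n^2) computation of c(λ) for each λ on a grid, scored on a held-out development set. The 70/30 split, the λ grid, and the toy data are assumptions for illustration:

    % Sketch: reuse one eigendecomposition of K to sweep lambda, then pick the
    % lambda with the smallest error on a held-out development set.
    % Toy data, Gaussian kernel, the 70/30 split, and the lambda grid are
    % illustrative assumptions.
    n = 200; d = 5; sigma = 1;
    X = randn(n, d);
    Y = sin(X(:,1)) + 0.1*randn(n, 1);

    ntr = round(0.7 * n);
    Xtr = X(1:ntr, :);      Ytr = Y(1:ntr);
    Xdev = X(ntr+1:end, :); Ydev = Y(ntr+1:end);

    sqtr = sum(Xtr.^2, 2);
    Ktr  = exp(-(sqtr + sqtr' - 2*(Xtr*Xtr')) / sigma^2);              % train kernel
    Kdev = exp(-(sum(Xdev.^2, 2) + sqtr' - 2*(Xdev*Xtr')) / sigma^2);  % dev-vs-train kernel

    [Q, L] = eig((Ktr + Ktr') / 2);   % O(n^3), done once; symmetrize against round-off
    ev  = diag(L);
    QtY = Q' * Ytr;                   % reused for every lambda

    lambdas = logspace(-6, 2, 30);
    deverr  = zeros(size(lambdas));
    for k = 1:numel(lambdas)
        c = Q * (QtY ./ (ev + lambdas(k)));       % c(lambda), O(n^2) per lambda
        deverr(k) = mean((Kdev * c - Ydev).^2);   % held-out mean squared error
    end
    [~, best] = min(deverr);
    best_lambda = lambdas(best);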