MIT 9.520 - Regularized Least Squares

Regularized Least Squares
Ryan M. Rifkin
Honda Research Institute USA, Inc.
Human Intention Understanding Group
2007

Basics: Data

Data points S = {(X_1, Y_1), ..., (X_n, Y_n)}.
We let X simultaneously refer to the set {X_1, ..., X_n} and to the n-by-d matrix whose ith row is X_i^t.

Basics: RKHS, Kernel

An RKHS \mathcal{H} with a positive semidefinite kernel function k, for example:
- linear: k(X_i, X_j) = X_i^t X_j
- polynomial: k(X_i, X_j) = (X_i^t X_j + 1)^d
- gaussian: k(X_i, X_j) = \exp\left(-\frac{\|X_i - X_j\|^2}{\sigma^2}\right)

Define the kernel matrix K to satisfy K_{ij} = k(X_i, X_j).
Abusing notation, allow k to take and produce sets: k(X, X) = K.
Given an arbitrary point X_*, k(X, X_*) is the column vector whose ith entry is k(X_i, X_*).
The linear kernel has special properties, which we discuss in detail later.

The RLS Setup

Goal: find the function f in \mathcal{H} that minimizes the weighted sum of the total square loss and the RKHS norm:

    \min_{f \in \mathcal{H}} \frac{1}{2} \sum_{i=1}^{n} (f(X_i) - Y_i)^2 + \frac{\lambda}{2} \|f\|_K^2.    (1)

Note that in this formulation we minimize the total rather than the average loss; this avoids mucking around with the factor of 1/n, which can be folded into λ.
This loss function "makes sense" for regression. We can also use it for binary classification, where it "makes no sense" but works great.

Applying the Representer Theorem

The representer theorem guarantees that the solution to (1) can be written as

    f(\cdot) = \sum_{i=1}^{n} c_i k(X_i, \cdot)

for some c \in \mathbb{R}^n.
We can therefore rewrite (1) as

    \min_{c \in \mathbb{R}^n} \frac{1}{2} \|Y - Kc\|_2^2 + \frac{\lambda}{2} \|f\|_K^2.

Applying the Representer Theorem, II

Consider a function of the form f(\cdot) = \sum_{i=1}^{n} c_i k(X_i, \cdot). For such a function,

    \|f\|_K^2 = \langle f, f \rangle_K
              = \Big\langle \sum_{i=1}^{n} c_i k(X_i, \cdot), \sum_{j=1}^{n} c_j k(X_j, \cdot) \Big\rangle_K
              = \sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j \langle k(X_i, \cdot), k(X_j, \cdot) \rangle_K
              = \sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j k(X_i, X_j)
              = c^t K c.

The RLS Solution

The objective

    \frac{1}{2} \|Y - Kc\|_2^2 + \frac{\lambda}{2} c^t K c

is clearly convex in c (why?), so we can find its minimum by setting the gradient with respect to c to 0:

    -K(Y - Kc) + \lambda K c = 0
    (K + \lambda I) c = Y
    c = (K + \lambda I)^{-1} Y.

We find c by solving a system of linear equations.

The RLS Solution, Comments

The solution exists and is unique (for λ > 0).
Define G(λ) = K + λI. (Often λ is clear from context and we simply write G.)
The prediction at a new test point X_* is

    f(X_*) = \sum_i c_i k(X_i, X_*) = k(X, X_*)^t c = Y^t G^{-1} k(X, X_*).

The use of G^{-1} (or other inverses) is formal only. We do not recommend taking matrix inverses.

Solving RLS, Parameters Fixed

Situation: all hyperparameters are fixed.
We just need to solve a single linear system (K + λI)c = Y.
The matrix K + λI is symmetric positive definite, so the appropriate algorithm is Cholesky factorization.
In Matlab, the backslash operator appears to use Cholesky here, so you can just write c = (K+l*I)\Y; to be safe (or in Octave), I suggest R = chol(K+l*I); c = (R\(R'\Y));

Solving RLS, Varying λ

Situation: we do not know what λ to use; all other hyperparameters are fixed.
Form the eigendecomposition K = QΛQ^t, where Λ is diagonal with Λ_{ii} ≥ 0 and QQ^t = I. Then

    G = K + \lambda I = Q \Lambda Q^t + \lambda I = Q(\Lambda + \lambda I) Q^t,

which implies G^{-1} = Q(\Lambda + \lambda I)^{-1} Q^t.

Solving RLS, Varying λ, Cont'd

It takes O(n^3) time to solve one (dense) linear system, or to compute the eigendecomposition (the constant is maybe 4x worse). Given Q and Λ, we can find c(λ) in O(n^2) time:

    c(\lambda) = Q(\Lambda + \lambda I)^{-1} Q^t Y,

noting that (Λ + λI) is diagonal.
Finding c(λ) for many values of λ is (essentially) free!
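
The two procedures above translate into just a few lines. The following Octave/Matlab sketch uses illustrative names and synthetic data of my own (X, Y, K, lambda, lambdas, and Xs are not from the slides): a single Cholesky solve for a fixed λ, prediction at new points, and reuse of one eigendecomposition of K to obtain c(λ) over a whole grid of λ values.

    % Sketch only: synthetic data and a linear kernel keep the example self-contained.
    n = 100; d = 5;
    X = randn(n, d);
    Y = randn(n, 1);
    K = X * X';                          % linear kernel: K_ij = X_i^t X_j

    % Fixed lambda: solve (K + lambda*I) c = Y by Cholesky (the matrix is SPD).
    lambda = 0.1;
    R = chol(K + lambda * eye(n));       % upper triangular, R' * R = K + lambda*I
    c = R \ (R' \ Y);                    % two triangular solves; no explicit inverse

    % Prediction at new test points Xs: f(X_*) = k(X, X_*)^t c.
    Xs = randn(10, d);
    f_new = (Xs * X') * c;

    % Varying lambda: one O(n^3) eigendecomposition, then O(n^2) per lambda.
    [Q, L] = eig(K);                     % K = Q * L * Q', Q orthogonal, L diagonal
    dK = diag(L);                        % eigenvalues of K
    QtY = Q' * Y;                        % reused for every lambda
    lambdas = logspace(-4, 2, 25);
    C = zeros(n, numel(lambdas));
    for t = 1:numel(lambdas)
      C(:, t) = Q * (QtY ./ (dK + lambdas(t)));   % c(lambda) = Q (L + lambda*I)^{-1} Q' Y
    end

Each additional λ costs only a diagonal scaling and two matrix-vector products, which is the sense in which extra values of λ are essentially free.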
Validation

We showed how to find c(λ) quickly as we vary λ. But how do we decide whether a given λ is "good"?
Simplest idea: use the training-set error. Problem: this frequently overfits.
Other methods are possible, but today we consider validation.
Validation means checking our function's behavior on points other than the training set.

Types of Validation

If we have a huge amount of data, we could hold back some percentage of it (30% is typical) and use this development set to choose hyperparameters.
More common is k-fold cross-validation, which means a couple of different things:
- Divide your data into k equal sets S_1, ..., S_k. For each i, train on the other k-1 sets and test on the ith set.
- A total of k times, randomly split your data into a training and a test set.
The limit of (the first kind of) k-fold validation is leave-one-out cross-validation.

Leave-One-Out Cross-Validation

For each data point X_i, build a classifier using the remaining n-1 data points, and measure the error at X_i.
Empirically, this seems to be the method of choice when n is small.
Problem: we have to build n different predictors, each on a data set of size n-1.
We will now show that for RLS, obtaining the LOO error is (essentially) free!

Leave-One-Out CV: Notation

Define S^i to be the data set with the ith point removed:

    S^i = {(X_1, Y_1), ..., (X_{i-1}, Y_{i-1}), (X_{i+1}, Y_{i+1}), ..., (X_n, Y_n)}.

The ith leave-one-out value is f_{S^i}(X_i), and the ith leave-one-out error is Y_i - f_{S^i}(X_i).
Define LOOV and LOOE to be the vectors of leave-one-out values and errors over the training set.
\|LOOE\|_2^2 is considered a good empirical proxy for the error on future points, and we often want to choose parameters by minimizing this quantity.

LOOE Derivation, I

Imagine (hallucinate) that we already know f_{S^i}(X_i). Define the vector Y^i via

    Y^i_j = Y_j             for j \ne i,
    Y^i_i = f_{S^i}(X_i).

LOOE Derivation, II

Suppose we solve a Tikhonov problem with Y^i instead of Y as the labels. Then f_{S^i} is the optimizer: for any f in \mathcal{H},

    \frac{1}{2} \sum_{j=1}^{n} (Y^i_j - f(X_j))^2 + \frac{\lambda}{2} \|f\|_K^2
    \ge \frac{1}{2} \sum_{j \ne i} (Y^i_j - f(X_j))^2 + \frac{\lambda}{2} \|f\|_K^2
    \ge \frac{1}{2} \sum_{j \ne i} (Y^i_j - f_{S^i}(X_j))^2 + \frac{\lambda}{2} \|f_{S^i}\|_K^2
    = \frac{1}{2} \sum_{j=1}^{n} (Y^i_j - f_{S^i}(X_j))^2 + \frac{\lambda}{2} \|f_{S^i}\|_K^2.

(The second inequality holds because f_{S^i} minimizes the leave-one-out objective, and the final equality holds because the ith term, (Y^i_i - f_{S^i}(X_i))^2, is zero by the definition of Y^i.)

LOOE Derivation, III

Therefore,

    c^i = G^{-1} Y^i,
    f_{S^i}(X_i) = (K G^{-1} Y^i)_i.

This is circular reasoning so far, because we need to know f_{S^i}(X_i) to form Y^i in the first place.
However, assuming we have already solved RLS for the whole training set and have computed f_S(X) = KG^{-1}Y, we can do something nice.

LOOE Derivation, IV

    f_{S^i}(X_i) - f_S(X_i) = \sum_j (KG^{-1})_{ij} (Y^i_j - Y_j) = (KG^{-1})_{ii} (f_{S^i}(X_i) - Y_i),

so

    f_{S^i}(X_i) = \frac{f_S(X_i) - (KG^{-1})_{ii} Y_i}{1 - (KG^{-1})_{ii}}
                 = \frac{(KG^{-1}Y)_i - (KG^{-1})_{ii} Y_i}{1 - (KG^{-1})_{ii}}.

LOOE Derivation, V

In vector form, with \operatorname{diag}(KG^{-1}) the vector of diagonal entries of KG^{-1} and the division taken entrywise,

    LOOV = \frac{KG^{-1}Y - \operatorname{diag}(KG^{-1}) \odot Y}{1 - \operatorname{diag}(KG^{-1})},
    LOOE = Y - LOOV = \frac{Y - KG^{-1}Y}{1 - \operatorname{diag}(KG^{-1})}.
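
The leave-one-out shortcut just derived is equally short in code. This is a sketch continuing the assumed variables from the previous block (K, Y, n, and lambda are my own illustrative names, not the slides'): one full-data RLS solution yields every leave-one-out value and error.

    % Leave-one-out values and errors from a single full-data RLS solution.
    G = K + lambda * eye(n);
    KGinv = K / G;                       % KG^{-1} = K * inv(K + lambda*I), via a linear solve
    fS = KGinv * Y;                      % full-data predictions f_S(X)
    dg = diag(KGinv);                    % the diagonal entries (KG^{-1})_ii
    LOOV = (fS - dg .* Y) ./ (1 - dg);   % leave-one-out values f_{S^i}(X_i)
    LOOE = Y - LOOV;                     % equivalently (Y - fS) ./ (1 - dg)
    loo_score = sum(LOOE .^ 2);          % ||LOOE||_2^2, the quantity to minimize over lambda

Combined with the eigendecomposition K = QΛQ^t from the earlier block, KG^{-1} = QΛ(Λ + λI)^{-1}Q^t, so KG^{-1}Y and its diagonal, and hence the whole LOO curve over a grid of λ values, cost only O(n^2) per λ after the one-time O(n^3) factorization.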

