Regularized Least Squares
9.520 Class 04, 21 February 2006
Ryan Rifkin

Plan
• Introduction to Regularized Least Squares
• Computation: General RLS
• Large Data Sets: Subset of Regressors
• Computation: Linear RLS

Regression
We have a training set S = {(x_1, y_1), ..., (x_\ell, y_\ell)}. The y_i are real-valued. The goal is to learn a function f to predict the y values associated with new observed x values.

Our Friend Tikhonov Regularization
We pose our regression task as the Tikhonov minimization problem:

f = \arg\min_{f \in \mathcal{H}} \frac{1}{2} \sum_{i=1}^{\ell} V(f(x_i), y_i) + \frac{\lambda}{2} \|f\|_K^2

To fully specify the problem, we need to choose a loss function V and a kernel function K.

The Square Loss
For regression, a natural choice of loss function is the square loss V(f(x), y) = (f(x) - y)^2.
[Figure: the L2 loss (f(x) - y)^2 plotted against y - f(x).]

Substituting In The Square Loss
Using the square loss, our problem becomes

f = \arg\min_{f \in \mathcal{H}} \frac{1}{2} \sum_{i=1}^{\ell} (f(x_i) - y_i)^2 + \frac{\lambda}{2} \|f\|_K^2.

The Return of the Representer Theorem
Theorem. The solution to the Tikhonov regularization problem

\min_{f \in \mathcal{H}} \frac{1}{2} \sum_{i=1}^{\ell} V(y_i, f(x_i)) + \frac{\lambda}{2} \|f\|_K^2

can be written in the form

f = \sum_{i=1}^{\ell} c_i K(x_i, \cdot).

This theorem is exceedingly useful: it says that to solve the Tikhonov regularization problem, we need only find the best function of the form f = \sum_{i=1}^{\ell} c_i K(x_i, \cdot). Put differently, all we have to do is find the c_i.

Applying the Representer Theorem, I
NOTATION ALERT!!! We use the symbol K for the kernel function, and boldface K for the \ell-by-\ell matrix:

K_{ij} \equiv K(x_i, x_j).

Using this definition, consider the output of our function

f = \sum_{i=1}^{\ell} c_i K(x_i, \cdot)

at the training point x_j:

f(x_j) = \sum_{i=1}^{\ell} K(x_i, x_j) c_i = (Kc)_j.

Using the Norm of a "Represented" Function
A function in the RKHS with a finite representation

f = \sum_{i=1}^{\ell} c_i K(x_i, \cdot)

satisfies

\|f\|_K^2 = \left\langle \sum_{i=1}^{\ell} c_i K(x_i, \cdot), \sum_{j=1}^{\ell} c_j K(x_j, \cdot) \right\rangle
          = \sum_{i=1}^{\ell} \sum_{j=1}^{\ell} c_i c_j \langle K(x_i, \cdot), K(x_j, \cdot) \rangle
          = \sum_{i=1}^{\ell} \sum_{j=1}^{\ell} c_i c_j K(x_i, x_j)
          = c^t K c.

The RLS Problem
Substituting, our Tikhonov minimization problem becomes:

\min_{c \in \mathbb{R}^{\ell}} \frac{1}{2} \|Kc - y\|_2^2 + \frac{\lambda}{2} c^t K c.

Solving the Least Squares Problem, I
We are trying to minimize

g(c) = \frac{1}{2} \|Kc - y\|_2^2 + \frac{\lambda}{2} c^t K c.

This is a convex, differentiable function of c, so we can minimize it simply by taking the derivative with respect to c, then setting this derivative to 0:

\frac{\partial g(c)}{\partial c} = K(Kc - y) + \lambda K c.

Solving the Least Squares Problem, II
Setting the derivative to 0,

\frac{\partial g(c)}{\partial c} = K(Kc - y) + \lambda K c = 0
\rightarrow K(Kc) + \lambda K c = K y
"\rightarrow" Kc + \lambda c = y
\rightarrow (K + \lambda I) c = y
\rightarrow c = (K + \lambda I)^{-1} y.

(The quoted arrow is not a true equivalence when K is singular, but any c with (K + \lambda I)c = y also satisfies the preceding line, so it is a minimizer of g.) The matrix K + \lambda I is positive definite and will be well-conditioned if \lambda is not too small.
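As a concrete illustration of the linear system above, here is a minimal NumPy sketch of RLS training and prediction. The Gaussian (RBF) kernel, its bandwidth, the toy data, and the function names are assumptions for illustration, not part of the slides.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # Gaussian kernel K(a, b) = exp(-||a - b||^2 / (2 sigma^2)); an assumed choice of kernel.
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def rls_fit(X, y, lam, sigma=1.0):
    # Solve (K + lambda * I) c = y for the expansion coefficients c.
    K = rbf_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def rls_predict(X_train, c, X_test, sigma=1.0):
    # f(x) = sum_i c_i K(x_i, x), evaluated at each test point.
    return rbf_kernel(X_test, X_train, sigma) @ c

# Toy usage on assumed synthetic data.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
c = rls_fit(X, y, lam=1e-2)
y_hat = rls_predict(X, c, X)
```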
Leave-One-Out Values
Recalling that S = {(x_1, y_1), ..., (x_\ell, y_\ell)}, we define f_S to be the solution to the RLS problem with training set S. We define

S_i = S \setminus \{x_i\} = \{(x_1, y_1), ..., (x_{i-1}, y_{i-1}), (x_{i+1}, y_{i+1}), ..., (x_\ell, y_\ell)\},

the training set with the ith point removed.
We call f_{S_i}(x_i) the ith LOO value, and y_i - f_{S_i}(x_i) the ith LOO error. Let LOOV and LOOE be vectors whose ith entries are the ith LOO value and error.
Key Intuition: if \|LOOE\| is small, we will generalize well.

The Leave-One-Out Formula
Remarkably, for RLS, there is a closed-form formula for LOOE. Defining G(\lambda) = K + \lambda I (so that c = G^{-1} y), we have

LOOE = \frac{G^{-1} y}{\mathrm{diag}(G^{-1})} = \frac{c}{\mathrm{diag}(G^{-1})},

where the division is elementwise. Proof: later, blackboard.

Computing: Naive Approach
Suppose I want to minimize \|LOOE\|, testing p different values of \lambda.
I form K, which takes O(n^2 d) time (I assume d-dimensional data and linear-time kernels throughout).
For each \lambda, I form G, I form G^{-1} (O(n^3) time), and compute c = G^{-1} y and \mathrm{diag}(G^{-1}).
Over p values of \lambda, I will pay O(p n^3) time.
We can do much better...

Computing: Eigendecomposing K
We form the eigendecomposition K = Q \Lambda Q^t, where \Lambda is diagonal with \Lambda_{ii} \ge 0 and Q Q^t = I.
Key point:

G = K + \lambda I = Q \Lambda Q^t + \lambda I = Q(\Lambda + \lambda I) Q^t,

and G^{-1} = Q(\Lambda + \lambda I)^{-1} Q^t.
Forming the eigendecomposition takes O(n^3) time (in practice).

Computing c and LOOE efficiently

c(\lambda) = G(\lambda)^{-1} y = Q(\Lambda + \lambda I)^{-1} Q^t y,

G^{-1}_{ij} = (Q(\Lambda + \lambda I)^{-1} Q^t)_{ij} = \sum_{k=1}^{n} \frac{Q_{ik} Q_{jk}}{\Lambda_{kk} + \lambda}.

Given the eigendecomposition, I can compute c, \mathrm{diag}(G^{-1}), and LOOE in O(n^2) time. Over p values of \lambda, I pay only O(n^3 + p n^2). If p < n, searching for a good \lambda is effectively free!

Nonlinear RLS, Suggested Approach
• 1. Form the eigendecomposition K = Q \Lambda Q^t.
• 2. For each value of \lambda over a logarithmically spaced grid, compute c = Q(\Lambda + \lambda I)^{-1} Q^t y and \mathrm{diag}(G^{-1}) using the formula from the last slide. Form LOOE, a vector whose ith entry is c_i / \mathrm{diag}(G^{-1})_i.
• 3. Choose the \lambda that minimizes \|LOOE\| in some norm (I use L2).
• 4. Given that c, regress a new test point x with f(x) = \sum_i c_i K(x_i, x).

Limits of RLS
RLS has proved very accurate. There are two computational problems:
• Training: We need O(n^2) space (to store K) and O(n^3) time (to eigendecompose K).
• Testing: Testing a new point x takes O(nd) time to compute the n kernel products in f(x) = \sum_i c_i K(x, x_i).
Next class, we will see that an SVM has a sparse solution, which gives us large constant-factor (but important in practice!) improvements for both the training and testing problems.
Can we do better, sticking with RLS?

First Idea: Throw Away Data
Suppose that we throw away all but M of our data points, where M \ll \ell. Then we only need O(M^2 d) time to form our new, smaller kernel matrix, and only O(M^3) time to solve the problem. Great, isn't it?
Well, if we have too much data to begin with (say 1,000,000 points in 3 dimensions), this will work just fine. In general, we will lose accuracy.

Subset of Regressors
Suppose, instead of throwing away data, we restrict our hypothesis space further. Instead of allowing functions of the form

f(x) = \sum_{i=1}^{\ell} c_i K(x_i, x),

we only allow

f(x) = \sum_{i=1}^{M} c_i K(x_i, x),

where M \ll \ell. In other words, we only allow the first M points to have non-zero c_i. However, we still measure the loss at all \ell points.

Subset of Regressors, Cont'd
Define K_{MM} to be the kernel matrix on just the M points we're using to represent our function, and K_{ML} to be the kernel product between those M points and the entire dataset. Then we can derive (homework) that the minimization problem becomes

(K_{ML} K_{LM} + \lambda K_{MM}) c = K_{ML} y,

which is again an M-by-M system.
Various authors have reported good results with this or similar methods, but the jury is still out (class project!). It is sometimes called the Rectangular Method.

λ is Still Free
To solve

(K_{ML} K_{LM} + \lambda K_{MM}) c = K_{ML} y,

consider a Cholesky factorization K_{MM} = G G^t:

(K_{ML} K_{LM} + \lambda K_{MM}) c = K_{ML} y
\rightarrow (K_{ML} K_{LM} + \lambda G G^t) c = K_{ML} y
\rightarrow (K_{ML} K_{LM} + \lambda G G^t) G^{-t} G^t c = K_{ML} y
\rightarrow (K_{ML} K_{LM} G^{-t} + \lambda G) G^t c = K_{ML} y
\rightarrow G (G^{-1} K_{ML} K_{LM} G^{-t} + \lambda I) G^t c = K_{ML} y,

and we use the "standard" RLS free-\lambda algorithm on the inner matrix G^{-1} K_{ML} K_{LM} G^{-t}.
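To make the M-by-M system above concrete, here is a minimal NumPy sketch of subset-of-regressors training and prediction under a fixed \lambda. The function names, the choice of the first M points as representers, and the `kernel` callable (e.g., the rbf_kernel sketched earlier) are illustrative assumptions; the free-\lambda Cholesky trick is not implemented here.

```python
import numpy as np

def sor_fit(X, y, M, lam, kernel):
    # Use the first M points as the representers, as on the slides.
    X_M = X[:M]
    K_ML = kernel(X_M, X)      # M x l block: kernel between representers and all points
    K_MM = kernel(X_M, X_M)    # M x M block: kernel among representers
    # Solve (K_ML K_LM + lambda K_MM) c = K_ML y, with K_LM = K_ML^t.
    A = K_ML @ K_ML.T + lam * K_MM
    c = np.linalg.solve(A, K_ML @ y)
    return X_M, c

def sor_predict(X_M, c, X_test, kernel):
    # f(x) = sum_{i=1}^{M} c_i K(x_i, x)
    return kernel(X_test, X_M) @ c
```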

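Finally, here is a hedged end-to-end sketch of the full-data "Nonlinear RLS, Suggested Approach" recipe above: eigendecompose K once, then reuse it to get c, diag(G^{-1}), and the LOO errors for each \lambda. The logarithmic \lambda grid and the use of numpy.linalg.eigh are my choices for illustration, not prescribed by the slides.

```python
import numpy as np

def rls_loo_sweep(K, y, lambdas):
    # Eigendecompose K once: K = Q diag(Lam) Q^t  (O(n^3)).
    Lam, Q = np.linalg.eigh(K)
    Qty = Q.T @ y
    best = None
    for lam in lambdas:
        inv_eig = 1.0 / (Lam + lam)        # (Lam_kk + lambda)^{-1}
        c = Q @ (inv_eig * Qty)            # c = Q (Lam + lambda I)^{-1} Q^t y
        diag_Ginv = (Q ** 2) @ inv_eig     # diag(G^{-1})_i = sum_k Q_ik^2 / (Lam_kk + lambda)
        looe = c / diag_Ginv               # ith LOO error = c_i / diag(G^{-1})_i
        err = np.linalg.norm(looe)         # ||LOOE||_2
        if best is None or err < best[0]:
            best = (err, lam, c)
    return best  # (smallest ||LOOE||_2, the corresponding lambda, its coefficients c)

# Example sweep over an assumed logarithmically spaced grid:
# err, lam, c = rls_loo_sweep(K, y, np.logspace(-6, 2, 25))
```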
