EECS 281B / STAT 241B: Advanced Topics in Statistical Learning                Spring 2009
Lecture 5 — February 4
Lecturer: Martin Wainwright                Scribe: Jie Tang

Note: These lecture notes are still rough and have only been mildly proofread.

5.1 Announcements

• HW1 due Monday, Feb 9th (no late assignments)
• Minor clarifications on the webpage

5.2 Today

• More on RKHS
• Representer Theorem
• Examples: max-margin classifier, general hard-margin SVM, ridge regression

5.3 Recap

Last time we introduced the concept of a reproducing kernel Hilbert space (RKHS). We discussed the links between positive definite kernel functions and Hilbert spaces, and gave without proof a theorem linking every RKHS with a kernel function $K$.

A reproducing kernel Hilbert space (RKHS) is a Hilbert space $\mathcal{H}$ of functions $\{f : X \to \mathbb{R}\}$ such that for every $x \in X$ there exists $R_x \in \mathcal{H}$ with $\langle R_x, f \rangle = f(x)$ for all $f \in \mathcal{H}$. That is, for every $x \in X$ there exists a function $R_x$, called the representer of evaluation, such that we can evaluate $f$ at $x$ by taking the inner product in $\mathcal{H}$ with the representer of $x$.

Examples of RKHSs include finite-dimensional spaces and Sobolev spaces (see last lecture).

5.4 Correspondence between RKHS and PSD kernels

Theorem 5.1. To any RKHS there corresponds a positive semidefinite (PSD) kernel function. Conversely, given any PSD kernel we can construct an RKHS such that $R_x(\cdot) = K(\cdot, x)$.

Note: The kernel function is unique, though it requires more work to prove this.

Proof: Given an RKHS, define a kernel function via $K(x, y) := \langle R_x, R_y \rangle_{\mathcal{H}}$ for all $x, y \in X$. (By the definition of an RKHS, there is a representer $R_x$ for every $x \in X$.) This is a valid kernel function only if it is positive semidefinite: given any $x_1, \ldots, x_n \in X$ and any $a \in \mathbb{R}^n$, we need to show $a^\top K a \ge 0$, where $K_{ij} = K(x_i, x_j)$. Expanding out the quadratic form, we have
\begin{align}
a^\top K a &= \sum_{i,j=1}^n a_i a_j K(x_i, x_j) \tag{5.1}\\
&= \sum_{i,j=1}^n a_i a_j \langle R_{x_i}, R_{x_j} \rangle_{\mathcal{H}} \tag{5.2}\\
&= \Big\| \sum_{i=1}^n a_i R_{x_i} \Big\|_{\mathcal{H}}^2 \;\ge\; 0. \tag{5.3}
\end{align}

Conversely, suppose we have a PSD kernel $K$. We must construct a space such that $K(\cdot, x)$ is the representer (i.e., it has the reproducing property) and such that the space is a Hilbert space (complete).

Define a linear space of functions:
\[
L = \mathrm{span}\{K(\cdot, x) \mid x \in X\} = \Big\{ \sum_{i=1}^n a_i K(\cdot, x_i) \;\Big|\; n \in \mathbb{N},\; a \in \mathbb{R}^n,\; \{x_1, \ldots, x_n\} \subset X \Big\}. \tag{5.5}
\]

Define an inner product on this space:
\[
\Big\langle \sum_{i=1}^n a_i K(\cdot, x_i), \; \sum_{j=1}^m b_j K(\cdot, y_j) \Big\rangle = \sum_{i,j} a_i b_j K(x_i, y_j). \tag{5.6}
\]
Because every kernel matrix is positive semidefinite, $\langle f, f \rangle \ge 0$ for each $f \in L$, so this inner product is positive semidefinite.

Next, we check that $K(\cdot, x)$ has the representer property: for $f = \sum_{j=1}^m b_j K(\cdot, y_j)$,
\[
\langle K(\cdot, x), f(\cdot) \rangle = \Big\langle K(\cdot, x), \sum_{j=1}^m b_j K(\cdot, y_j) \Big\rangle = \sum_{j=1}^m b_j K(x, y_j) = f(x). \tag{5.7}
\]

Finally, we must make sure the space is complete, i.e., that it is actually a Hilbert space. Let $(f_n)$ be a Cauchy sequence in $L$, i.e., for every $\epsilon > 0$ there exists $N(\epsilon)$ such that $\|f_n - f_m\|_{\mathcal{H}} < \epsilon$ for all $n, m \ge N(\epsilon)$. For any $x \in X$,
\[
|f_n(x) - f_m(x)| = |\langle K(\cdot, x), f_n - f_m \rangle_{\mathcal{H}}| \le \|K(\cdot, x)\|_{\mathcal{H}} \, \|f_n - f_m\|_{\mathcal{H}},
\]
where the last step follows by Cauchy–Schwarz. Since $\|f_n - f_m\|_{\mathcal{H}} < \epsilon$, we have $|f_n(x) - f_m(x)| < \epsilon$, so $f_n(x)$ converges to some limit $f^*(x)$ for every $x \in X$. (This is a special property of the representer of evaluation: in general, a sequence of functions can converge in norm without converging pointwise.)

To ensure that our Hilbert space is complete, we add all such limits $f^*$ to the space. Our final result is
\[
\mathcal{H} = \overline{\mathrm{span}\{K(\cdot, x) \mid x \in X\}}, \tag{5.8}
\]
where the bar denotes closure with respect to the norm induced by the inner product. $\square$
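As a quick numerical sanity check on the quadratic-form argument in (5.1)–(5.3), here is a minimal sketch, assuming a Gaussian (RBF) kernel purely for illustration: it builds the Gram matrix $K_{ij} = K(x_i, x_j)$ on some sample points and verifies that $a^\top K a \ge 0$ for an arbitrary coefficient vector $a$.

```python
import numpy as np

def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel K(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    diff = x - y
    return np.exp(-np.dot(diff, diff) / (2.0 * bandwidth ** 2))

# Sample points x_1, ..., x_n and an arbitrary coefficient vector a.
rng = np.random.default_rng(0)
n, d = 20, 3
X = rng.normal(size=(n, d))
a = rng.normal(size=n)

# Gram matrix K_ij = K(x_i, x_j), as in the proof of Theorem 5.1.
K = np.array([[rbf_kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

# The quadratic form a^T K a equals || sum_i a_i R_{x_i} ||_H^2 and must be >= 0.
quad_form = a @ K @ a
print(f"a^T K a = {quad_form:.6f}")

# Equivalently, all eigenvalues of the Gram matrix are >= 0 (up to round-off).
print("smallest eigenvalue:", np.linalg.eigvalsh(K).min())
```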
5.5 Applications of RKHS

How do we use the RKHS machinery to develop classifiers or other statistical estimators?

5.5.1 Max-margin classifiers

Consider a generalization of hard-margin SVMs. Recall that to train an SVM we solve the following quadratic program:
\[
\min_{\theta} \; \frac{1}{2}\|\theta\|_2^2 \quad \text{s.t.} \quad y_i \langle \theta, x_i \rangle \ge 1 \quad \forall i = 1, \ldots, n.
\]
Assuming linearly separable training data, this quadratic program maximizes the margin between the decision boundary and the nearest training examples; the resulting classifiers are called max-margin classifiers. Maximizing the margin is appealing because it controls the expected risk of classifying unseen data.

Note: we can also introduce slack variables to account for nonseparable data; see homework problem 1.5.

We generalize this algorithm by performing the optimization over an RKHS $\mathcal{H}$:
\[
\min_{f \in \mathcal{H}} \; \frac{1}{2}\|f\|_{\mathcal{H}}^2 \quad \text{s.t.} \quad y_i f(x_i) \ge 1 \quad \forall i = 1, \ldots, n.
\]
The classification constraint becomes $y_i f(x_i) \ge 1$ for some function $f$ in the RKHS associated with our kernel, and the loss for classification is given by
\[
L(x_i, y_i, f(x_i)) = \sum_{i=1}^n \mathbb{I}\big(y_i f(x_i) \ge 1\big).
\]
Instead of just linear functions, we can now use, e.g., polynomial kernels, so richer decision boundaries can be fit by this classifier. However, more powerful classifiers also run the risk of overfitting.

5.5.2 Lagrangian duality

In the original SVM, we can always find a solution of the form $\theta = \sum_i \alpha_i y_i x_i$, where $\alpha \in \mathbb{R}^n$ comes from the dual problem. In the generalized SVM algorithm, the space we are optimizing over could be infinite-dimensional. But we can show (via the representer theorem, next time) that the solution always takes the form
\[
f(\cdot) = \sum_{i=1}^n \alpha_i y_i K(\cdot, x_i).
\]
Note: In many settings, the $\alpha_i$ are likely to be sparse, i.e., to contain many zeroes. Only the data points $x_i$ with nonzero $\alpha_i$ are involved in the computation: these are the support vectors.

5.6 Representer Theorem

Theorem 5.2. Let $\Omega : [0, \infty) \to \mathbb{R}$ be strictly increasing and let $\ell : (X \times Y \times \mathbb{R})^n \to \mathbb{R} \cup \{\infty\}$ be a loss function. Consider
\[
\min_{f \in \mathcal{H}} \; \ell\big((x_1, y_1, f(x_1)), \ldots, (x_n, y_n, f(x_n))\big) + \lambda_n \Omega(\|f\|_{\mathcal{H}}^2),
\]
where $\lambda_n > 0$ (think of $\Omega$ as the identity; $Y = \{-1, +1\}$ in classification). Here $\Omega$ is a regularization operator which puts a penalty on more complicated functions in our space. Then any optimal solution has the form
\[
f(\cdot) = \sum_{i=1}^n \alpha_i K(\cdot, x_i). \tag{5.9}
\]
Intuitively, the loss function only depends on the observed data points, so the optimal solution should only depend on kernel functions centered at the observed data points.

Proof: (Due to Kimeldorf and Wahba in the 1970s, who proved it for kernel ridge regression; later, Smola and Schölkopf derived a more general version using the same ideas.)

Any $f \in \mathcal{H}$ can be written as
\[
f = \sum_{i=1}^n \alpha_i K(\cdot, x_i) + f_\perp,
\]
where $f_\perp \in V^\perp$ and $V = \mathrm{span}\{K(\cdot, x_i),\; i = 1, \ldots, n\}$. We will show that the $f_\perp$ component is 0.

By the representer property of the RKHS, for all $j = 1, \ldots, n$,
\[
f(x_j) = \langle K(\cdot, x_j), f \rangle_{\mathcal{H}} = \sum_{i=1}^n \alpha_i K(x_j, x_i) + \langle K(\cdot, x_j), f_\perp \rangle_{\mathcal{H}}.
\]
Note that $K(\cdot, x_j) \in V$ and $f_\perp \in V^\perp$, so their inner product is 0 and the second term vanishes; hence the loss does not depend on $f_\perp$. By the Pythagorean theorem,
\[
\Omega(\|f\|_{\mathcal{H}}^2) = \Omega\Big(\Big\|\sum_{i=1}^n \alpha_i K(\cdot, x_i)\Big\|_{\mathcal{H}}^2 + \|f_\perp\|_{\mathcal{H}}^2\Big),
\]
which, since $\Omega$ is strictly increasing, is minimized by taking $f_\perp = 0$.
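To make the theorem concrete, here is a minimal kernel ridge regression sketch in the Kimeldorf–Wahba setting (squared-error loss, $\Omega$ the identity), assuming a Gaussian (RBF) kernel and an unnormalized regularization weight $\lambda$ purely for illustration. Substituting the representer form $f(\cdot) = \sum_i \alpha_i K(\cdot, x_i)$ into $\sum_i (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{H}}^2$ reduces the infinite-dimensional problem to the finite-dimensional linear system $(K + \lambda I)\alpha = y$.

```python
import numpy as np

def rbf_gram(A, B, bandwidth=1.0):
    """Gram matrix K[i, j] = exp(-||A_i - B_j||^2 / (2 * bandwidth^2))."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

# Toy 1-D regression data: y = sin(x) plus noise.
rng = np.random.default_rng(1)
X_train = rng.uniform(-3.0, 3.0, size=(40, 1))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=40)

lam = 0.1  # regularization weight lambda
K = rbf_gram(X_train, X_train)

# Representer theorem: the minimizer is f(.) = sum_i alpha_i K(., x_i);
# for squared-error loss the coefficients solve (K + lam I) alpha = y.
alpha = np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)

# Evaluate f at new points via the same kernel expansion.
X_test = np.linspace(-3.0, 3.0, 5).reshape(-1, 1)
f_test = rbf_gram(X_test, X_train) @ alpha
print(np.c_[X_test[:, 0], f_test, np.sin(X_test[:, 0])])
```

The prediction at a new point $x$ is $f(x) = \sum_i \alpha_i K(x, x_i)$, exactly the expansion guaranteed by (5.9); only the training points appear in the solution.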

