CS281B/Stat241B: Advanced Topics in Learning & Decision Making
The Representer Theorem
Lecturer: Michael I. Jordan    Scribes: Xiaofeng Ren

1 Addendum on the Gaussian kernel

As covered in a previous lecture, One-Class SVM classification aims at novelty/outlier detection in high-dimensional spaces. The strategy there is to find a hyperplane in the feature space such that the observed data is separated as far as possible from the origin. Here is an intuition.

The most commonly used kernel in practice is the Gaussian kernel, K(x, x′) = exp(−½‖x − x′‖²). We can think of the feature map

    Φ(x) : x ↦ K(·, x)

as mapping a point x to a Gaussian bump centered at x.

Now let us look more closely at the functions K(·, x) in the feature space. What is the "length" of a "vector" Φ(x) for an arbitrary x?

    ‖Φ(x)‖²_H = ⟨K(·, x), K(·, x)⟩_H = K(x, x) = 1,

by the reproducing property of H. And what is the angle between any two vectors Φ(x) and Φ(x′)?

    cos(α) = ⟨Φ(x), Φ(x′)⟩_H = ⟨K(·, x), K(·, x′)⟩_H = K(x, x′) ≥ 0.

The two facts above show that in the feature space all the data points lie on the unit sphere, within a single orthant. This justifies the approach of One-Class SVM, which finds a hyperplane separating the data away from the origin (see Figure 1).

[Figure 1: One-Class SVM — the data and the separating hyperplane pushing it away from the origin.]

2 The Representer Theorem

In the general case, the primal problem P is

    min_{f ∈ H} { C(f, {x_i, y_i}) + Ω(‖f‖_H) },

where {x_i, y_i}, i = 1, …, m, are the training data. Suppose the problem satisfies the following conditions:

1. the loss function C is pointwise, i.e.,

       C(f, {x_i, y_i}) = C({x_i, y_i, f(x_i)}),

   so it depends on f only through {f(x_i)}, the values of f at the data points;

2. Ω(·) is monotonically increasing.

Then the Representer Theorem (Kimeldorf & Wahba, 1971) states that every minimizer of P admits a representation of the form

    f(·) = Σ_{i=1}^m α_i K(·, x_i).

That is, the optimal f* is a linear combination of a finite set of functions determined by the data {x_i}.

This is a powerful result. Although we search for the optimal solution in an infinite-dimensional space (in fact, without the regularization term there are infinitely many solutions passing exactly through all the data), adding the regularization term reduces the problem to a finite-dimensional one.

One way to prove the result is to use a Fourier basis, but this is complicated and involves calculus of variations. Here we give a simple, coordinate-free proof.

Without loss of generality, assume that the second term in P has the form Ω̄(‖f‖²_H). Since this is just a monotonic transform, and the original problem allows any monotonic Ω, we lose no generality.

Consider the linear subspace H_D of H spanned by the functions K(·, x_i), i = 1, …, m. Every f in the Hilbert space H has a unique decomposition into a component in this subspace and a component orthogonal to it:

    f(·) = f_∥(·) + f_⊥(·) = Σ_{i=1}^m α_i K(·, x_i) + f_⊥(·),

where f_⊥ is perpendicular to the subspace H_D, i.e., ⟨f_⊥, K(·, x_i)⟩_H = 0 for all i = 1, …, m.

Using the reproducing property,

    f(x_j) = ⟨f(·), K(·, x_j)⟩ = Σ_i α_i ⟨K(·, x_i), K(·, x_j)⟩ + ⟨f_⊥(·), K(·, x_j)⟩.

The second term vanishes, so

    f(x_j) = Σ_i α_i K(x_j, x_i).

That is, the values of f at the data points depend only on the coefficients {α_i} and not on the perpendicular component f_⊥.

Why is this fact important? Because the loss function C is pointwise, the first term of P depends only on the values of f at the data points. We can therefore partition the functions in H into equivalence classes such that f and f′ are equivalent if and only if f(x_j) = f′(x_j) for all data points x_j. The first term of P is the same for all functions within an equivalence class. For the second term,

    Ω(‖f‖_H) = Ω̄(‖f‖²_H) = Ω̄( ‖Σ_i α_i K(·, x_i)‖²_H + ‖f_⊥‖²_H ).

Since Ω̄ is monotonic, the minimizer of P within each equivalence class (i.e., with the α_i fixed) is the one satisfying ‖f_⊥‖ = 0.

The global minimizer f* of the primal problem P belongs to some equivalence class, and it must be the minimizer within that class. Hence it satisfies ‖f*_⊥‖ = 0, i.e., f* = Σ_i α_i K(·, x_i).
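To see the theorem in action, here is a minimal numerical sketch (not part of the original notes). It instantiates the primal problem P as kernel ridge regression — squared loss with Ω(‖f‖_H) = λ‖f‖²_H — using the Gaussian kernel from Section 1; the data, problem size, and λ are arbitrary illustrative choices. The sketch first checks the two geometric facts (unit norms, nonnegative cosines) and then computes the minimizer purely through the representer coefficients α, by solving an m × m linear system.

# Illustrative sketch (not from the notes): kernel ridge regression as a concrete
# instance of the primal problem P, with the Gaussian kernel.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))                     # training inputs x_1, ..., x_m
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)  # noisy targets

def gaussian_kernel(A, B):
    # K(x, x') = exp(-0.5 * ||x - x'||^2)
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq)

K = gaussian_kernel(X, X)

# Fact 1 from Section 1: every feature vector Phi(x) has unit RKHS norm, K(x, x) = 1.
assert np.allclose(np.diag(K), 1.0)
# Fact 2: cos of the angle between Phi(x) and Phi(x') equals K(x, x') >= 0.
assert np.all(K >= 0)

# Representer Theorem: the minimizer of
#     sum_i (y_i - f(x_i))^2 + lam * ||f||_H^2
# has the form f = sum_i alpha_i K(., x_i). Plugging that form in reduces the
# problem to the finite-dimensional linear system (K + lam * I) alpha = y.
lam = 0.1
alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)

def f_hat(x_new):
    # Evaluate f(x) = sum_i alpha_i K(x, x_i) at new points.
    return gaussian_kernel(np.atleast_2d(x_new), X) @ alpha

print(f_hat(X[:3]), y[:3])

The search over the infinite-dimensional space H collapses to an m-dimensional problem in α, exactly as the theorem promises.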
3 Another Way to Understand SVM: L1 Regularization

The framework above gives us no intuition about why, in SVM, some or most of the α_i are zero. The regularization in the primal problem of SVM, wᵀw, is L2-like. As we saw in the example of linear regression, L1 regularization tends to drive some of the parameters to zero, while L2 regularization usually does not.

One motivation comes from basis-pursuit denoising (Chen & Donoho), which studies the following cost:

    J(α) = ½ ‖f(·) − Σ_{i=1}^N α_i φ_i(·)‖²_{L2} + λ ‖α‖_{L1}.

The L2 norm in the first term cannot be calculated exactly. Instead, we approximate it by averaging over the data points:

    J(α) ≈ (1/2N) Σ_{n=1}^N ( y_n − Σ_{i=1}^N α_i φ_i(x_n) )² + λ Σ_{i=1}^N |α_i|.

This cost does not look like SVM yet, but the L1 regularization does force some of the α_i to be zero.

Girosi pointed out that the following modified cost does lead to SVM:

    J_SVM(α) = ½ ‖f(·) − Σ_{i=1}^N α_i K(·, x_i)‖²_H + λ ‖α‖_{L1},

where the first term uses the RKHS norm instead of the L2 norm. The surprising fact is that with the RKHS norm the first term can be calculated exactly, using the reproducing property:

    J_SVM(α) = ½ ‖f‖²_H − Σ_i α_i ⟨f(·), K(·, x_i)⟩ + ½ Σ_i Σ_j α_i α_j ⟨K(·, x_i), K(·, x_j)⟩ + λ ‖α‖_{L1}
             = ½ ‖f‖²_H − Σ_i α_i f(x_i) + ½ Σ_i Σ_j α_i α_j K(x_i, x_j) + λ Σ_i |α_i|.

The first term, the RKHS norm of f, is independent of the α_i. If we assume that y_i = f(x_i), i.e., the noise-free case, the optimization problem becomes

    min_α  −Σ_i α_i y_i + ½ Σ_i Σ_j α_i α_j K(x_i, x_j) + λ Σ_i |α_i|,

and this is exactly the dual problem of SVM. It has the form of an L2-like term (in the RKHS) plus an L1 regularizer. Notice that formally it looks very different from the primal problem.

4 Introduction to Linear Operators

4.1 Linear operators

For a given Hilbert space H, a function T : H → H is called a linear operator if:

1. T(f_1 + f_2) = T f_1 + T f_2 for any f_1, f_2 ∈ H;
2. T(αf) = α(T f) for any f ∈ H and scalar α.

Some examples of linear operators:

1. In a finite-dimensional Hilbert space, a linear operator A can be represented as an n × n matrix, where n = dim H. Choose a basis {φ_i} for H; if under this basis f has the representation f = Σ_i f_i φ_i and similarly g = Σ_i g_i φ_i, then A has a matrix representation (a_ij) such that g = A f iff g_i = Σ_j a_ij f_j.

2. Integral operators are linear. For example, (T f)(·) = ∫ K(·, x) f(x) dx, where K is a …
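The two examples are connected: once the domain is discretized, the integral operator of the second example becomes exactly the matrix picture of the first. Below is a minimal sketch of this (not from the notes), assuming the Gaussian kernel from Section 1, a uniform grid on [0, 1], and simple Riemann-sum quadrature; all of these are illustrative choices.

# Illustrative sketch (not from the notes): discretizing the integral operator
# (T f)(s) = \int K(s, x) f(x) dx so that it acts as an n x n matrix on function values.
import numpy as np

n = 200
x = np.linspace(0.0, 1.0, n)        # uniform grid on [0, 1]
w = np.full(n, 1.0 / n)             # Riemann-sum quadrature weights (dx = 1/n)

# Gaussian kernel K(s, x) = exp(-0.5 * (s - x)^2) evaluated on the grid.
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)

# (T f)(s_i) ~= sum_j K(s_i, x_j) f(x_j) dx, i.e. a matrix-vector product.
T = K * w

f = np.sin(2 * np.pi * x)           # test functions sampled on the grid
g = np.cos(3 * np.pi * x)

# Linearity of T (the two defining properties), checked numerically.
assert np.allclose(T @ (2.0 * f + g), 2.0 * (T @ f) + T @ g)

print((T @ f)[:5])

Refining the grid makes this matrix an increasingly accurate approximation of the operator T.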

