CS281B/Stat241B: Advanced Topics in Learning & Decision Making

Multiple Kernels and Reproducing Kernel Hilbert Spaces

Lecturer: Michael I. Jordan    Scribes: Blaine Nelson and Sierra Boyd

1 Convex Programming

Figure 1: Diagram of Convex Programming

2 Multiple Kernels

We continue our discussion of multiple kernels from the previous lecture. Recall:
$$K = \sum_i \mu_i K_i, \qquad K \succeq 0, \qquad \mathrm{tr}(K) \le c.$$

Given data pairs $\{(x_i, y_i)\}_{i=1}^n$ (with $x_i \in \mathcal{X}$ and $y_i \in \{-1, 1\}$), let $G(K) = \mathrm{diag}(Y) \cdot K \cdot \mathrm{diag}(Y)$, where $\mathrm{diag}(Y)$ is the diagonal matrix with $y_1, \ldots, y_n$ on its diagonal. Thus the $(i,j)$-th element of $G(K)$ is
$$G(K)_{ij} = y_i y_j K(x_i, x_j).$$

The hard-margin SVM associated with $G(K)$ is represented by the following program:
$$\max_\alpha \; 2\alpha^T e - \alpha^T G(K)\, \alpha \quad \text{s.t.} \quad \alpha^T y = 0, \;\; \alpha_i \ge 0,$$
where $e = (1, 1, \ldots, 1)^T$. Moreover, the soft-margin SVM can be represented by a similar program in which the last constraint is boxed: $0 \le \alpha_i \le C$.

How can we optimize over both $\alpha$ and $\mu$?

Theorem 1 (Multiple Kernel Optimization). The problem of optimizing
$$\max_\alpha \; 2\alpha^T e - \alpha^T (G(K) + \tau I)\, \alpha \quad \text{s.t.} \quad \alpha^T y = 0, \;\; 0 \le \alpha_i \le C$$
with respect to $\{\mu_i\}$, where $\tau$ is a regularization parameter, reduces to:
$$\min_{\{\mu_i\}, t, \lambda, \nu, \delta} \; t$$
$$\text{s.t.} \quad \mathrm{tr}(K) = c, \quad K = \sum_i \mu_i K_i, \quad K \succeq 0,$$
$$\begin{pmatrix} G(K) + \tau I & e + \nu - \delta + \lambda y \\ (e + \nu - \delta + \lambda y)^T & t - 2C\delta^T e \end{pmatrix} \succeq 0, \qquad \nu \ge 0, \;\; \delta \ge 0,$$
which is a Semi-Definite Program (SDP).

Proof. Optimizing with respect to $\{\mu_i\}$ is equivalent to the following program:
$$\min_{\{\mu_i\}} \max_\alpha \; 2\alpha^T e - \alpha^T (G(K) + \tau I)\, \alpha \;\;\equiv\;\; \min_{\{\mu_i\}, t} \; t \quad \text{s.t.} \quad t \ge \max_\alpha \; 2\alpha^T e - \alpha^T (G(K) + \tau I)\, \alpha.$$

To take the dual of the inner maximization, form the Lagrangian
$$L(\alpha, \nu, \lambda, \delta) = 2\alpha^T e - \alpha^T (G(K) + \tau I)\, \alpha + 2\nu^T \alpha + 2\lambda y^T \alpha + 2\delta^T (Ce - \alpha).$$
We now maximize over $\alpha$ to find the dual problem. This is done by setting $\partial L / \partial \alpha$ to 0. Thus we find that
$$\alpha = (G(K) + \tau I)^{-1} (e + \nu - \delta + \lambda y).$$
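As a concrete sanity check of the definitions above, the following sketch builds a convex combination $K = \sum_i \mu_i K_i$ from two base Gram matrices and forms $G(K) = \mathrm{diag}(Y)\,K\,\mathrm{diag}(Y)$. This is a minimal illustration assuming NumPy is available; the kernel choices, weights, and data are illustrative, not part of the lecture:

```python
import numpy as np

def linear_kernel(X):
    # Gram matrix of the linear kernel: K_ij = x_i^T x_j
    return X @ X.T

def rbf_kernel(X, gamma=1.0):
    # Gram matrix of the Gaussian kernel: K_ij = exp(-gamma ||x_i - x_j||^2)
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * d2)

def combined_kernel(kernels, mu):
    # K = sum_i mu_i K_i  (mu_i >= 0 keeps K positive semidefinite)
    return sum(m * K for m, K in zip(mu, kernels))

def G(K, y):
    # G(K) = diag(y) K diag(y), i.e. G(K)_ij = y_i y_j K_ij
    return np.outer(y, y) * K

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
y = np.array([1, -1, 1, 1, -1, -1])

Ks = [linear_kernel(X), rbf_kernel(X)]
mu = [0.3, 0.7]
K = combined_kernel(Ks, mu)
GK = G(K, y)
```

Since each base Gram matrix is positive semidefinite and the weights are nonnegative, the combined $K$ (and hence $G(K)$) remains positive semidefinite, which is what makes the search over $\{\mu_i\}$ a convex problem.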
This yields the following dual problem:
$$\min_{\nu \ge 0,\, \delta \ge 0,\, \lambda} \; \underbrace{(e + \nu - \delta + \lambda y)^T (G(K) + \tau I)^{-1} (e + \nu - \delta + \lambda y) + 2C\delta^T e}_{\beta}.$$
Hence the primal optimum is less than or equal to $t$ if and only if $\exists\, \nu, \delta, \lambda$ such that $\beta \le t$.

But, from the Schur Complement Theorem, we know that for $A \succ 0$,
$$\begin{pmatrix} A & B \\ B^T & D \end{pmatrix} \succeq 0 \iff D - B^T A^{-1} B \succeq 0.$$
But $D - B^T A^{-1} B$ has exactly the form of $t - \beta$, with $A = G(K) + \tau I$, $B = e + \nu - \delta + \lambda y$, and $D = t - 2C\delta^T e$. Moreover, from the problem statement, $G(K) + \tau I \succ 0$. Thus $\beta \le t$ if and only if
$$\begin{pmatrix} G(K) + \tau I & e + \nu - \delta + \lambda y \\ (e + \nu - \delta + \lambda y)^T & t - 2C\delta^T e \end{pmatrix} \succeq 0.$$
This condition is therefore equivalent to $\beta \le t$, which is equivalent to the original min-max problem. $\square$

3 Hilbert Space

A Hilbert space is essentially a Euclidean space, but one that may be infinite-dimensional. It is a vector space (i.e., it is closed under addition and scalar multiplication, obeys the distributive and associative laws, etc.). It is also endowed with an inner product $\langle \cdot, \cdot \rangle$, a bilinear form obeying the following conditions:
$$\langle x + y, z \rangle = \langle x, z \rangle + \langle y, z \rangle$$
$$\langle \alpha x, y \rangle = \alpha \langle x, y \rangle$$
$$\langle x, y \rangle = \langle y, x \rangle$$
$$\langle x, x \rangle \ge 0$$
$$\langle x, x \rangle = 0 \implies x = 0$$

From $\langle \cdot, \cdot \rangle$ we get a norm $\| \cdot \|$ via $\|x\| \triangleq \langle x, x \rangle^{1/2}$. This norm allows us to define notions of convergence. Adding all limit points of Cauchy sequences (sequences in which the distance $\|x_i - x_j\|$ between elements becomes small as $i, j \to \infty$) to our space yields a Hilbert space: a complete inner product space.

Theorem 2 (Cauchy-Schwarz). The inequality
$$\langle x, y \rangle \le \|x\| \, \|y\|$$
is easily proved for any Hilbert space.

3.1 Examples

• $\mathbb{R}^n$: $\langle x, y \rangle = x^T y$
• $L_2$: $\langle x, y \rangle = \int x(t) y(t)\, dt$
• $\ell_2$: $\langle x, y \rangle = \sum_{i=1}^\infty x_i y_i$

4 Reproducing Kernel Hilbert Spaces

The Hilbert space $L_2$ is too "big" for our purposes, containing too many non-smooth functions. One approach to obtaining restricted, smooth spaces is the Reproducing Kernel Hilbert Space (RKHS) approach. An RKHS is "smaller" than a general Hilbert space.

Given a kernel $k(x, x')$, we will construct a Hilbert space such that $k$ defines an inner product in that space. First define the Gram matrix.
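The Schur-complement step in the proof can be checked numerically: for a positive definite block $A$, the bordered matrix is positive semidefinite exactly when $D - B^T A^{-1} B \ge 0$. A minimal sketch assuming NumPy; the function name and test matrices are illustrative choices, not from the lecture:

```python
import numpy as np

def schur_psd(A, b, d):
    # Check both sides of the equivalence, for A positive definite:
    #   [[A, b], [b^T, d]] >= 0   <=>   d - b^T A^{-1} b >= 0
    M = np.block([[A, b[:, None]], [b[None, :], np.array([[d]])]])
    lhs = bool(np.linalg.eigvalsh(M).min() >= -1e-9)   # bordered matrix PSD?
    rhs = bool(d - b @ np.linalg.solve(A, b) >= -1e-9) # Schur complement >= 0?
    return lhs, rhs

rng = np.random.default_rng(1)
P = rng.normal(size=(4, 4))
A = P @ P.T + np.eye(4)          # positive definite, like G(K) + tau*I
b = rng.normal(size=4)
thr = b @ np.linalg.solve(A, b)  # the critical value b^T A^{-1} b

ok_lhs, ok_rhs = schur_psd(A, b, thr + 1.0)   # complement = +1: both True
bad_lhs, bad_rhs = schur_psd(A, b, thr - 1.0) # complement = -1: both False
```

In the proof, $A = G(K) + \tau I$, $b = e + \nu - \delta + \lambda y$, and $d = t - 2C\delta^T e$, so the PSD constraint on the bordered matrix encodes $\beta \le t$ as a linear matrix inequality in the decision variables.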
Given points $x_1, x_2, \ldots, x_n$, define:
$$K_{ij} = k(x_i, x_j).$$
We say that the kernel $k$ is positive definite (p.d.) if its Gram matrix is positive definite for all choices of $x_1, x_2, \ldots, x_n$. Similarly, we define positive semidefinite (p.s.d.), negative definite, and negative semidefinite kernels by the corresponding properties of the Gram matrix.

The Cauchy-Schwarz inequality holds for p.s.d. kernels:
$$k(x_1, x_2)^2 \le k(x_1, x_1)\, k(x_2, x_2).$$
Proof: Form the Gram matrix of the two points $x_1$ and $x_2$:
$$K = \begin{pmatrix} k(x_1, x_1) & k(x_1, x_2) \\ k(x_2, x_1) & k(x_2, x_2) \end{pmatrix}.$$
For $K$ to be positive semidefinite as a matrix, its determinant must be nonnegative:
$$k(x_1, x_1)\, k(x_2, x_2) - k(x_2, x_1)\, k(x_1, x_2) \ge 0,$$
which implies Cauchy-Schwarz.

Define the following reproducing kernel map for a kernel $k(\cdot, \cdot)$:
$$\Phi : x \longmapsto k(\cdot, x).$$
That is, to each point $x$ in the original space we associate a function $k(\cdot, x)$.

Example: Gaussian kernel. Each point $x$ maps to a Gaussian centered at that point, as depicted in Fig. 2. Intuitively this captures the similarity of $x$ to all other points.

Figure 2: The mapping of an input space to a Gaussian feature space.

We now construct a vector space containing all linear combinations of the functions $k(\cdot, x)$:
$$f(\cdot) = \sum_{i=1}^m \alpha_i\, k(\cdot, x_i),$$
for arbitrary $\{x_i \in \mathcal{X}\}_{i=1}^m$. This will be used to construct our RKHS $\mathcal{H}$.

We now define an inner product. Let $f(\cdot) = \sum_i \alpha_i\, k(\cdot, x_i)$ and $g(\cdot) = \sum_j \beta_j\, k(\cdot, x'_j)$ with $f, g \in \mathcal{H}$, and define:
$$\langle f, g \rangle \triangleq \sum_{i=1}^m \sum_{j=1}^{m'} \alpha_i \beta_j\, k(x_i, x'_j).$$
We need to verify that this in fact defines an inner product. Symmetry is obvious: $\langle f, g \rangle = \langle g, f \rangle$. Linearity is easy to show. In the next lecture we will establish the key property: $\langle f, f \rangle = 0 \implies f = 0$.

Kernels are analogs of Dirac delta functions. Consider $L_2$ (which is not an RKHS). We have:
$$f(x) = \int f(t)\, \delta(t, x)\, dt,$$
where $\delta(t, x)$ is the Dirac delta function. The Dirac delta function is the representer of evaluation for $L_2$, but of course it is not itself in $L_2$. (This is consistent with the fact that $L_2$ is not an RKHS.)
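The property that gives the space its name, $\langle f, k(\cdot, x) \rangle = f(x)$, follows directly from the inner product just defined by taking $g = k(\cdot, x)$, i.e. $m' = 1$, $\beta_1 = 1$. This can be verified numerically with a minimal sketch, assuming NumPy and a Gaussian kernel on scalars (function names and the sample points are illustrative):

```python
import numpy as np

def k(x, z, gamma=0.5):
    # Gaussian kernel on scalars: k(x, z) = exp(-gamma (x - z)^2)
    return np.exp(-gamma * (x - z) ** 2)

def inner(alpha, xs, beta, zs):
    # <f, g> = sum_i sum_j alpha_i beta_j k(x_i, z_j)
    return sum(a * b * k(xi, zj)
               for a, xi in zip(alpha, xs)
               for b, zj in zip(beta, zs))

def f_eval(alpha, xs, x):
    # f(x) = sum_i alpha_i k(x, x_i)
    return sum(a * k(x, xi) for a, xi in zip(alpha, xs))

alpha, xs = [0.5, -1.2, 2.0], [0.0, 1.0, -0.7]
x0 = 0.3

# Reproducing property: <f, k(., x0)> should equal f(x0).
lhs = inner(alpha, xs, [1.0], [x0])
rhs = f_eval(alpha, xs, x0)
```

The agreement is exact (up to floating point) because $\langle f, k(\cdot, x_0) \rangle = \sum_i \alpha_i\, k(x_i, x_0) = f(x_0)$ by symmetry of $k$; this evaluation functional is what the Dirac delta provides for $L_2$, except that $k(\cdot, x_0)$ lives inside the space.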