CS446: Machine Learning, Fall 2017
Lecture 12: Kernel Ridge Regression; Support Vector Machine
Lecturer: Sanmi Koyejo        Scribe: Lan Wang, Oct. 5th, 2017

Agenda
— Recap (Kernels)
— Kernel ridge regression
— Support vector machines

Recall

• Kernel function: We define a kernel function to be a real-valued function of two arguments used to measure similarity between any pair of data inputs $x_i, x_j \in \chi$:
$$\kappa : \chi \times \chi \to \mathbb{R}, \qquad (x_i, x_j) \mapsto \phi(x_i)^T \phi(x_j).$$
Here, $\phi : \mathbb{R}^d \to \mathbb{R}^m$ is the feature map, where $d$ is the dimension of each data input and $m$ is the dimension of the feature space. Usually we have $d \ll m$.

• Mercer kernel [2]: A Mercer kernel is a kernel function for which the Gram matrix
$$K = \begin{pmatrix} \kappa(x_1, x_1) & \cdots & \kappa(x_1, x_n) \\ \vdots & \ddots & \vdots \\ \kappa(x_n, x_1) & \cdots & \kappa(x_n, x_n) \end{pmatrix}$$
is positive semi-definite for any set of data inputs $\{x_i\}_{i=1}^n$.

• Gaussian kernel: The Gaussian kernel, also called the RBF (radial basis function) kernel, is a popular kernel function defined by
$$\kappa(x_i, x_j) = e^{-\frac{\|x_i - x_j\|_2^2}{2\sigma^2}} = e^{-\gamma \|x_i - x_j\|_2^2},$$
where $\gamma = \frac{1}{2\sigma^2}$. Notice that by Taylor expansion,
$$e^{-(x - x')^2} = \sum_{k=0}^{\infty} \frac{2^k x^k (x')^k}{k!} \, e^{-x^2} e^{-(x')^2}.$$
So we can rewrite $\kappa(x_i, x_j)$ as
$$e^{-\gamma \|x_i - x_j\|_2^2} = \sum_{k=0}^{\infty} \left( \frac{(2\gamma)^k (x_i)^k (x_j)^k}{k!} \right) e^{-\gamma \|x_i\|_2^2} \cdot e^{-\gamma \|x_j\|_2^2}. \qquad (1)$$
This implies an infinite-dimensional feature map with components
$$\phi_k(x_i) = e^{-\gamma \|x_i\|_2^2} \cdot \eta_k(x_i),$$
where $\eta_k$ is a function of $x_i$ derived from (1).

Ridge Regression

Ridge regression is a technique for analyzing multiple-regression data that suffer from multicollinearity (i.e., near-linear relationships among the independent variables). To understand ridge regression, let us first recall the regularized linear regression problem discussed before:
$$\min_{\omega} \sum_{i=1}^{n} (y_i - x_i^T \omega)^2 + \lambda \|\omega\|_2^2.$$
Let
$$X_{n \times d} = \begin{pmatrix} x_1^T \\ \vdots \\ x_n^T \end{pmatrix},$$
then the optimal solution is
$$\omega^* = (X^T X + \lambda I)^{-1} X^T y.$$
Using the fact that $(I + AB)^{-1} A = A (I + BA)^{-1}$, we can rewrite $\omega^*$ as
$$\omega^* = X^T (\lambda I + X X^T)^{-1} y.$$
Now we have two ways to compute the optimal solution:
• $\omega^* = (X^T X + \lambda I)^{-1} X^T y$. Computational cost: $O(d^3)$.
• $\alpha = (X X^T + \lambda I)^{-1} y$, then $\omega^* = X^T \alpha$. Computational cost: $O(n^3)$.
Remark: Clearly, $\omega^*$ lies in the row space of $X$. When $d < n$, the first method is more efficient.

Based on the above analysis, prediction at a new point $x_{n+1}$ is
$$f(x_{n+1}) = \omega^T x_{n+1} = x_{n+1}^T X^T \alpha = \sum_{i=1}^{n} (x_{n+1}^T x_i)\, \alpha_i.$$

Now, let us consider ridge regression using feature maps:
$$\min_{\omega} \sum_{i=1}^{n} (y_i - \omega^T \phi(x_i))^2 + \lambda \|\omega\|_2^2. \qquad (2)$$
Let
$$\Phi_{n \times m} = \begin{pmatrix} \phi(x_1)^T \\ \vdots \\ \phi(x_n)^T \end{pmatrix},$$
then similarly there are two ways to compute the optimal solution:
• $\omega^* = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T y$,
• $\alpha = (\Phi \Phi^T + \lambda I)^{-1} y$, then $\omega^* = \Phi^T \alpha$,
so that
$$f(x_{n+1}) = \omega^T \phi(x_{n+1}) = \sum_{i=1}^{n} \phi(x_{n+1})^T \phi(x_i)\, \alpha_i.$$
Define the kernel matrix as
$$K := \Phi \Phi^T = \begin{pmatrix} \phi(x_1)^T \\ \vdots \\ \phi(x_n)^T \end{pmatrix} \begin{pmatrix} \phi(x_1) & \cdots & \phi(x_n) \end{pmatrix} = \begin{pmatrix} \phi(x_1)^T \phi(x_1) & \cdots & \phi(x_1)^T \phi(x_n) \\ \vdots & \ddots & \vdots \\ \phi(x_n)^T \phi(x_1) & \cdots & \phi(x_n)^T \phi(x_n) \end{pmatrix}.$$
Then we can rewrite $\|\omega\|_2^2$, $\alpha$, and $f$ as
$$\|\omega\|_2^2 = \alpha^T \Phi \Phi^T \alpha = \alpha^T K \alpha, \qquad \alpha = (K + \lambda I)^{-1} y,$$
$$f(x) = K_*^T \alpha = \sum_{i=1}^{n} \kappa(x, x_i)\, \alpha_i.$$
Here, $K_* = [\kappa(x, x_1), \ldots, \kappa(x, x_n)]^T = [\phi(x)^T \phi(x_1), \ldots, \phi(x)^T \phi(x_n)]^T$.

Now we are ready to write the original least-squares problem (2) as the following equivalent problem:
$$\min_{\alpha} \|y - K\alpha\|_2^2 + \lambda \alpha^T K \alpha, \qquad (3)$$
and the optimal solution is
$$\alpha^* = (K + \lambda I)^{-1} y.$$
When $\phi(x) \in \mathbb{R}^m$ and $m \geq n$, solving problem (3) is more efficient.

Remark: One reason kernels are useful is that we can work with $\kappa$ alone, without ever constructing the complex feature map $\phi$ that arises in complicated models such as a 1-layer neural network.

[Figure 1: 1-layer neural network]
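As a concrete illustration (an addition to these notes, not part of the lecture), the following minimal NumPy sketch implements kernel ridge regression as derived above: it forms the Gaussian-kernel Gram matrix $K$, solves $\alpha = (K + \lambda I)^{-1} y$ from (3), and predicts with $f(x) = \sum_i \kappa(x, x_i)\,\alpha_i$. The function names, the synthetic data, and the values of $\gamma$ and $\lambda$ are all illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(A, B, gamma):
    """Gram matrix of the RBF kernel: entry (i, j) is exp(-gamma * ||A_i - B_j||^2)."""
    sq_dists = (
        np.sum(A ** 2, axis=1)[:, None]
        + np.sum(B ** 2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-gamma * sq_dists)

def fit_kernel_ridge(X, y, lam, gamma):
    """Dual solution alpha = (K + lam * I)^{-1} y, as in equation (3)."""
    K = gaussian_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict_kernel_ridge(X_train, alpha, X_new, gamma):
    """Prediction f(x) = sum_i kappa(x, x_i) * alpha_i for each row x of X_new."""
    return gaussian_kernel(X_new, X_train, gamma) @ alpha

# Toy usage with synthetic data (arbitrary hyperparameters).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)

alpha = fit_kernel_ridge(X, y, lam=0.1, gamma=0.5)
y_hat = predict_kernel_ridge(X, alpha, X[:5], gamma=0.5)
```

Solving the $n \times n$ linear system with np.linalg.solve (rather than forming an explicit inverse) keeps the cost at the $O(n^3)$ quoted above, which is the cheaper route whenever the feature dimension $m$ (infinite for the Gaussian kernel) exceeds $n$.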
Support Vector Machines (SVMs)

A Support Vector Machine is a discriminative classifier defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new data [1].

The question now is: in which sense is the obtained hyperplane optimal? Intuitively, we want the hyperplane to separate points with different labels as well as possible. So a reasonable definition of the optimal hyperplane is the following:

Optimal hyperplane: a hyperplane that maximizes the margin (the distance from the closest data points to the decision boundary) of the training data.

[Figure 2: linearly separable training data [1]]

Now, let us define a hyperplane formally by a function $f$:
$$f(x) = \omega^T x + \omega_0.$$
The optimal hyperplane can be represented in infinitely many ways by rescaling $\omega$ and $\omega_0$. As a matter of convention, among all possible representations of the hyperplane, the one chosen satisfies
$$|\omega^T x + \omega_0| = 1,$$
where $x$ denotes the training data closest to the hyperplane.

So we have a linear classifier:
$$f(x) \geq 1 \text{ if } y = 1, \qquad f(x) \leq -1 \text{ if } y = -1.$$
By the distance formula, the distance from any data point $x$ satisfying $\omega^T x + \omega_0 = 1$ to the hyperplane is
$$d_+ = \frac{|\omega^T x + \omega_0|}{\|\omega\|_2} = \frac{1}{\|\omega\|_2}.$$
Similarly, for any data point $x$ satisfying $\omega^T x + \omega_0 = -1$, the distance is
$$d_- = \frac{1}{\|\omega\|_2}.$$
So we have
$$\text{Margin} = d_+ + d_- = \frac{2}{\|\omega\|_2}.$$

To find the optimal hyperplane, we need to solve the following optimization problem:
$$\max_{\omega} \frac{2}{\|\omega\|_2} \quad \text{such that} \quad y_i(\omega^T x_i + \omega_0) - 1 \geq 0 \text{ for all } i.$$
An equivalent problem is
$$\min_{\omega} \|\omega\|_2^2 \quad \text{such that} \quad y_i f(x_i) - 1 \geq 0 \text{ for all } i.$$
Now let us solve this problem using the method of Lagrange multipliers $\alpha_i \geq 0$:
$$\min_{\omega, \omega_0} \left( \frac{1}{2}\|\omega\|_2^2 - \sum_{i=1}^{n} \alpha_i \big(y_i f(x_i) - 1\big) \right).$$
The solution of this problem is
$$\omega = \sum_i \alpha_i \, y_i \, x_i,$$
where $\alpha_i > 0$ if $y_i f(x_i) = 1$ and $\alpha_i = 0$ if $y_i f(x_i) > 1$.

In general, the training points closest to the hyperplane, i.e., the vectors in the set
$$\{x_i : y_i f(x_i) = 1\},$$
are called support vectors. By this definition, we can rewrite the solution $\omega$ as
$$\omega = \sum_{\text{support vectors}} \alpha_i \, y_i \, x_i.$$
From this it is clear that removing any points other than the support vectors does not change the solution.

For prediction, we can write $f(x)$ as
$$f(x) = \omega^T x + \omega_0 = \sum_{i=1}^{n} \alpha_i y_i \, x_i^T x + \omega_0.$$
Similarly, for the nonlinear case,
$$f(x) = \sum_{i=1}^{n} \alpha_i y_i \, \phi(x_i)^T \phi(x) + \omega_0 = \sum_{i=1}^{n} \alpha_i y_i \, \kappa(x_i, x) + \omega_0.$$
Since most of the $\alpha_i$ are 0, we only need to sum over the support vectors.
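To connect the support-vector prediction formula to practice, here is a short sketch (again an addition, assuming scikit-learn is available; the data and hyperparameters are arbitrary) that fits an RBF-kernel SVM and then reproduces its decision function by hand as $f(x) = \sum_{\text{support vectors}} \alpha_i y_i \kappa(x_i, x) + \omega_0$. Note that scikit-learn's SVC solves a soft-margin variant of the problem derived above (controlled by the penalty C), but its prediction has the same support-vector form: dual_coef_ stores the products $\alpha_i y_i$ for the support vectors, and intercept_ plays the role of $\omega_0$.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic, roughly separable two-class data (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)),
               rng.normal(2.0, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

def rbf(A, B, gamma):
    """RBF kernel matrix, matching scikit-learn's convention exp(-gamma * ||a - b||^2)."""
    sq = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq)

# Reproduce the decision function by hand:
#   f(x) = sum over support vectors of (alpha_i * y_i) * kappa(x_i, x) + omega_0.
# clf.dual_coef_ holds alpha_i * y_i for the support vectors; clf.intercept_ is omega_0.
X_test = rng.normal(0.0, 2.0, size=(5, 2))
K_star = rbf(X_test, clf.support_vectors_, gamma)               # shape (5, n_sv)
f_manual = K_star @ clf.dual_coef_.ravel() + clf.intercept_[0]  # shape (5,)

print(np.allclose(f_manual, clf.decision_function(X_test)))     # expected: True
```

Only the support vectors enter the sum (dual_coef_ has one entry per support vector), consistent with the observation that removing any other training points would not change the solution.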


Bibliography

[1] Introduction to Support Vector Machines @ONLINE. http://docs.opencv.org/2.4/doc/tutorials/ml/introduction_to_svm/introduction_to_svm.html, 2017.
[2] K. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
