MIT 9.520 - Several Views of Support Vector Machines

Several Views of Support Vector Machines
Ryan M. Rifkin
Honda Research Institute USA, Inc.
Human Intention Understanding Group
2007

Tikhonov Regularization

We are considering algorithms of the form

\[
\min_{f \in \mathcal{H}} \sum_{i=1}^{n} v_i(f(X_i)) + \frac{\lambda}{2} \|f\|_K^2 . \tag{1}
\]

Different loss functions lead to different learning problems. Writing \(y_i = f(X_i)\) for the predicted value at \(X_i\), last class we discussed regularized least squares, obtained by choosing

\[
v_i(y_i) = \tfrac{1}{2} (Y_i - y_i)^2 .
\]

Support vector machines are another Tikhonov regularization algorithm . . .

SVM Motivation: Problems with RLS

RLS uses the square loss, which some might say does not "make sense" for classification. The SVM uses the hinge loss (defined soon), which does "make sense."

Nonlinear RLS does not scale easily to large data sets; the SVM can have better scaling properties.

The SVM also has a (in my opinion weak) geometric motivation: the idea of margin.

A loss function for classification

The most natural loss for classification is probably the 0-1 loss: we pay zero if our prediction has the correct sign, and one otherwise (remember that functions in an RKHS make real-valued predictions).

Unfortunately, the 0-1 loss is not convex, so we have little hope of being able to optimize this loss function in practice. (Note that the representer theorem does hold for the 0-1 loss.)

A solution: the hinge loss, a convex loss that upper bounds the 0-1 loss:

\[
v(y) = \max(1 - yY, 0) = (1 - yY)_+ .
\]

The hinge loss

[Figure: the hinge loss plotted against \(y \cdot f(x)\); the loss is zero for \(y \cdot f(x) \ge 1\) and grows linearly as \(y \cdot f(x)\) decreases below 1.]

Value-based SVM

Substituting the hinge loss into the definition of Tikhonov regularization, we get the optimization problem

\[
\min_{y \in \mathbb{R}^n} \sum_i (1 - y_i Y_i)_+ + \lambda y^t K^{-1} y .
\]

This is (basically) an SVM. So what? How will you solve this problem (find the minimizing y)? The hinge loss is not differentiable, so you cannot take the derivative and set it to zero.

Coefficient-based SVM

Remember that the representer theorem says the answer has the form

\[
f(\cdot) = \sum_i c_i k(X_i, \cdot) .
\]

Using the transformation \(y = Kc\) (or \(c = K^{-1} y\)), we can rewrite the SVM as

\[
\min_{c \in \mathbb{R}^n} \sum_i (1 - Y_i (Kc)_i)_+ + \lambda c^t K c .
\]

Again: so what?

The SVM: So What?

The SVM has many interesting and desirable properties, but these properties are not immediately apparent from the optimization problems we have just written. Optimization theory and geometry lead us to algorithms for solving the problem and to insights into the nature of the solution. We will see that SVMs have a nice sparsity property: many (frequently most) of the \(c_i\) turn out to be zero.

Nondifferentiable Functions and Constraints

We can rewrite a nondifferentiable, piecewise-linear convex function as a differentiable function over constrained variables. Case in point: instead of minimizing

\[
(1 - yY)_+ ,
\]

I can minimize \(\xi\) subject to the constraints

\[
\xi \ge 1 - yY , \qquad \xi \ge 0 .
\]

These are two different ways of looking at the same thing.

The Hinge Loss, Constrained Form

If I want to take a Lagrangian, I need to rewrite the loss function in terms of constraints; the new variables \(\xi_i\) are called slack variables. This rewriting is orthogonal to the issue of whether I think about y or c. In terms of y, we rewrite

\[
\min_{y \in \mathbb{R}^n} \sum_i (1 - y_i Y_i)_+ + \lambda y^t K^{-1} y
\]

as

\[
\begin{aligned}
\min_{y \in \mathbb{R}^n,\ \xi \in \mathbb{R}^n} \quad & \sum_i \xi_i + \lambda y^t K^{-1} y \\
\text{subject to:} \quad & \xi_i \ge 1 - y_i Y_i , \\
& \xi_i \ge 0 .
\end{aligned}
\]

In terms of c

In terms of c, the constrained version of the problem is

\[
\begin{aligned}
\min_{c \in \mathbb{R}^n,\ \xi \in \mathbb{R}^n} \quad & \sum_i \xi_i + \lambda c^t K c \\
\text{subject to:} \quad & \xi_i \ge 1 - Y_i (Kc)_i , \\
& \xi_i \ge 0 .
\end{aligned}
\]

Note how we get rid of the \((1 - Y_i (Kc)_i)_+\) by requiring that the \(\xi_i\) be nonnegative.
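As a concrete illustration of this constrained coefficient form, here is a minimal sketch that builds the problem for a toy data set and hands it to a generic convex solver. It assumes NumPy and CVXPY are installed; the data, the Gaussian kernel, and the values of sigma and lam are invented for illustration and are not from the slides.

```python
import numpy as np
import cvxpy as cp

# Toy data: two Gaussian blobs with labels -1 / +1 (purely illustrative).
rng = np.random.default_rng(0)
n = 40
X = np.vstack([rng.normal(-1.5, 1.0, (n // 2, 2)),
               rng.normal(+1.5, 1.0, (n // 2, 2))])
Y = np.concatenate([-np.ones(n // 2), np.ones(n // 2)])

# Gaussian kernel matrix K_ij = exp(-||X_i - X_j||^2 / (2 sigma^2)).
sigma = 1.0
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq_dists / (2 * sigma**2))

# Express c^t K c as ||L^t c||^2 via a Cholesky factor
# (a tiny ridge keeps the factorization numerically safe).
L = np.linalg.cholesky(K + 1e-8 * np.eye(n))

lam = 0.1
c = cp.Variable(n)
xi = cp.Variable(n)
problem = cp.Problem(
    cp.Minimize(cp.sum(xi) + lam * cp.sum_squares(L.T @ c)),
    [xi >= 1 - cp.multiply(Y, K @ c),  # xi_i >= 1 - Y_i (Kc)_i
     xi >= 0],
)
problem.solve()

# The sparsity noted earlier: many c_i come out (numerically) zero;
# the threshold just absorbs solver precision.
print("nonzero c_i:", int(np.sum(np.abs(c.value) > 1e-4)), "of", n)
```

On a toy problem like this, a generic solver is perfectly adequate; the next slide names exactly this route, and the catch it points out is that such software must work with the full matrix K.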
Solving an SVM, I

Written in terms of c or y (and \(\xi\)), we have a problem where we are trying to minimize a convex quadratic function subject to linear constraints. In optimization theory, this is called a convex quadratic program.

Algorithm I: find or buy software that solves convex quadratic programs. This will work. However, this software generally needs to work with the matrix K, and it will be slower than solving an RLS problem of the same size. As we will see, the SVM has special structure that leads to good algorithms.

The geometric approach

The "traditional" approach to explaining the SVM is via separating hyperplanes and margin. Imagine the positive and negative examples are separable by a linear function (i.e., a hyperplane). Define the margin as the distance from the hyperplane to the nearest example. Intuitively, a larger margin will generalize better.

Large and Small Margin Hyperplanes

[Figure: two panels, (a) and (b), contrasting a large-margin separating hyperplane with a small-margin one.]

Classification With Hyperplanes

Denote the hyperplane by w, with

\[
f(x) = w^t x .
\]

A separating hyperplane satisfies \(y_i (w^t x_i) > 0\) for the entire training set. We are considering homogeneous hyperplanes (i.e., hyperplanes that pass through the origin). Geometrically, when we draw the hyperplane, we are drawing the set \(\{x : w^t x = 0\}\), and the vector w is normal to this set.

Maximizing Margin, I

Given a separating hyperplane w, let \(x_c\) be a training point closest to the hyperplane, and define \(x_w\) to be the unique point in \(\{x : w^t x = 0\}\) that is closest to \(x_c\). (Both \(x_c\) and \(x_w\) depend on w.)

Finding a maximum margin hyperplane is equivalent to finding a w that maximizes \(\|x_c - x_w\|\). For some k (assume k > 0 for convenience),

\[
w^t x_c = k , \qquad w^t x_w = 0 \quad \Longrightarrow \quad w^t (x_c - x_w) = k .
\]

Maximizing Margin, II

Noting that the vector \(x_c - x_w\) is parallel to the normal vector w,

\[
k = w^t (x_c - x_w) = \|w\| \, \|x_c - x_w\| \quad \Longrightarrow \quad \|x_c - x_w\| = \frac{k}{\|w\|} .
\]

Maximizing Margin, III

k is a "nuisance" parameter; WLOG, we fix it to 1. (Scaling the hyperplane by a positive constant changes k and \(\|w\|\), but not \(x_c\) or \(x_w\).) With k fixed, maximizing \(\|x_c - x_w\|\) is equivalent to maximizing \(\frac{1}{\|w\|}\), or minimizing \(\|w\|\), or minimizing \(\|w\|^2\). The margin is now the distance between \(\{x : w^t x = 0\}\) and \(\{x : w^t x = 1\}\). Fixing k is fixing the scale of the function.

The linear homogeneous separable SVM

Phrased as an optimization problem, we have

\[
\begin{aligned}
\min_{w \in \mathbb{R}^n} \quad & \|w\|^2 \\
\text{subject to:} \quad & y_i w^t x_i - 1 \ge 0 , \quad i = 1, \ldots, n .
\end{aligned}
\]

Note that \(\|w\|^2\) is the RKHS norm of a linear function. We are minimizing the RKHS norm, subject to a "hard" loss.

From hard loss to hinge loss

We can introduce slacks \(\xi_i\):

\[
\begin{aligned}
\min_{w \in \mathbb{R}^n,\ \xi \in \mathbb{R}^n} \quad & \|w\|^2 + \sum_i \xi_i \\
\text{subject to:} \quad & \xi_i \ge 1 - y_i w^t x_i , \quad i = 1, \ldots, n , \\
& \xi_i \ge 0 , \quad i = 1, \ldots, n .
\end{aligned}
\]

What happened to our beautiful geometric argument? What is the margin if we . . .
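To make the geometric story concrete, here is a small sketch of the linear homogeneous SVM with slacks, the last optimization problem above, again assuming NumPy and CVXPY; the data are invented for illustration. With k fixed to 1, the margin is read off as \(1/\|w\|\). Note the slide weights the slack term by 1, where a practical soft-margin SVM would weight it by a trade-off constant C.

```python
import numpy as np
import cvxpy as cp

# Two nearly separable blobs; homogeneous hyperplane f(x) = w^t x (no bias).
rng = np.random.default_rng(1)
n, d = 30, 2
X = np.vstack([rng.normal(+2.0, 0.7, (n // 2, d)),
               rng.normal(-2.0, 0.7, (n // 2, d))])
y = np.concatenate([np.ones(n // 2), -np.ones(n // 2)])

w = cp.Variable(d)
xi = cp.Variable(n)
problem = cp.Problem(
    cp.Minimize(cp.sum_squares(w) + cp.sum(xi)),  # ||w||^2 + sum_i xi_i
    [xi >= 1 - cp.multiply(y, X @ w),             # xi_i >= 1 - y_i w^t x_i
     xi >= 0],
)
problem.solve()

w_hat = w.value
print("margin 1/||w||:", 1.0 / np.linalg.norm(w_hat))
print("train accuracy:", float(np.mean(np.sign(X @ w_hat) == y)))
```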

