MIT 9 520 - Support Vector Machines

Support Vector Machines
Ryan M. Rifkin
Google, Inc.
2008

Plan

- Regularization derivation of SVMs
- Geometric derivation of SVMs
- Optimality, duality, and large-scale SVMs

The Regularization Setting (Again)

We are given ℓ examples (x_1, y_1), ..., (x_ℓ, y_ℓ), with x_i ∈ R^n and y_i ∈ {−1, 1} for all i. As mentioned last class, we can find a classification function by solving a regularized learning problem:

    \min_{f \in \mathcal{H}} \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i)) + \lambda \|f\|_K^2.

Note that in this class we specifically consider binary classification.

The Hinge Loss

The classical SVM arises by considering the specific loss function

    V(y, f(x)) \equiv (1 - y f(x))_+,

where

    (k)_+ \equiv \max(k, 0).

[Figure: the hinge loss plotted as a function of y · f(x).]

Substituting In the Hinge Loss

With the hinge loss, our regularization problem becomes

    \min_{f \in \mathcal{H}} \frac{1}{\ell} \sum_{i=1}^{\ell} (1 - y_i f(x_i))_+ + \lambda \|f\|_K^2.

Note that we do not have a 1/2 multiplier on the regularization term.

Slack Variables

This problem is non-differentiable (because of the "kink" in V), so we introduce slack variables ξ_i to make the problem easier to work with:

    \min_{f \in \mathcal{H}} \frac{1}{\ell} \sum_{i=1}^{\ell} \xi_i + \lambda \|f\|_K^2
    subject to:  y_i f(x_i) \ge 1 - \xi_i,   i = 1, ..., \ell
                 \xi_i \ge 0,                i = 1, ..., \ell

Applying the Representer Theorem

Substituting in

    f^*(x) = \sum_{i=1}^{\ell} c_i K(x, x_i),

we arrive at a constrained quadratic programming problem:

    \min_{c \in R^\ell, \xi \in R^\ell} \frac{1}{\ell} \sum_{i=1}^{\ell} \xi_i + \lambda c^T K c
    subject to:  y_i \sum_{j=1}^{\ell} c_j K(x_i, x_j) \ge 1 - \xi_i,   i = 1, ..., \ell
                 \xi_i \ge 0,                                          i = 1, ..., \ell

Adding a Bias Term

If we add an unregularized bias term b, which presents some theoretical difficulties to be discussed later, we arrive at the "primal" SVM:

    \min_{c \in R^\ell, b \in R, \xi \in R^\ell} \frac{1}{\ell} \sum_{i=1}^{\ell} \xi_i + \lambda c^T K c
    subject to:  y_i \left( \sum_{j=1}^{\ell} c_j K(x_i, x_j) + b \right) \ge 1 - \xi_i,   i = 1, ..., \ell
                 \xi_i \ge 0,                                                              i = 1, ..., \ell
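As a quick numerical illustration (not from the slides), the regularized hinge-loss objective above is easy to evaluate directly. The sketch below uses made-up data, arbitrary expansion coefficients c, and a linear kernel, and computes (1/ℓ) Σ (1 − y_i f(x_i))_+ + λ c^T K c with f(x_i) = Σ_j c_j K(x_i, x_j):

```python
import numpy as np

def hinge_loss(y, fx):
    """V(y, f(x)) = (1 - y * f(x))_+, elementwise."""
    return np.maximum(0.0, 1.0 - y * fx)

# Placeholder data: 4 points in R^2 with labels in {-1, +1}.
X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-0.5, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

K = X @ X.T                            # linear kernel Gram matrix
c = np.array([0.1, 0.1, -0.1, -0.1])   # arbitrary expansion coefficients
lam = 0.1
ell = len(y)

f_vals = K @ c                         # f(x_i) = sum_j c_j K(x_i, x_j)
objective = hinge_loss(y, f_vals).sum() / ell + lam * c @ K @ c
```

Since K is positive semidefinite, the regularization term λ c^T K c is nonnegative, so the objective is always at least zero.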
Forming the Lagrangian

For reasons that will become clear in a few slides, we derive the Wolfe dual quadratic program using Lagrange multiplier techniques:

    L(c, \xi, b, \alpha, \zeta) = \frac{1}{\ell} \sum_{i=1}^{\ell} \xi_i + \lambda c^T K c
        - \sum_{i=1}^{\ell} \alpha_i \left( y_i \left\{ \sum_{j=1}^{\ell} c_j K(x_i, x_j) + b \right\} - 1 + \xi_i \right)
        - \sum_{i=1}^{\ell} \zeta_i \xi_i

We want to minimize L with respect to c, b, and ξ, and maximize L with respect to α and ζ, subject to the constraints of the primal problem and nonnegativity constraints on α and ζ.

Eliminating b and ξ

    \partial L / \partial b = 0 \implies \sum_{i=1}^{\ell} \alpha_i y_i = 0
    \partial L / \partial \xi_i = 0 \implies \frac{1}{\ell} - \alpha_i - \zeta_i = 0
        \implies 0 \le \alpha_i \le \frac{1}{\ell}

We write a reduced Lagrangian in terms of the remaining variables:

    L^R(c, \alpha) = \lambda c^T K c - \sum_{i=1}^{\ell} \alpha_i \left( y_i \sum_{j=1}^{\ell} c_j K(x_i, x_j) - 1 \right)

Eliminating c

Assuming the matrix K is invertible,

    \partial L^R / \partial c = 0 \implies 2\lambda K c - K Y \alpha = 0 \implies c_i = \frac{\alpha_i y_i}{2\lambda},

where Y is a diagonal matrix whose i-th diagonal element is y_i; Yα is a vector whose i-th element is α_i y_i.

The Dual Program

Substituting in our expression for c, we are left with the following "dual" program:

    \max_{\alpha \in R^\ell} \sum_{i=1}^{\ell} \alpha_i - \frac{1}{4\lambda} \alpha^T Q \alpha
    subject to:  \sum_{i=1}^{\ell} y_i \alpha_i = 0
                 0 \le \alpha_i \le \frac{1}{\ell},   i = 1, ..., \ell

Here, Q is the matrix defined by

    Q = Y K Y \iff Q_{ij} = y_i y_j K(x_i, x_j).

Standard Notation

In most of the SVM literature, regularization is controlled via a parameter C rather than the regularization parameter λ, defined using the relationship

    C = \frac{1}{2\lambda\ell}.

Using this definition (after multiplying our objective function by the constant 1/(2λ)), the basic regularization problem becomes

    \min_{f \in \mathcal{H}} C \sum_{i=1}^{\ell} V(y_i, f(x_i)) + \frac{1}{2} \|f\|_K^2.

Like λ, the parameter C controls the tradeoff between classification accuracy and the norm of the function. The primal and dual problems become...

The Reparametrized Problems

    \min_{c \in R^\ell, b \in R, \xi \in R^\ell} C \sum_{i=1}^{\ell} \xi_i + \frac{1}{2} c^T K c
    subject to:  y_i \left( \sum_{j=1}^{\ell} c_j K(x_i, x_j) + b \right) \ge 1 - \xi_i,   i = 1, ..., \ell
                 \xi_i \ge 0,                                                              i = 1, ..., \ell

    \max_{\alpha \in R^\ell} \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \alpha^T Q \alpha
    subject to:  \sum_{i=1}^{\ell} y_i \alpha_i = 0
                 0 \le \alpha_i \le C,   i = 1, ..., \ell
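The identities in this derivation can be checked numerically. The sketch below (a made-up positive semidefinite kernel matrix and arbitrary multipliers α, none of it from the slides) builds Q_ij = y_i y_j K(x_i, x_j), recovers c_i = α_i y_i / (2λ), and verifies the stationarity condition 2λKc − KYα = 0 along with the reparametrization C = 1/(2λℓ):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
K = A @ A.T                        # placeholder PSD kernel (Gram) matrix
y = np.array([1.0, -1.0, 1.0, 1.0, -1.0])
lam = 0.5
ell = len(y)

Q = np.outer(y, y) * K             # Q_ij = y_i y_j K(x_i, x_j), i.e. Q = Y K Y

alpha = np.array([0.10, 0.20, 0.05, 0.00, 0.15])   # arbitrary multipliers
dual_value = alpha.sum() - alpha @ Q @ alpha / (4.0 * lam)

c = alpha * y / (2.0 * lam)        # c_i = alpha_i y_i / (2 lambda)
residual = 2.0 * lam * (K @ c) - K @ (y * alpha)   # stationarity in c: should be 0

C = 1.0 / (2.0 * lam * ell)        # the standard reparametrization
```

The residual vanishes by construction, since substituting c = Yα/(2λ) back into 2λKc gives exactly KYα.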
The Geometric Approach

The "traditional" approach to developing the mathematics of the SVM is to start with the concepts of separating hyperplanes and margin. The theory is usually developed in a linear space, beginning with the idea of a perceptron: a linear hyperplane that separates the positive and the negative examples. Defining the margin as the distance from the hyperplane to the nearest example, the basic observation is that, intuitively, we expect a hyperplane with larger margin to generalize better than one with smaller margin.

Large and Small Margin Hyperplanes

[Figure: two panels, (a) and (b), contrasting a large-margin and a small-margin separating hyperplane.]

Classification With Hyperplanes

We denote our hyperplane by its normal vector w, and we classify a new point x via the function

    f(x) = \text{sign}(w \cdot x).   (1)

Given a separating hyperplane w, we let x be a datapoint closest to the hyperplane, and we let x_w be the unique point on the hyperplane that is closest to x. Finding a maximum-margin w is then equivalent to maximizing ||x − x_w||.

Deriving the Maximal Margin, I

For some k (assume k > 0 for convenience),

    w \cdot x = k
    w \cdot x_w = 0
    \implies w \cdot (x - x_w) = k

Deriving the Maximal Margin, II

Noting that the vector x − x_w is parallel to the normal vector w,

    w \cdot (x - x_w) = w \cdot \frac{\|x - x_w\|}{\|w\|} w = \|w\|^2 \frac{\|x - x_w\|}{\|w\|} = \|w\| \|x - x_w\|
    \implies \|w\| \|x - x_w\| = k
    \implies \|x - x_w\| = \frac{k}{\|w\|}

Deriving the Maximal Margin, III

k is a "nuisance parameter". Without loss of generality, we fix k to 1, and see that maximizing ||x − x_w|| is equivalent to maximizing 1/||w||, which in turn is equivalent to minimizing ||w|| or ||w||^2. We can now define the margin as the distance between the hyperplanes w · x = 0 and w · x = 1.

The Linear, Homogeneous, Separable SVM

    \min_{w \in R^n} \|w\|^2
    subject to:  y_i (w \cdot x_i) \ge 1,   i = 1, ..., \ell
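The closed-form distance k/||w|| from this derivation can be checked directly. In the sketch below, w and x are made-up values; x_w is computed as the orthogonal projection of x onto the hyperplane w · z = 0:

```python
import numpy as np

w = np.array([3.0, 4.0])           # hypothetical hyperplane normal, ||w|| = 5
x = np.array([2.0, 1.0])           # hypothetical data point, so w . x = 10 = k

k = w @ x
x_w = x - (k / (w @ w)) * w        # orthogonal projection of x onto w . z = 0
distance = np.linalg.norm(x - x_w)

# distance equals k / ||w|| = 10 / 5 = 2, matching the derivation above.
```

Note that x_w indeed lies on the hyperplane (w · x_w = 0), and x − x_w is a scalar multiple of w, which is exactly the parallelism used in step II.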
Bias and Slack

The SVM introduced by Vapnik includes an unregularized bias term b, leading to classification via a function of the form

    f(x) = \text{sign}(w \cdot x + b).

In practice, we want to work with datasets that are not linearly separable, so we introduce slacks ξ_i, just as before. We can still define the margin as the distance between the hyperplanes w · x = 0 and w · x = 1, but this is no longer ...

