UCI ICS 273A - Support Vector Machines

Support Vector Machines

Max Welling
Department of Computer Science
University of Toronto
10 King’s College Road
Toronto, M5S 3G5 Canada
[email protected]

Abstract

This is a note to explain support vector machines.

1 Preliminaries

Our task is to predict whether a test sample belongs to one of two classes. We receive training examples of the form $\{x_i, y_i\}$, $i = 1, \ldots, n$, with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$. We call $\{x_i\}$ the covariates or input vectors and $\{y_i\}$ the response variables or labels.

We consider a very simple example where the data are in fact linearly separable: i.e. I can draw a straight line $f(x) = w^T x - b$ such that all cases with $y_i = -1$ fall on one side and have $f(x_i) < 0$, and cases with $y_i = +1$ fall on the other side and have $f(x_i) > 0$. Given that we have achieved this, we can classify new test cases according to the rule $y_{\text{test}} = \mathrm{sign}(f(x_{\text{test}}))$.

However, typically there are infinitely many such hyperplanes, obtained by small perturbations of a given solution. How do we choose between all these hyperplanes, which all solve the separation problem for our training data but may perform differently on newly arriving test cases? For instance, we could choose to put the line very close to the members of one particular class, say $y = -1$. Intuitively, when test cases arrive we will not make many mistakes on cases that should be classified as $y = +1$, but we will easily make mistakes on the cases with $y = -1$ (for instance, imagine that a new batch of test cases arrives which are small perturbations of the training data). A sensible choice therefore seems to be to place the separating line as far away from both the $y = -1$ and the $y = +1$ training cases as we can, i.e. right in the middle.

Geometrically, the vector $w$ is directed orthogonal to the line defined by $w^T x = b$. This can be understood as follows. First take $b = 0$. It is then clear that all vectors $x$ with vanishing inner product with $w$ satisfy this equation, i.e. all vectors orthogonal to $w$ satisfy it. Now translate the hyperplane away from the origin over a vector $a$. The equation for the plane becomes $(x - a)^T w = 0$, i.e. we find for the offset $b = a^T w$, which is the projection of $a$ onto the vector $w$. Without loss of generality we may thus choose $a$ perpendicular to the plane, in which case the length $\|a\| = |b|/\|w\|$ represents the shortest, orthogonal distance between the origin and the hyperplane.

We now define two more hyperplanes parallel to the separating hyperplane. They are the planes that cut through the closest training examples on either side. We will call them "support hyperplanes" in the following, because the data vectors they contain support the plane.

We define the distances between these hyperplanes and the separating hyperplane to be $d_+$ and $d_-$ respectively. The margin, $\gamma$, is defined to be $d_+ + d_-$. Our goal is now to find the separating hyperplane so that the margin is largest, while the separating hyperplane is equidistant from both support hyperplanes.

We can write the following equations for the support hyperplanes:
$$w^T x = b + \delta \qquad (1)$$
$$w^T x = b - \delta \qquad (2)$$

We now note that we have over-parameterized the problem: if we scale $w$, $b$ and $\delta$ by a constant factor $\alpha$, the equations for $x$ are still satisfied. To remove this ambiguity we will require that $\delta = 1$; this sets the scale of the problem, i.e. whether we measure distance in millimeters or meters.

We can now also compute the value $d_+ = \bigl|\, |b+1| - |b| \,\bigr| / \|w\| = 1/\|w\|$ (this is only true if $b \notin (-1, 0)$, since the origin does not fall in between the hyperplanes in that case; if $b \in (-1, 0)$ you should use $d_+ = \bigl( |b+1| + |b| \bigr) / \|w\| = 1/\|w\|$). Hence the margin is equal to twice that value: $\gamma = 2/\|w\|$.
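To make the geometry concrete, here is a minimal numpy sketch, with $w$, $b$ and the test point chosen arbitrarily for illustration, that checks the quantities above: the distance $|b|/\|w\|$ from the origin to the hyperplane, the distance $1/\|w\|$ to each support hyperplane, the margin $\gamma = 2/\|w\|$, and the classification rule $\mathrm{sign}(w^T x - b)$.

```python
import numpy as np

# An arbitrary separating hyperplane w^T x = b (values chosen for illustration only).
w = np.array([3.0, 4.0])   # normal vector, ||w|| = 5
b = 2.0                    # offset

norm_w = np.linalg.norm(w)

# Shortest (orthogonal) distance from the origin to the hyperplane w^T x = b.
dist_origin = abs(b) / norm_w                     # |b| / ||w||

# The support hyperplanes are w^T x = b + 1 and w^T x = b - 1.
# Each lies at distance 1/||w|| from the separating plane, so the margin is twice that.
d_plus = 1.0 / norm_w
margin = 2.0 / norm_w                             # gamma = 2 / ||w||

# Classification rule for a test point: y = sign(w^T x - b).
x_test = np.array([1.0, 0.5])
y_test = np.sign(w @ x_test - b)

print(f"distance origin -> plane : {dist_origin:.3f}")
print(f"d_+ = d_-                : {d_plus:.3f}")
print(f"margin gamma = 2/||w||   : {margin:.3f}")
print(f"predicted label          : {y_test:+.0f}")
```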
With the above definition of the support planes we can write down the following constraints that any solution must satisfy:
$$w^T x_i - b \le -1 \quad \forall\, y_i = -1 \qquad (3)$$
$$w^T x_i - b \ge +1 \quad \forall\, y_i = +1 \qquad (4)$$
or, in one equation,
$$y_i (w^T x_i - b) - 1 \ge 0. \qquad (5)$$

We now formulate the primal problem of the SVM:
$$\text{minimize}_{w,b} \quad \tfrac{1}{2}\|w\|^2 \qquad \text{subject to} \quad y_i (w^T x_i - b) - 1 \ge 0 \;\; \forall i \qquad (6)$$

Thus, we maximize the margin, subject to the constraints that all training cases fall on either side of the support hyperplanes. The data cases that lie on the support hyperplanes are called support vectors, since they support the hyperplanes and hence determine the solution to the problem.

The primal problem can be solved by a quadratic program. However, it is not ready to be kernelised, because it does not depend on the data only through inner products between data vectors. Hence, we transform to the dual formulation by first writing the problem using a Lagrangian,
$$L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i \bigl[ y_i (w^T x_i - b) - 1 \bigr]. \qquad (7)$$

The solution that minimizes the primal problem subject to the constraints is given by $\min_{w,b} \max_{\alpha} L(w, b, \alpha)$, i.e. a saddle point problem. When the original objective function is convex (and only then), we may interchange the minimization and maximization. Doing so, we can find the conditions on $w$ and $b$ that must hold at the saddle point we are solving for. This is done by taking derivatives with respect to $w$ and $b$ and setting them to zero:
$$w - \sum_i \alpha_i y_i x_i = 0 \;\;\Rightarrow\;\; w^* = \sum_i \alpha_i y_i x_i \qquad (8)$$
$$\sum_i \alpha_i y_i = 0 \qquad (9)$$

Inserting this back into the Lagrangian (the quadratic term $\tfrac{1}{2}\|w^*\|^2$ partially cancels against the cross term $-\sum_{ij} \alpha_i \alpha_j y_i y_j x_i^T x_j$, and the term $b \sum_i \alpha_i y_i$ vanishes by (9)), we obtain what is known as the dual problem,
$$\text{maximize} \quad L_D = \sum_{i=1}^{N} \alpha_i - \tfrac{1}{2}\sum_{ij} \alpha_i \alpha_j y_i y_j \, x_i^T x_j$$
$$\text{subject to} \quad \sum_i \alpha_i y_i = 0 \qquad (10)$$
$$\alpha_i \ge 0 \;\; \forall i \qquad (11)$$

The dual formulation of the problem is also a quadratic program, but note that the number of variables $\alpha_i$ in this problem is equal to the number of data cases, $N$.

The crucial point is, however, that this problem only depends on the $x_i$ through the inner products $x_i^T x_j$. It is therefore readily kernelised through the substitution $x_i^T x_j \to k(x_i, x_j)$. This is a recurrent theme: the dual problem lends itself to kernelisation, while the primal problem did not.

The theory of duality guarantees that for convex problems the dual problem will be concave, and moreover that the unique solution of the primal problem corresponds to the unique solution of the dual problem. In fact, we have $L_P(w^*) = L_D(\alpha^*)$, i.e. the "duality gap" is zero.

Next we turn to the conditions that must necessarily hold at the saddle point, and thus at the solution of the problem. These are called the KKT conditions (which stands for Karush-Kuhn-Tucker). These conditions are necessary in general, and sufficient for convex optimization problems. They can be derived from the primal problem by setting the derivatives with respect to $w$ and $b$ to zero. Also, the constraints themselves are part of these conditions, and we need that for inequality constraints the Lagrange multipliers are non-negative.
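The dual problem (10)-(11) can be solved with any quadratic-programming routine; the note does not prescribe a particular solver. As a sketch, with the toy dataset and the use of scipy's generic SLSQP optimizer being illustrative assumptions, the code below maximizes $L_D$ for a small linearly separable 2-D dataset, recovers $w^*$ via (8), reads $b$ off the support vectors (the points with $\alpha_i > 0$, for which $y_i(w^T x_i - b) = 1$), and verifies constraint (5) for all training cases.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy dataset (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([+1, +1, +1, -1, -1, -1], dtype=float)
N = len(y)

# Dual objective L_D(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j x_i^T x_j.
# We minimize its negative because scipy minimizes by default.
Q = (y[:, None] * X) @ (y[:, None] * X).T      # Q_ij = y_i y_j x_i^T x_j

def neg_dual(alpha):
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

def neg_dual_grad(alpha):
    return Q @ alpha - np.ones(N)

constraints = [{"type": "eq", "fun": lambda a: a @ y, "jac": lambda a: y}]  # eq. (10)
bounds = [(0.0, None)] * N                                                   # eq. (11)

res = minimize(neg_dual, x0=np.zeros(N), jac=neg_dual_grad,
               bounds=bounds, constraints=constraints, method="SLSQP")
alpha = res.x

# Recover the primal solution: w* = sum_i alpha_i y_i x_i  (eq. 8).
w = (alpha * y) @ X

# Support vectors have alpha_i > 0; for them y_i (w^T x_i - b) = 1, so b = w^T x_i - y_i.
sv = alpha > 1e-6
b = np.mean(X[sv] @ w - y[sv])     # average over support vectors for numerical stability

print("alpha           :", np.round(alpha, 3))
print("w*              :", np.round(w, 3))
print("b               :", round(b, 3))
print("margin 2/||w||  :", round(2.0 / np.linalg.norm(w), 3))
# Every training point should satisfy constraint (5): y_i (w^T x_i - b) - 1 >= 0.
print("constraint (5)  :", np.round(y * (X @ w - b) - 1.0, 3))
```

Only the points lying on the support hyperplanes end up with $\alpha_i > 0$; all other points get $\alpha_i = 0$, which anticipates the complementary-slackness part of the KKT conditions discussed above.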

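Because $L_D$ touches the data only through the matrix of inner products, the kernel substitution $x_i^T x_j \to k(x_i, x_j)$ amounts to swapping one matrix for another. The short sketch below, reusing the toy data from the previous example, builds both the linear Gram matrix and an RBF kernel matrix (the kernel choice and its bandwidth are illustrative assumptions); everything else in the dual problem, the constraints and the solver, stays unchanged.

```python
import numpy as np

# Reusing the toy data from the previous sketch (illustrative values).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([+1, +1, +1, -1, -1, -1], dtype=float)

# Linear case: the dual only needs the Gram matrix of inner products x_i^T x_j.
gram_linear = X @ X.T

# Kernelised case: substitute x_i^T x_j -> k(x_i, x_j), here an RBF kernel
# k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)); any positive-definite kernel works.
sigma = 1.0
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
gram_rbf = np.exp(-sq_dists / (2.0 * sigma ** 2))

# In either case the matrix entering the dual objective is Q_ij = y_i y_j K_ij,
# and L_D(alpha) = sum_i alpha_i - 1/2 alpha^T Q alpha is maximized under (10)-(11).
Q_linear = np.outer(y, y) * gram_linear
Q_rbf = np.outer(y, y) * gram_rbf

print("linear Gram matrix:\n", np.round(gram_linear, 2))
print("RBF kernel matrix:\n", np.round(gram_rbf, 2))
```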
