UCI ICS 273A - Support Vector Machines

Support Vector Machines

Max Welling
Department of Computer Science
University of Toronto
10 King’s College Road
Toronto, M5S 3G5 Canada
[email protected]

Abstract

This is a note to explain support vector machines.

1 Preliminaries

Our task is to predict whether a test sample belongs to one of two classes. We receive training examples of the form $\{x_i, y_i\}$, $i = 1, \ldots, n$, with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$. We call $\{x_i\}$ the covariates or input vectors and $\{y_i\}$ the response variables or labels.

We consider a very simple example where the data are in fact linearly separable: i.e. I can draw a straight line $f(x) = w^T x - b$ such that all cases with $y_i = -1$ fall on one side and have $f(x_i) < 0$, and cases with $y_i = +1$ fall on the other side and have $f(x_i) > 0$. Given that we have achieved this, we can classify new test cases according to the rule $y_{\text{test}} = \mathrm{sign}(f(x_{\text{test}}))$.

However, typically there are infinitely many such hyperplanes, obtained by small perturbations of a given solution. How do we choose between all these hyperplanes, which all solve the separation problem for our training data but may perform differently on newly arriving test cases? For instance, we could choose to put the line very close to the members of one particular class, say $y = -1$. Intuitively, when test cases arrive we will not make many mistakes on cases that should be classified as $y = +1$, but we will easily make mistakes on the cases with $y = -1$ (for instance, imagine that a new batch of test cases arrives which are small perturbations of the training data). A sensible choice therefore seems to be to place the separating line as far away from both the $y = -1$ and the $y = +1$ training cases as we can, i.e. right in the middle.

Geometrically, the vector $w$ is directed orthogonal to the line defined by $w^T x = b$. This can be understood as follows. First take $b = 0$. It is then clear that all vectors $x$ with vanishing inner product with $w$ satisfy this equation, i.e. all vectors orthogonal to $w$ satisfy it. Now translate the hyperplane away from the origin over a vector $a$. The equation for the plane becomes $(x - a)^T w = 0$, i.e. we find for the offset $b = a^T w$, which is the projection of $a$ onto the vector $w$. Without loss of generality we may thus choose $a$ perpendicular to the plane, in which case the length $\|a\| = |b|/\|w\|$ represents the shortest, orthogonal distance between the origin and the hyperplane.

We now define two more hyperplanes parallel to the separating hyperplane. They are the planes that cut through the closest training examples on either side. We will call them "support hyperplanes" in the following, because the data vectors they contain support the plane.

We define the distances between these hyperplanes and the separating hyperplane to be $d_+$ and $d_-$ respectively. The margin, $\gamma$, is defined to be $d_+ + d_-$. Our goal is now to find the separating hyperplane so that the margin is largest, while the separating hyperplane is equidistant from both support hyperplanes.

We can write the following equations for the support hyperplanes:
$$w^T x = b + \delta \qquad (1)$$
$$w^T x = b - \delta \qquad (2)$$

We now note that we have over-parameterized the problem: if we scale $w$, $b$ and $\delta$ by a constant factor $\alpha$, the equations for $x$ are still satisfied. To remove this ambiguity we will require that $\delta = 1$; this sets the scale of the problem, i.e. whether we measure distance in millimeters or meters.

We can now also compute the value $d_+ = \bigl|\, |b+1| - |b| \,\bigr| / \|w\| = 1/\|w\|$ (this is only true if $b \notin (-1, 0)$, since the origin does not fall in between the hyperplanes in that case; if $b \in (-1, 0)$ you should use $d_+ = \bigl( |b+1| + |b| \bigr) / \|w\| = 1/\|w\|$). Hence the margin is equal to twice that value: $\gamma = 2/\|w\|$.
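To make the geometry concrete, here is a minimal numpy sketch, with $w$, $b$ and the test point chosen arbitrarily for illustration, that checks the quantities above: the distance $|b|/\|w\|$ from the origin to the hyperplane, the distance $1/\|w\|$ to each support hyperplane, the margin $\gamma = 2/\|w\|$, and the classification rule $\mathrm{sign}(w^T x - b)$.

```python
import numpy as np

# An arbitrary separating hyperplane w^T x = b (values chosen for illustration only).
w = np.array([3.0, 4.0])   # normal vector, ||w|| = 5
b = 2.0                    # offset

norm_w = np.linalg.norm(w)

# Shortest (orthogonal) distance from the origin to the hyperplane w^T x = b.
dist_origin = abs(b) / norm_w                     # |b| / ||w||

# The support hyperplanes are w^T x = b + 1 and w^T x = b - 1.
# Each lies at distance 1/||w|| from the separating plane, so the margin is twice that.
d_plus = 1.0 / norm_w
margin = 2.0 / norm_w                             # gamma = 2 / ||w||

# Classification rule for a test point: y = sign(w^T x - b).
x_test = np.array([1.0, 0.5])
y_test = np.sign(w @ x_test - b)

print(f"distance origin -> plane : {dist_origin:.3f}")
print(f"d_+ = d_-                : {d_plus:.3f}")
print(f"margin gamma = 2/||w||   : {margin:.3f}")
print(f"predicted label          : {y_test:+.0f}")
```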
With the above definition of the support planes we can write down the following constraints that any solution must satisfy:
$$w^T x_i - b \le -1 \quad \forall\, y_i = -1 \qquad (3)$$
$$w^T x_i - b \ge +1 \quad \forall\, y_i = +1 \qquad (4)$$
or, in one equation,
$$y_i (w^T x_i - b) - 1 \ge 0. \qquad (5)$$

We now formulate the primal problem of the SVM:
$$\text{minimize}_{w,b} \quad \tfrac{1}{2}\|w\|^2 \qquad \text{subject to} \quad y_i (w^T x_i - b) - 1 \ge 0 \;\; \forall i \qquad (6)$$

Thus, we maximize the margin, subject to the constraints that all training cases fall on either side of the support hyperplanes. The data cases that lie on the support hyperplanes are called support vectors, since they support the hyperplanes and hence determine the solution to the problem.

The primal problem can be solved by a quadratic program. However, it is not ready to be kernelised, because it does not depend on the data only through inner products between data vectors. Hence, we transform to the dual formulation by first writing the problem using a Lagrangian,
$$L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i \bigl[ y_i (w^T x_i - b) - 1 \bigr]. \qquad (7)$$

The solution that minimizes the primal problem subject to the constraints is given by $\min_{w,b} \max_{\alpha} L(w, b, \alpha)$, i.e. a saddle point problem. When the original objective function is convex (and only then), we may interchange the minimization and maximization. Doing so, we can find the conditions on $w$ and $b$ that must hold at the saddle point we are solving for. This is done by taking derivatives with respect to $w$ and $b$ and setting them to zero:
$$w - \sum_i \alpha_i y_i x_i = 0 \;\;\Rightarrow\;\; w^* = \sum_i \alpha_i y_i x_i \qquad (8)$$
$$\sum_i \alpha_i y_i = 0 \qquad (9)$$

Inserting this back into the Lagrangian (the quadratic term $\tfrac{1}{2}\|w^*\|^2$ partially cancels against the cross term $-\sum_{ij} \alpha_i \alpha_j y_i y_j x_i^T x_j$, and the term $b \sum_i \alpha_i y_i$ vanishes by (9)), we obtain what is known as the dual problem,
$$\text{maximize} \quad L_D = \sum_{i=1}^{N} \alpha_i - \tfrac{1}{2}\sum_{ij} \alpha_i \alpha_j y_i y_j \, x_i^T x_j$$
$$\text{subject to} \quad \sum_i \alpha_i y_i = 0 \qquad (10)$$
$$\alpha_i \ge 0 \;\; \forall i \qquad (11)$$

The dual formulation of the problem is also a quadratic program, but note that the number of variables $\alpha_i$ in this problem is equal to the number of data cases, $N$.

The crucial point is, however, that this problem only depends on the $x_i$ through the inner products $x_i^T x_j$. It is therefore readily kernelised through the substitution $x_i^T x_j \to k(x_i, x_j)$. This is a recurrent theme: the dual problem lends itself to kernelisation, while the primal problem did not.

The theory of duality guarantees that for convex problems the dual problem will be concave, and moreover that the unique solution of the primal problem corresponds to the unique solution of the dual problem. In fact, we have $L_P(w^*) = L_D(\alpha^*)$, i.e. the "duality gap" is zero.

Next we turn to the conditions that must necessarily hold at the saddle point, and thus at the solution of the problem. These are called the KKT conditions (which stands for Karush-Kuhn-Tucker). These conditions are necessary in general, and sufficient for convex optimization problems. They can be derived from the primal problem by setting the derivatives with respect to $w$ and $b$ to zero. Also, the constraints themselves are part of these conditions, and we need that for inequality constraints the Lagrange multipliers are non-negative.
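The dual problem (10)-(11) can be solved with any quadratic-programming routine; the note does not prescribe a particular solver. As a sketch, with the toy dataset and the use of scipy's generic SLSQP optimizer being illustrative assumptions, the code below maximizes $L_D$ for a small linearly separable 2-D dataset, recovers $w^*$ via (8), reads $b$ off the support vectors (the points with $\alpha_i > 0$, for which $y_i(w^T x_i - b) = 1$), and verifies constraint (5) for all training cases.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy dataset (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([+1, +1, +1, -1, -1, -1], dtype=float)
N = len(y)

# Dual objective L_D(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j x_i^T x_j.
# We minimize its negative because scipy minimizes by default.
Q = (y[:, None] * X) @ (y[:, None] * X).T      # Q_ij = y_i y_j x_i^T x_j

def neg_dual(alpha):
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

def neg_dual_grad(alpha):
    return Q @ alpha - np.ones(N)

constraints = [{"type": "eq", "fun": lambda a: a @ y, "jac": lambda a: y}]  # eq. (10)
bounds = [(0.0, None)] * N                                                   # eq. (11)

res = minimize(neg_dual, x0=np.zeros(N), jac=neg_dual_grad,
               bounds=bounds, constraints=constraints, method="SLSQP")
alpha = res.x

# Recover the primal solution: w* = sum_i alpha_i y_i x_i  (eq. 8).
w = (alpha * y) @ X

# Support vectors have alpha_i > 0; for them y_i (w^T x_i - b) = 1, so b = w^T x_i - y_i.
sv = alpha > 1e-6
b = np.mean(X[sv] @ w - y[sv])     # average over support vectors for numerical stability

print("alpha           :", np.round(alpha, 3))
print("w*              :", np.round(w, 3))
print("b               :", round(b, 3))
print("margin 2/||w||  :", round(2.0 / np.linalg.norm(w), 3))
# Every training point should satisfy constraint (5): y_i (w^T x_i - b) - 1 >= 0.
print("constraint (5)  :", np.round(y * (X @ w - b) - 1.0, 3))
```

Only the points lying on the support hyperplanes end up with $\alpha_i > 0$; all other points get $\alpha_i = 0$, which anticipates the complementary-slackness part of the KKT conditions discussed above.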

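Because $L_D$ touches the data only through the matrix of inner products, the kernel substitution $x_i^T x_j \to k(x_i, x_j)$ amounts to swapping one matrix for another. The short sketch below, reusing the toy data from the previous example, builds both the linear Gram matrix and an RBF kernel matrix (the kernel choice and its bandwidth are illustrative assumptions); everything else in the dual problem, the constraints and the solver, stays unchanged.

```python
import numpy as np

# Reusing the toy data from the previous sketch (illustrative values).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([+1, +1, +1, -1, -1, -1], dtype=float)

# Linear case: the dual only needs the Gram matrix of inner products x_i^T x_j.
gram_linear = X @ X.T

# Kernelised case: substitute x_i^T x_j -> k(x_i, x_j), here an RBF kernel
# k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)); any positive-definite kernel works.
sigma = 1.0
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
gram_rbf = np.exp(-sq_dists / (2.0 * sigma ** 2))

# In either case the matrix entering the dual objective is Q_ij = y_i y_j K_ij,
# and L_D(alpha) = sum_i alpha_i - 1/2 alpha^T Q alpha is maximized under (10)-(11).
Q_linear = np.outer(y, y) * gram_linear
Q_rbf = np.outer(y, y) * gram_rbf

print("linear Gram matrix:\n", np.round(gram_linear, 2))
print("RBF kernel matrix:\n", np.round(gram_rbf, 2))
```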
