Machine Learning 10-701/15-781, Fall 2006
Support Vector Machines
Eric Xing
Lecture 8, October 5, 2006
Reading: Chap. 6 & 7, C.B. book

Outline
- Maximum margin classification
- Constrained optimization
- Lagrangian duality
- Kernel trick
- Non-separable cases

What is a good Decision Boundary?
- Consider a binary classification task with $y = \pm 1$ labels (not 0/1 as before).
- When the training examples are linearly separable, we can set the parameters of a linear classifier so that all the training examples are classified correctly.
- Many decision boundaries!
  - Generative classifiers
  - Logistic regressions
- Are all decision boundaries equally good?

Examples of Bad Decision Boundaries
- Why we may have such boundaries:
  - Irregular distribution
  - Imbalanced training sizes
  - Outliers

Classification and Margin
- Parameterizing the decision boundary: let $w$ denote a vector orthogonal to the decision boundary and $b$ denote a scalar offset term; then we can write the decision boundary as $w^T x + b = 0$.
- $w^T x + b > 0$ for all $x$ in class 2
  $w^T x + b < 0$ for all $x$ in class 1
  Or, more compactly: $(w^T x_i + b)\, y_i > 0$
- The margin between two points $x_i$ (class 2) and $x_j$ (class 1):
  $m = (w^T x_i + b) - (w^T x_j + b) = w^T (x_i - x_j)$
(Figure: the decision boundary $w^T x + b = 0$ with the margin $m$ between the closest points of Class 1 and Class 2.)

Maximum Margin Classification
- The margin is $m = w^T (x_i - x_j)$.
- It makes sense to set constraints on $w$, since $m$ can otherwise be inflated simply by rescaling $w$.
- Here is our Maximum Margin Classification problem:
  $\max_{w,b}\; m$
  $\text{s.t. } y_i(w^T x_i + b) \ge m,\ \forall i; \quad \|w\| = 1$
- Equivalently, we can instead work on this:
  $\max_{w,b}\; \frac{m}{\|w\|}$
  $\text{s.t. } y_i(w^T x_i + b) \ge m,\ \forall i$

Maximum Margin Classification, con'd.
- The optimization problem:
  $\max_{w,b}\; \frac{m}{\|w\|} \quad \text{s.t. } y_i(w^T x_i + b) \ge m,\ \forall i$
- But note that the magnitude of $m$ merely scales $w$ and $b$, and does not change the classification boundary at all!
- So we instead work on this cleaner problem:
  $\max_{w,b}\; \frac{1}{\|w\|} \quad \text{s.t. } y_i(w^T x_i + b) \ge 1,\ \forall i$
- The solution to this leads to the famous Support Vector Machines, believed by many to be the best "off-the-shelf" supervised learning algorithm.

Support vector machine
- A convex quadratic programming problem with linear constraints:
  $\max_{w,b}\; \frac{1}{\|w\|} \quad \text{s.t. } y_i(w^T x_i + b) \ge 1,\ \forall i$
- The attained margin is now given by $\frac{1}{\|w\|}$.
- Only a few of the classification constraints are relevant: the support vectors.

Constrained optimization
- We can directly solve this using commercial quadratic programming (QP) code (a small numerical sketch is given after the KKT conditions below).
- But we want to take a more careful investigation of Lagrange duality, and the solution of the above in its dual form:
  - deeper insight: support vectors, kernels ...
  - a more efficient algorithm

Lagrangian Duality
- The Primal Problem:
  $\min_w\; f(w)$
  $\text{s.t. } g_i(w) \le 0,\ i = 1, \dots, k$
  $\phantom{\text{s.t. }} h_i(w) = 0,\ i = 1, \dots, l$
- The generalized Lagrangian:
  $L(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{i=1}^{l} \beta_i h_i(w)$
  The $\alpha$'s ($\alpha_i \ge 0$) and $\beta$'s are called the Lagrangian multipliers.
- Lemma: $\max_{\alpha, \beta:\, \alpha_i \ge 0} L(w, \alpha, \beta) = f(w)$ if $w$ satisfies the primal constraints, and $\infty$ otherwise.
- A re-written Primal:
  $\min_w \max_{\alpha, \beta:\, \alpha_i \ge 0} L(w, \alpha, \beta)$

Lagrangian Duality, cont.
- Recall the Primal Problem: $\min_w \max_{\alpha, \beta:\, \alpha_i \ge 0} L(w, \alpha, \beta)$
- The Dual Problem: $\max_{\alpha, \beta:\, \alpha_i \ge 0} \min_w L(w, \alpha, \beta)$
- Theorem (weak duality):
  $d^* = \max_{\alpha, \beta:\, \alpha_i \ge 0} \min_w L(w, \alpha, \beta) \;\le\; \min_w \max_{\alpha, \beta:\, \alpha_i \ge 0} L(w, \alpha, \beta) = p^*$
- Theorem (strong duality):
  Iff there exists a saddle point of $L(w, \alpha, \beta)$, we have $d^* = p^*$.

A sketch of strong and weak duality
- Now, ignoring $h(x)$ for simplicity, let's look at what's happening graphically in the duality theorems:
  $d^* = \max_{\alpha \ge 0} \min_w \left[ f(w) + \alpha^T g(w) \right] \;\le\; \min_w \max_{\alpha \ge 0} \left[ f(w) + \alpha^T g(w) \right] = p^*$
(Figure: two panels plotting $f(w)$ versus $g(w)$, one for the strong-duality case and one for the weak-duality case.)

The KKT conditions
- If there exists some saddle point of $L$, then the saddle point satisfies the following Karush-Kuhn-Tucker (KKT) conditions:
  $\frac{\partial}{\partial w_i} L(w, \alpha, \beta) = 0,\ i = 1, \dots, n$
  $\frac{\partial}{\partial \beta_i} L(w, \alpha, \beta) = 0,\ i = 1, \dots, l$
  $\alpha_i g_i(w) = 0,\ i = 1, \dots, k$
  $g_i(w) \le 0,\ i = 1, \dots, k$
  $\alpha_i \ge 0,\ i = 1, \dots, k$
- Theorem: If $w^*$, $\alpha^*$ and $\beta^*$ satisfy the KKT conditions, then they are also a solution to the primal and the dual problems.
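The "Constrained optimization" slide above notes that the maximum-margin problem can be handed directly to off-the-shelf QP code. As a minimal sketch of what that looks like (not part of the original lecture), the Python below solves the equivalent problem $\min_{w,b} \frac{1}{2}\|w\|^2$ s.t. $y_i(w^T x_i + b) \ge 1$ (the next slide makes this equivalence explicit) on a made-up, linearly separable 2-D dataset. The dataset, the variable names, and the use of scipy's general-purpose SLSQP solver in place of commercial QP software are all illustrative assumptions.

```python
# Minimal sketch: solve the hard-margin primal with a generic constrained
# optimizer (scipy's SLSQP stands in for dedicated QP code).
# The toy 2-D data below is made up purely for illustration.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],   # class +1
              [0.0, 0.0], [1.0, 0.5], [0.5, 1.5]])  # class -1
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
d = X.shape[1]

def objective(theta):
    w = theta[:d]
    return 0.5 * w @ w            # (1/2)||w||^2, equivalent to maximizing 1/||w||

def margin_constraints(theta):
    w, b = theta[:d], theta[d]
    return y * (X @ w + b) - 1.0  # y_i (w^T x_i + b) - 1 >= 0 for every i

res = minimize(objective,
               x0=np.zeros(d + 1),
               method="SLSQP",
               constraints=[{"type": "ineq", "fun": margin_constraints}])

w_hat, b_hat = res.x[:d], res.x[d]
print("w =", w_hat, " b =", b_hat,
      " attained margin =", 1.0 / np.linalg.norm(w_hat))
```

In practice a dedicated QP solver (such as the one inside an SVM library) would be used; SLSQP is chosen here only because it handles generic linear inequality constraints in very few lines.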
Solving the optimal margin classifier
- Recall our optimization problem:
  $\max_{w,b}\; \frac{1}{\|w\|} \quad \text{s.t. } y_i(w^T x_i + b) \ge 1,\ \forall i$
- This is equivalent to:
  $\min_{w,b}\; \frac{1}{2} w^T w \quad \text{s.t. } 1 - y_i(w^T x_i + b) \le 0,\ \forall i$
- Write the Lagrangian:
  $L(w, b, \alpha) = \frac{1}{2} w^T w - \sum_{i=1}^{m} \alpha_i \left[ y_i(w^T x_i + b) - 1 \right]$
- Now we solve its dual problem:
  Recall that the primal can be reformulated as $\min_{w,b} \max_{\alpha_i \ge 0} L(w, b, \alpha)$; its dual is $\max_{\alpha_i \ge 0} \min_{w,b} L(w, b, \alpha)$.

The Dual Problem
- $\max_{\alpha_i \ge 0} \min_{w,b} L(w, b, \alpha)$
- We minimize $L$ with respect to $w$ and $b$ first:
  $\nabla_w L(w, b, \alpha) = w - \sum_{i=1}^{m} \alpha_i y_i x_i = 0 \qquad (*)$
  $\frac{\partial}{\partial b} L(w, b, \alpha) = -\sum_{i=1}^{m} \alpha_i y_i = 0 \qquad (**)$
- Note that $(*)$ implies $w = \sum_{i=1}^{m} \alpha_i y_i x_i$.
- Plugging this back into $L$, and using $(**)$, we have:
  $L(w, b, \alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \, x_i^T x_j$

The Dual Problem, cont.
- Now we have the following dual optimization problem:
  $\max_\alpha\; J(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \, x_i^T x_j$
  $\text{s.t. } \alpha_i \ge 0,\ i = 1, \dots, m$
  $\phantom{\text{s.t. }} \sum_{i=1}^{m} \alpha_i y_i = 0$
- This is, again, a quadratic programming problem!
  - A global maximum of the $\alpha_i$ can always be found.
- But what's the big deal?
- Note two things:
  1. $w$ can be recovered by $w = \sum_{i=1}^{m} \alpha_i y_i x_i$  (see next)
  2. The "kernel" $x_i^T x_j$  (more later)

Support vectors
- Note the KKT condition: only a few $\alpha_i$'s can be nonzero,
  $\alpha_i g_i(w) = 0,\ i = 1, \dots, k$
- Call the training data points whose $\alpha_i$'s are nonzero the support vectors (SV).
(Figure: training points labeled with their $\alpha$ values, e.g. $\alpha_1 = 0.8$, $\alpha_6 = 1.4$, $\alpha_8 = 0.6$, and all other $\alpha_i = 0$; only the points with nonzero $\alpha_i$ are support vectors.)

Support vector machines
- Once we have the Lagrange multipliers $\{\alpha_i\}$, we can reconstruct the parameter vector $w$ as a weighted combination of the training examples:
  $w = \sum_{i \in SV} \alpha_i y_i x_i$
- For testing with a new data point $z$:
  - Compute $w^T z + b = \sum_{i \in SV} \alpha_i y_i \left( x_i^T z \right) + b$, and classify $z$ as class 1 if the sum is positive, and class 2 otherwise.
  - Note: $w$ need not be formed explicitly.

Interpretation of support vector machines
- The optimal $w$ is …
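To make the dual problem and the role of the support vectors concrete, here is a minimal sketch (again not part of the original lecture) that solves the dual QP numerically on toy data, reads off the support vectors as the training points with nonzero $\alpha_i$, recovers $w = \sum_{i \in SV} \alpha_i y_i x_i$ and $b$ from one support vector, and classifies a new point without forming $w$ explicitly. The dataset, the $10^{-6}$ "nonzero" threshold, and the use of scipy's SLSQP solver are all illustrative assumptions.

```python
# Minimal sketch: solve the dual  max_a sum_i a_i - 1/2 sum_ij a_i a_j y_i y_j x_i^T x_j
# subject to a_i >= 0 and sum_i a_i y_i = 0, then predict using only the support vectors.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [0.0, 0.0], [1.0, 0.5], [0.5, 1.5]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
m = len(y)

K = X @ X.T                        # Gram matrix of inner products x_i^T x_j (the "kernel")
Q = (y[:, None] * y[None, :]) * K  # Q_ij = y_i y_j x_i^T x_j

def neg_dual(a):
    # Negative of J(alpha); minimizing this maximizes the dual objective.
    return 0.5 * a @ Q @ a - a.sum()

res = minimize(neg_dual,
               x0=np.zeros(m),
               method="SLSQP",
               bounds=[(0.0, None)] * m,                              # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum_i alpha_i y_i = 0
alpha = res.x

sv = alpha > 1e-6                  # support vectors: points with (numerically) nonzero alpha
w = (alpha[sv] * y[sv]) @ X[sv]    # w = sum_{i in SV} alpha_i y_i x_i
b = y[sv][0] - w @ X[sv][0]        # a support vector sits on the margin: y_s (w^T x_s + b) = 1

z = np.array([2.0, 3.0])           # a new test point
score = (alpha[sv] * y[sv]) @ (X[sv] @ z) + b   # w^T z + b, without forming w explicitly
print("support vectors:", np.where(sv)[0],
      " predicted:", "class 1" if score > 0 else "class 2")
```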
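Continuing the sketch above (it reuses X, y, alpha, w and b defined there), the few lines below check the complementary-slackness KKT condition $\alpha_i g_i(w) = 0$ with $g_i(w) = 1 - y_i(w^T x_i + b)$: the constraints are active ($g_i = 0$) exactly at the support vectors, and $\alpha_i = 0$ everywhere else.

```python
# Continuation of the dual sketch above; reuses X, y, alpha, w, b from that block.
# Complementary slackness: alpha_i * g_i(w) = 0 for every i, where
# g_i(w) = 1 - y_i (w^T x_i + b) <= 0 is the i-th margin constraint.
g = 1.0 - y * (X @ w + b)
for i, (a_i, g_i) in enumerate(zip(alpha, g)):
    print(f"i={i}  alpha_i={a_i:6.3f}  g_i(w)={g_i:+8.3f}  alpha_i*g_i={a_i * g_i:+.2e}")
```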