Machine Learning 10-701/15-781, Spring 2008
Support Vector Machines
Eric Xing
Lecture 8, February 11, 2008
Reading: Chap. 6-7, C.B. book

Outline
- Maximum margin classification
- Constrained optimization
- Lagrangian duality
- Kernel trick
- Non-separable cases

What is a good Decision Boundary?
- Consider a binary classification task with labels y = +1/-1 (not 0/1 as before).
- When the training examples are linearly separable, we can set the parameters of a linear classifier so that all training examples are classified correctly.
- Many decision boundaries achieve this!
  - Generative classifiers
  - Logistic regression
- Are all decision boundaries equally good?
  (Figure: Class 1 and Class 2 points with several candidate separating lines.)

Not All Decision Boundaries Are Equal
- Why we may end up with such boundaries:
  - Irregular distribution
  - Imbalanced training sizes
  - Outliers

Classification and Margin
- Parameterizing the decision boundary:
  - Let $w$ denote a vector orthogonal to the decision boundary and $b$ a scalar offset term; then we can write the decision boundary as $w^T x + b = 0$.
- Margin:
  - $w^T x + b \geq c$ for all $x$ in class 2
  - $w^T x + b \leq -c$ for all $x$ in class 1
  - Or, more compactly: $(w^T x_i + b)\, y_i \geq c$
  - The margin between the two classes is $m = d^- + d^+$, the sum of the distances from the boundary to the closest point on each side.
  (Figure: Class 1 and Class 2 points with the boundary and the distances $d^-$, $d^+$.)

Maximum Margin Classification
- The margin is
  $m = \frac{w^T}{\|w\|}(x_i - x_j) = \frac{2c}{\|w\|}$,
  where $x_i$ and $x_j$ are the closest points on either side of the boundary.
- Here is our maximum margin classification problem:
  $\max_{w,b} \ \frac{2c}{\|w\|}$
  s.t. $y_i(w^T x_i + b) \geq c, \ \forall i$

Maximum Margin Classification (cont'd)
- The optimization problem:
  $\max_{w,b} \ \frac{c}{\|w\|}$
  s.t. $y_i(w^T x_i + b) \geq c, \ \forall i$
- But note that the magnitude of $c$ merely scales $w$ and $b$, and does not change the classification boundary at all! (Why?)
- So we instead work on this cleaner problem:
  $\max_{w,b} \ \frac{1}{\|w\|}$
  s.t. $y_i(w^T x_i + b) \geq 1, \ \forall i$
- The solution to this leads to the famous Support Vector Machines, believed by many to be the best "off-the-shelf" supervised learning algorithm.

Support Vector Machine
- A convex quadratic programming problem with linear constraints:
  $\max_{w,b} \ \frac{1}{\|w\|}$
  s.t. $y_i(w^T x_i + b) \geq 1, \ \forall i$
- The attained margin on each side of the boundary is now given by $d^- = d^+ = \frac{1}{\|w\|}$.
- Only a few of the classification constraints are relevant: the support vectors.

Constrained Optimization
- We could directly solve this using commercial quadratic programming (QP) code (a small numerical sketch is given after the KKT conditions below).
- But we want to take a more careful look at Lagrange duality and the solution of the above problem in its dual form:
  - deeper insight: support vectors, kernels, ...
  - more efficient algorithms

Lagrangian Duality
- The primal problem:
  (Primal)  $\min_w \ f(w)$
  s.t. $g_i(w) \leq 0, \ i = 1, \dots, k$
       $h_i(w) = 0, \ i = 1, \dots, l$
- The generalized Lagrangian:
  $L(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{i=1}^{l} \beta_i h_i(w)$
  The $\alpha$'s ($\alpha_i \geq 0$) and $\beta$'s are called the Lagrangian multipliers.
- Lemma:
  $\max_{\alpha, \beta:\, \alpha_i \geq 0} L(w, \alpha, \beta) = f(w)$ if $w$ satisfies the primal constraints, and $+\infty$ otherwise.
- A re-written primal:
  $\min_w \ \max_{\alpha, \beta:\, \alpha_i \geq 0} L(w, \alpha, \beta)$

Lagrangian Duality (cont.)
- Recall the primal problem: $\min_w \max_{\alpha, \beta:\, \alpha_i \geq 0} L(w, \alpha, \beta)$
- The dual problem: $\max_{\alpha, \beta:\, \alpha_i \geq 0} \min_w L(w, \alpha, \beta)$
- Theorem (weak duality):
  $d^* = \max_{\alpha, \beta:\, \alpha_i \geq 0} \min_w L(w, \alpha, \beta) \ \leq \ \min_w \max_{\alpha, \beta:\, \alpha_i \geq 0} L(w, \alpha, \beta) = p^*$
- Theorem (strong duality):
  Iff there exists a saddle point of $L(w, \alpha, \beta)$, we have $d^* = p^*$.

A Sketch of Strong and Weak Duality
- Now, ignoring $h(x)$ for simplicity, let's look at what is happening graphically in the duality theorems:
  $d^* = \max_{\alpha \geq 0} \min_w \big[ f(w) + \alpha^T g(w) \big] \ \leq \ \min_w \max_{\alpha \geq 0} \big[ f(w) + \alpha^T g(w) \big] = p^*$
  (Figure: the feasible set sketched in the $(g(w), f(w))$ plane.)

The KKT Conditions
- If there exists some saddle point of $L$, then it satisfies the following Karush-Kuhn-Tucker (KKT) conditions:
  $\frac{\partial}{\partial w_i} L(w, \alpha, \beta) = 0, \ i = 1, \dots, n$
  $\frac{\partial}{\partial \beta_i} L(w, \alpha, \beta) = 0, \ i = 1, \dots, l$
  $\alpha_i g_i(w) = 0, \ i = 1, \dots, k$
  $g_i(w) \leq 0, \ i = 1, \dots, k$
  $\alpha_i \geq 0, \ i = 1, \dots, k$
- Theorem: If $w^*$, $\alpha^*$ and $\beta^*$ satisfy the KKT conditions, then they also give a solution to the primal and the dual problems.
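As noted on the "Constrained Optimization" slide, the hard-margin primal can be handed directly to a QP solver. The following is a minimal sketch of that idea, not part of the original lecture: it assumes the cvxpy package is available and uses a made-up, linearly separable toy data set. The dual values returned by the solver for the margin constraints play the role of the $\alpha_i$'s discussed above.

import numpy as np
import cvxpy as cp

# Toy, linearly separable data (labels in {-1, +1}); values are illustrative only.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],   # class +1
              [0.0, 0.0], [1.0, 0.5], [0.5, 1.5]])  # class -1
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
n, d = X.shape

w = cp.Variable(d)
b = cp.Variable()

# Primal: minimize (1/2) w^T w  subject to  y_i (w^T x_i + b) >= 1 for all i.
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
prob = cp.Problem(objective, constraints)
prob.solve()

print("w =", w.value, " b =", b.value)
print("attained margin 1/||w|| =", 1.0 / np.linalg.norm(w.value))
# Dual values of the margin constraints correspond to the alpha_i's:
# most are (near) zero, and the nonzero ones mark the support vectors.
print("alphas =", constraints[0].dual_value)

Note that maximizing $1/\|w\|$ is replaced here by the equivalent minimization of $\frac{1}{2} w^T w$, exactly the reformulation used on the next slide.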
Solving the Optimal Margin Classifier
- Recall our optimization problem:
  $\max_{w,b} \ \frac{1}{\|w\|}$
  s.t. $y_i(w^T x_i + b) \geq 1, \ \forall i$
- This is equivalent to:
  $\min_{w,b} \ \frac{1}{2} w^T w$
  s.t. $1 - y_i(w^T x_i + b) \leq 0, \ \forall i$
- Write the Lagrangian:
  $L(w, b, \alpha) = \frac{1}{2} w^T w - \sum_{i=1}^{m} \alpha_i \big[ y_i(w^T x_i + b) - 1 \big]$
- Recall that the problem can be reformulated as $\min_{w,b} \max_{\alpha_i \geq 0} L(w, b, \alpha)$.
- Now we solve its dual problem: $\max_{\alpha_i \geq 0} \min_{w,b} L(w, b, \alpha)$.

The Dual Problem
$\max_{\alpha_i \geq 0} \min_{w,b} L(w, b, \alpha)$
- We minimize $L$ with respect to $w$ and $b$ first:
  $\nabla_w L(w, b, \alpha) = w - \sum_{i=1}^{m} \alpha_i y_i x_i = 0$
  $\nabla_b L(w, b, \alpha) = -\sum_{i=1}^{m} \alpha_i y_i = 0$
  Note that the first condition implies $w = \sum_{i=1}^{m} \alpha_i y_i x_i$.
- Plugging this back into $L$, and using $\sum_i \alpha_i y_i = 0$, we have:
  $L(w, b, \alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \, x_i^T x_j$

The Dual Problem (cont.)
- Now we have the following dual optimization problem:
  $\max_\alpha \ J(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \, x_i^T x_j$
  s.t. $\alpha_i \geq 0, \ i = 1, \dots, m$
       $\sum_{i=1}^{m} \alpha_i y_i = 0$
- This is, again, a quadratic programming problem!
  - A global maximum over the $\alpha_i$'s can always be found.
- But what's the big deal?
- Note two things:
  1. $w$ can be recovered by $w = \sum_{i=1}^{m} \alpha_i y_i x_i$ (see next slide).
  2. The "kernel" $x_i^T x_j$ (more later).

Support Vectors
- Note the KKT condition: only a few $\alpha_i$'s can be nonzero.
  $\alpha_i g_i(w) = 0, \ i = 1, \dots, k$
- Call the training data points whose $\alpha_i$'s are nonzero the support vectors (SV).
  (Figure: Class 1 and Class 2 points labeled with their $\alpha_i$ values; most are 0, and only the few points closest to the boundary, e.g. $\alpha_1 = 0.8$, $\alpha_6 = 1.4$, $\alpha_8 = 0.6$, are nonzero.)

Support Vector Machines
- Once we have the Lagrange multipliers $\{\alpha_i\}$, we can reconstruct the parameter vector $w$ as a weighted combination of the training examples:
  $w = \sum_{i \in SV} \alpha_i y_i x_i$
- For testing with a new data point $z$:
  - Compute $w^T z + b = \sum_{i \in SV} \alpha_i y_i (x_i^T z) + b$,
    and classify $z$ as class 1 if the sum is positive, and class 2 otherwise.
  - Note: $w$ need not be formed explicitly. (A small dual-form sketch follows below.)
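The sketch below, again an illustration rather than part of the lecture, solves the dual QP directly and then classifies through the support vectors only, mirroring the recovery formula $w = \sum_{i \in SV} \alpha_i y_i x_i$ and the test rule $\sum_{i \in SV} \alpha_i y_i (x_i^T z) + b$. It reuses the made-up toy data from the earlier primal sketch, assumes cvxpy is available, and the $10^{-6}$ threshold for "nonzero $\alpha_i$" is an arbitrary numerical tolerance.

import numpy as np
import cvxpy as cp

# Same toy, linearly separable data as in the primal sketch (labels in {-1, +1}).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [0.0, 0.0], [1.0, 0.5], [0.5, 1.5]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
n = len(y)

alpha = cp.Variable(n)
Yx = y[:, None] * X                       # row i is y_i * x_i
# (1/2) sum_{ij} a_i a_j y_i y_j x_i^T x_j  ==  (1/2) || sum_i a_i y_i x_i ||^2
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(Yx.T @ alpha))
constraints = [alpha >= 0, cp.sum(cp.multiply(y, alpha)) == 0]
cp.Problem(objective, constraints).solve()

a = alpha.value
sv = a > 1e-6                             # support vectors: nonzero alpha_i
w = (a[sv, None] * Yx[sv]).sum(axis=0)    # w = sum_{i in SV} alpha_i y_i x_i
b = float(np.mean(y[sv] - X[sv] @ w))     # each SV satisfies y_i (w^T x_i + b) = 1

def predict(z):
    # w^T z + b evaluated through the support vectors only:
    # sum_{i in SV} alpha_i y_i (x_i^T z) + b, so w is never formed explicitly.
    return np.sign((a[sv] * y[sv]) @ (X[sv] @ z) + b)

print("support vector indices:", np.where(sv)[0])
print("w =", w, " b =", b)
print("predict([2.5, 3.0]) =", predict(np.array([2.5, 3.0])))

Because only the support vectors carry nonzero $\alpha_i$, the prediction sum touches just those few points, which is what makes the dual (and later kernelized) form practical.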