6 867 Machine learning lecture 8 Jaakkola 1 Lecture topics Support vector machine and kernels Kernel optimization selection Support vector machine revisited Our task here is to rst turn the support vector machine into its dual form where the exam ples only appear in inner products To this end assume we have mapped the examples into feature vectors x of dimension d and that the resulting training set x1 y1 xn yn is linearly separable Finding the maximum margin linear separator in the feature space now corresponds to solving minimize 2 2 subject to yt T xt 0 1 t 1 n 1 We will discuss later on how slack variables a ect the resulting kernel dual form They merely complicate the derivation without changing the procedure Optimization problems of the above type convex linear constraints can be turned into their dual form by means of Lagrange multipliers Speci cally we introduce a non negative scalar parameter t for each inequality constraint and cast the estimation problem in terms of and 1 n n 2 T J 0 2 t yt xt 0 1 2 t 1 The original minimization problem for and 0 is recovered by maximizing J 0 with respect to In other words J 0 max J 0 0 3 where 0 means that all the components t are non negative Let s try to see rst that J 0 really is equivalent to the original problem Suppose we set and 0 such that at least one of the constraints say the one corresponding to xi yi is violated In that case T i yi xi 0 1 0 4 for any i 0 We can then set i to obtain J 0 You can think of the Lagrange multipliers playing an adversarial role to enforce the margin constraints More Cite as Tommi Jaakkola course materials for 6 867 Machine Learning Fall 2006 MIT OpenCourseWare http ocw mit edu Massachusetts Institute of Technology Downloaded on DD Month YYYY 6 867 Machine learning lecture 8 Jaakkola 2 formally J 0 2 2 if yt T xt 0 1 t 1 n otherwise 5 So the minimizing and 0 are therefore those that satisfy the constraints On the basis of a general set of criteria governing the optimality when dealing with Lagrange multipliers criteria known as Slater conditions we can actually switch the maximizing over and the minimization over 0 and get the same answer min max J 0 max min J 0 0 0 0 0 6 The left hand side equivalent to minimizing Eq 5 is known as the primal form while the right hand side is the dual form Let s solve the right hand side by rst obtaining and 0 as a function of the Lagrange multipliers and the data To this end n d J 0 t yt 0 d 0 t 1 7 n d J 0 t yt xt 0 d t 1 8 So again the solution for is in the span of the feature vectors corresponding to the training examples Substituting this form of the solution for back into the objective and taking into account the constraint corresponding to the optimal 0 we get J min J 0 9 0 n n n n T t 1 t 1 2 i 1 j 1 i j yi yj xi xj if t 1 t yt 0 10 otherwise The dual form of the solution is therefore obtained by maximizing n t 1 t 1 2 n n i j yi yj xi T xj 11 i 1 j 1 subject to t 0 n t yt 0 12 t 1 This is the dual or kernel form of the support vector machine and is also a quadratic optimization problem The constraints are simpler however Moreover the dimension of Cite as Tommi Jaakkola course materials for 6 867 Machine Learning Fall 2006 MIT OpenCourseWare http ocw mit edu Massachusetts Institute of Technology Downloaded on DD Month YYYY 6 867 Machine learning lecture 8 Jaakkola 3 the input vectors does not appear explicitly as part of the optimization problem It is formulated solely on the basis of the Gram matrix x1 T x1 x1 T xn K 13 T T xn x1 xn xn We have already seen that the maximum margin hyperplane can be constructed on the basis of only a subset of the training examples This should also also in terms of the feature vectors How will this be manifested in the t s Many of them will be exactly zero due to the optimization In fact they are non zero only for examples feature vectors that are support vectors Once we have solved for t we can classify any new example according to the discriminant function y x T x 0 n t yt xt T x 0 14 15 t 1 t yt xt T x 0 16 t SV where SV is the set of support vectors corresponding to non zero values of t We don t know which examples feature vectors become as support vectors until we have solved the optimization problem Moreover the identity of the support vectors will depend on the feature mapping or the kernel function But what is 0 It appeared to drop out of the optimization problem We can set 0 after solving for t by looking at the support vectors Indeed for all i SV we should have yi T xi 0 yi t xt T xi yi 0 1 17 t SV from which we can easily solve for 0 In principle selecting any support vector would su ce but since we typically solve the quadratic program over t s only up to some resolution these constraints may not be satis ed with equality It is therefore advisable to construct 0 as the median value of the solutions implied by the support vectors What is the geometric margin we attain with some kernel function K x x x T x Cite as Tommi Jaakkola course materials for 6 867 Machine Learning Fall 2006 MIT OpenCourseWare http ocw mit edu Massachusetts Institute of Technology Downloaded on DD Month YYYY 6 867 Machine learning lecture 8 Jaakkola 4 It is still 1 In a kernel form n n 1 2 geom i j yi yj K xi xj 18 i 1 j 1 Would it make sense to compare geometric margins we attain with di erent kernels We could perhaps use it as a criterion for selecting the best kernel function Unfortunately this won t work without some care For example if we multiply all the feature vectors by 2 then the resulting geometric margin will also be twice as large we just expanded the space the relations between the points remain the same It is necessary to perform some normalization before any comparison makes sense We have so far assumed that the examples in their feature representations are linearly separable We d also like to have the kernel form of the relaxed support vector machine formulation 2 minimize 2 C n t 19 subject to yt T xt 0 1 t t 1 n 20 t 1 The resulting dual …
View Full Document
Unlocking...