Solving the optimal margin classifier

- Recall our optimization problem:
$$\max_{\mathbf{w},b}\ \frac{1}{\|\mathbf{w}\|} \quad \text{s.t.}\quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1,\ \forall i$$
- This is equivalent to
$$\min_{\mathbf{w},b}\ \frac{1}{2}\mathbf{w}^T\mathbf{w} \quad \text{s.t.}\quad 1 - y_i(\mathbf{w}^T\mathbf{x}_i + b) \le 0,\ \forall i$$
- Write the Lagrangian:
$$L(\mathbf{w},b,\alpha) = \frac{1}{2}\mathbf{w}^T\mathbf{w} - \sum_{i=1}^m \alpha_i\big[y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1\big]$$
- Recall that the primal problem can be reformulated as $\min_{\mathbf{w},b}\max_{\alpha_i \ge 0} L(\mathbf{w},b,\alpha)$. Now we solve its dual problem: $\max_{\alpha_i \ge 0}\min_{\mathbf{w},b} L(\mathbf{w},b,\alpha)$.

The Dual Problem

- We minimize $L$ with respect to $\mathbf{w}$ and $b$ first:
$$\nabla_{\mathbf{w}} L(\mathbf{w},b,\alpha) = \mathbf{w} - \sum_{i=1}^m \alpha_i y_i \mathbf{x}_i = 0, \qquad \frac{\partial}{\partial b} L(\mathbf{w},b,\alpha) = -\sum_{i=1}^m \alpha_i y_i = 0$$
- Note that the first condition implies $\mathbf{w} = \sum_{i=1}^m \alpha_i y_i \mathbf{x}_i$.
- Plugging this back into $L$ and using the second condition, we have
$$L(\mathbf{w},b,\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j\, y_i y_j\, \mathbf{x}_i^T\mathbf{x}_j$$

The Dual Problem (cont.)

- Now we have the following dual optimization problem:
$$\max_{\alpha}\ J(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j\, y_i y_j\, \mathbf{x}_i^T\mathbf{x}_j \quad \text{s.t.}\quad \alpha_i \ge 0,\ i=1,\dots,m, \qquad \sum_{i=1}^m \alpha_i y_i = 0$$
- This is again a quadratic programming problem.
- A global maximum of the $\alpha_i$'s can always be found.
- But what's the big deal? Note two things:
  1. $\mathbf{w}$ can be recovered by $\mathbf{w} = \sum_{i=1}^m \alpha_i y_i \mathbf{x}_i$ (see the next section).
  2. The data enter only through the inner products $\mathbf{x}_i^T\mathbf{x}_j$ — the kernel (more later).

Support vectors

- Note the KKT condition $\alpha_i\, g_i(\mathbf{w}) = 0$, $i = 1,\dots,m$: only a few $\alpha_i$'s can be nonzero.
- Call the training data points whose $\alpha_i$'s are nonzero the support vectors (SV).
- [Figure: a two-class training set with each point annotated by its $\alpha_i$; only the few points on the margin have nonzero $\alpha_i$ (values such as 0.6, 0.8, 1.4), while all other points have $\alpha_i = 0$.]

Support vector machines

- Once we have the Lagrange multipliers $\{\alpha_i\}$, we can reconstruct the parameter vector $\mathbf{w}$ as a weighted combination of the training examples:
$$\mathbf{w} = \sum_{i \in SV} \alpha_i y_i \mathbf{x}_i$$
- For testing with a new data point $\mathbf{z}$, compute
$$\mathbf{w}^T\mathbf{z} + b = \sum_{i \in SV} \alpha_i y_i\, (\mathbf{x}_i^T\mathbf{z}) + b$$
and classify $\mathbf{z}$ as class 1 if the sum is positive, and class 2 otherwise.
- Note: $\mathbf{w}$ need not be formed explicitly.

Interpretation of support vector machines

- The optimal $\mathbf{w}$ is a linear combination of a small number of data points. This "sparse" representation can be viewed as data compression, as in the construction of a kNN classifier.
- To compute the weights $\{\alpha_i\}$, and to use support vector machines, we need to specify only the inner products (or kernel) between the examples, $\mathbf{x}_i^T\mathbf{x}_j$.
- We make decisions by comparing each new example $\mathbf{z}$ with only the support vectors:
$$y^* = \operatorname{sign}\Big(\sum_{i \in SV} \alpha_i y_i\, \mathbf{x}_i^T\mathbf{z} + b\Big)$$

Non-linearly Separable Problems

- [Figure: Class 1 and Class 2 points that cannot be separated by a hyperplane without errors.]
- We allow "errors" $\xi_i$ in classification; each is based on the output of the discriminant function $\mathbf{w}^T\mathbf{x} + b$.
- The sum of the $\xi_i$ approximates the number of misclassified samples.

Soft Margin Hyperplane

- Now we have a slightly different optimization problem:
$$\min_{\mathbf{w},b}\ \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^m \xi_i \quad \text{s.t.}\quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0,\ \forall i$$
- The $\xi_i$ are "slack variables" in the optimization.
- Note that $\xi_i = 0$ if there is no error for $\mathbf{x}_i$.
- $\sum_i \xi_i$ is an upper bound on the number of training errors.
- $C$ is a tradeoff parameter between error and margin.

The Optimization Problem

- The dual of this new constrained optimization problem is
$$\max_{\alpha}\ J(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j\, y_i y_j\, \mathbf{x}_i^T\mathbf{x}_j \quad \text{s.t.}\quad 0 \le \alpha_i \le C,\ i=1,\dots,m, \qquad \sum_{i=1}^m \alpha_i y_i = 0$$
- This is very similar to the optimization problem in the linearly separable case, except that there is now an upper bound $C$ on $\alpha_i$.
- Once again, a QP solver can be used to find the $\alpha_i$ (a sketch follows below).
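The slides only say that a QP solver can be used; as a concrete illustration, here is a minimal sketch (my addition, not part of the original slides) that hands the soft-margin dual above to the generic QP solver in cvxopt and then recovers $\mathbf{w}$ and $b$ from the support vectors. The helper name `fit_soft_margin_svm`, the small numerical tolerances, and the ridge term are my own choices.

```python
# Minimal sketch: solve the soft-margin dual
#   max_a  sum_i a_i - 1/2 sum_{i,j} a_i a_j y_i y_j x_i^T x_j
#   s.t.   0 <= a_i <= C,   sum_i a_i y_i = 0
# with a generic QP solver (cvxopt here, purely as an illustrative choice).
import numpy as np
from cvxopt import matrix, solvers

def fit_soft_margin_svm(X, y, C=1.0):
    """X: (m, d) inputs; y: (m,) labels in {-1, +1}. Returns (w, b, alpha)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    m = X.shape[0]

    # cvxopt solves: min 1/2 a^T P a + q^T a  s.t.  G a <= h,  A a = b0
    P = matrix(np.outer(y, y) * (X @ X.T) + 1e-10 * np.eye(m))  # tiny ridge for stability
    q = matrix(-np.ones(m))                                     # flips max into min
    G = matrix(np.vstack([-np.eye(m), np.eye(m)]))               # encodes 0 <= a_i <= C
    h = matrix(np.hstack([np.zeros(m), C * np.ones(m)]))
    A = matrix(y.reshape(1, -1))                                 # sum_i a_i y_i = 0
    b0 = matrix(0.0)

    solvers.options['show_progress'] = False
    alpha = np.array(solvers.qp(P, q, G, h, A, b0)['x']).ravel()

    # w = sum_{i in SV} alpha_i y_i x_i;  b from margin SVs with 0 < alpha_i < C
    sv = alpha > 1e-6
    w = np.sum((alpha[sv] * y[sv])[:, None] * X[sv], axis=0)
    on_margin = sv & (alpha < C - 1e-6)     # assumes at least one such point exists
    b = float(np.mean(y[on_margin] - X[on_margin] @ w))
    return w, b, alpha
```

A new point $\mathbf{z}$ is then classified by the sign of `w @ z + b`. On linearly separable data, taking $C$ very large approximately recovers the hard-margin classifier of the earlier sections.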
Extension to Non-linear Decision Boundary

- So far we have only considered a large-margin classifier with a linear decision boundary.
- How do we generalize it to become nonlinear?
- Key idea: transform $\mathbf{x}_i$ to a higher-dimensional space to "make life easier":
  - Input space: the space where the points $\mathbf{x}_i$ are located.
  - Feature space: the space of the $\phi(\mathbf{x}_i)$ after the transformation.

Why transform?

- A linear operation in the feature space is equivalent to a non-linear operation in the input space.
- Classification can become easier with a proper transformation.

Transforming the Data

- [Figure: the map $\phi$ sends points from the input space to the feature space. Note: in practice the feature space is of higher dimension than the input space.]
- Computation in the feature space can be costly because it is high-dimensional; the feature space is typically infinite-dimensional.
- The kernel trick comes to the rescue.

The Kernel Trick

- Recall the SVM optimization problem:
$$\max_{\alpha}\ J(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j\, y_i y_j\, \mathbf{x}_i^T\mathbf{x}_j \quad \text{s.t.}\quad 0 \le \alpha_i \le C,\ i=1,\dots,m, \qquad \sum_{i=1}^m \alpha_i y_i = 0$$
- The data points only appear as inner products.
- As long as we can calculate the inner product in the feature space, we do not need the mapping explicitly.
- Many common geometric operations (angles, distances) can be expressed by inner products.
- Define the kernel function $K$ by $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$.

An example of feature mapping and kernels

- Consider an input $\mathbf{x} = (x_1, x_2)$ and suppose $\phi(\cdot)$ is given as follows:
$$\phi(\mathbf{x}) = \big(1,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2\big)$$
- An inner product in the feature space is
$$\phi(\mathbf{x})^T \phi(\mathbf{x}') = \big(1 + \mathbf{x}^T\mathbf{x}'\big)^2$$
- So if we define the kernel function as $K(\mathbf{x},\mathbf{x}') = (1 + \mathbf{x}^T\mathbf{x}')^2$, there is no need to carry out $\phi(\cdot)$ explicitly (see the numerical check at the end of these notes).

More examples of kernel functions

- Linear kernel (we've seen it): $K(\mathbf{x},\mathbf{x}') = \mathbf{x}^T\mathbf{x}'$.
- Polynomial kernel (we just saw an example): $K(\mathbf{x},\mathbf{x}') = (1 + \mathbf{x}^T\mathbf{x}')^p$ with $p = 2, 3, \dots$ To get the feature vectors, we concatenate all $p$th-order polynomial terms of the components of $\mathbf{x}$, weighted appropriately.
- Radial basis kernel: $K(\mathbf{x},\mathbf{x}') = \exp\big(-\tfrac{1}{2}\|\mathbf{x}-\mathbf{x}'\|^2\big)$. In this case the feature space consists of functions and the result is a nonparametric classifier.

Kernelized SVM

- Training:
$$\max_{\alpha}\ J(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j\, y_i y_j\, K(\mathbf{x}_i,\mathbf{x}_j) \quad \text{s.t.}\quad 0 \le \alpha_i \le C,\ i=1,\dots,m, \qquad \sum_{i=1}^m \alpha_i y_i = 0$$
- Using:
$$y^* = \operatorname{sign}\Big(\sum_{i \in SV} \alpha_i y_i\, K(\mathbf{x}_i,\mathbf{z}) + b\Big)$$

SVM examples

- [Figure: examples of non-linear SVM decision boundaries obtained with a Gaussian kernel.]

Cross-validation error

- The leave-one-out cross-validation error does not …
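As a quick sanity check of the feature-mapping example above (my addition; `phi` and `poly_kernel` are illustrative names, not from the slides), the following sketch verifies numerically that the degree-2 polynomial kernel equals the inner product of the explicit six-dimensional feature map, so $\phi$ never has to be formed when training or testing.

```python
# Sketch: verify that K(x, x') = (1 + x^T x')^2 equals phi(x)^T phi(x')
# for the explicit feature map in the example above.
import numpy as np

def phi(x):
    """Explicit feature map for 2-D input x = (x1, x2)."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

def poly_kernel(x, xp):
    """Degree-2 polynomial kernel K(x, x') = (1 + x^T x')^2."""
    return (1.0 + x @ xp) ** 2

rng = np.random.default_rng(0)
x, xp = rng.normal(size=2), rng.normal(size=2)
assert np.isclose(phi(x) @ phi(xp), poly_kernel(x, xp))   # the two values agree
```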
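To tie the kernelized decision rule to the Gaussian-kernel examples, here is a short sketch (my illustration; the slides do not prescribe any library, and scikit-learn is just a convenient stand-in) that trains an RBF-kernel SVM and re-evaluates its decision function by hand as $\sum_{i \in SV} \alpha_i y_i K(\mathbf{x}_i,\mathbf{z}) + b$. Scikit-learn stores the products $\alpha_i y_i$ of the support vectors in `dual_coef_`.

```python
# Sketch: a Gaussian-kernel SVM trained with scikit-learn, plus a manual
# evaluation of  f(z) = sum_{i in SV} alpha_i y_i K(x_i, z) + b.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(np.hypot(X[:, 0], X[:, 1]) < 1.0, 1, -1)   # not linearly separable

gamma = 0.5                                             # matches exp(-||x - x'||^2 / 2)
clf = SVC(kernel='rbf', gamma=gamma, C=1.0).fit(X, y)

def rbf(a, b):
    """Gaussian kernel K(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2, axis=-1))

z = np.array([0.3, -0.2])
# dual_coef_[0, i] holds alpha_i * y_i for the i-th support vector.
f_manual = np.sum(clf.dual_coef_[0] * rbf(clf.support_vectors_, z)) + clf.intercept_[0]
assert np.isclose(f_manual, clf.decision_function(z[None, :])[0])
print('prediction for z:', int(np.sign(f_manual)))
```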