Support Vector Machines
The University of Texas at Dallas

What We Have Learned So Far

1. Decision Trees
2. Naive Bayes
3. Linear Regression
4. Logistic Regression
5. Perceptron
6. Neural Networks
7. K-Nearest Neighbors

Which of the above are linear, and which are not?
- 1, 6, and 7 are non-linear.
- 2 is linear under certain restrictions.

Decision Surfaces

- Decision tree
- Linear functions: $g(x) = w^T x + b$
- Nonlinear functions: neural nets

Today: Support Vector Machine (SVM)

- A classifier derived from statistical learning theory by Vapnik et al. in 1992.
- SVM became famous when, using images as input, it gave accuracy comparable to a neural network with hand-designed features on a handwriting recognition task.
- Currently SVM is widely used in object detection and recognition, content-based image retrieval, text recognition, biometrics, speech recognition, etc.
- SVM is also used for regression (not covered today).
- Reading: Sections 5.1, 5.2, 5.3, 5.11, 5.4 in Bishop; SVM tutorial, start reading from Section 3.

[Photo: V. Vapnik]

Outline

- Linear discriminant function
- Large margin linear classifier
- Nonlinear SVM: the kernel trick
- Demo of SVM

Linear Discriminant Function (a Linear Classifier)

Given data from two classes (labeled $+1$ and $-1$), learn a function of the form
$g(x) = w^T x + b$,
a hyperplane in the feature space with decision boundary $w^T x + b = 0$. Decide class $+1$ if $g(x) \ge 0$ and class $-1$ otherwise.

[Figure: two classes in the $(x_1, x_2)$ plane; filled points denote $+1$, open points denote $-1$.]

Linear Discriminant Function

How would you classify these points using a linear discriminant function in order to minimize the error rate? There are an infinite number of answers. (The slide repeats with several candidate separating lines.) Which one is the best?
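The decision rule $g(x) = w^T x + b$ above can be sketched in a few lines of NumPy. The weight vector and bias below are hand-picked for illustration, not learned:

```python
import numpy as np

def classify(X, w, b):
    """Linear discriminant: return +1 where w.x + b >= 0, else -1."""
    return np.where(X @ w + b >= 0.0, 1, -1)

# Illustrative (hand-picked) hyperplane x1 - x2 = 0 in 2-D.
w = np.array([1.0, -1.0])
b = 0.0
X = np.array([[2.0, 0.0],    # g(x) = 2  -> class +1
              [0.0, 2.0]])   # g(x) = -2 -> class -1
labels = classify(X, w, b)   # array([ 1, -1])
```

Any choice of $(w, b)$ that separates the training points gives zero training error, which is exactly why the question "which one is the best?" needs an extra criterion.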
Large Margin Linear Classifier

The linear discriminant function (classifier) with the maximum margin is the best. The margin is defined as the width by which the boundary could be increased before hitting a data point. Why is it the best?
- The larger the margin, the better the generalization.
- It is robust to outliers.

[Figure: two classes in the $(x_1, x_2)$ plane; the margin is the "safe zone" between the boundary and the nearest points.]

Large Margin Linear Classifier

Aim: learn a large margin classifier. Given a set of data points $\{(x_i, y_i)\}$, require
- for $y_i = +1$: $w^T x_i + b \ge 1$
- for $y_i = -1$: $w^T x_i + b \le -1$
Give an algebraic expression for the width of the margin.

Algebraic Expression for the Width of the Margin

The margin is the distance between the hyperplanes $w^T x + b = 1$ and $w^T x + b = -1$, which is $2 / \|w\|$.

Large Margin Linear Classifier

Mathematical formulation:
maximize $\dfrac{2}{\|w\|}$
such that
- for $y_i = +1$: $w^T x_i + b \ge 1$
- for $y_i = -1$: $w^T x_i + b \le -1$

Common theme in machine learning: LEARNING IS OPTIMIZATION.

Formulation 1 (equivalent):
minimize $\dfrac{1}{2}\|w\|^2$
such that the same constraints hold; they can be written compactly as $y_i(w^T x_i + b) \ge 1$.

This is a quadratic programming (QP) problem with linear constraints:
- off-the-shelf software can solve it.
However, we will convert it to its Lagrangian dual in order to use the kernel trick.

Review: Last Lecture

The Lagrangian method is a general method for converting constrained optimization problems to unconstrained ones (unconstrained optimization problems are easier). When we have equality constraints $h_i(x) = 0$, minimizing $f(x)$ is transformed into working with the Lagrangian function
$L(x, \lambda) = f(x) + \sum_i \lambda_i h_i(x)$.

Review: Last Lecture

Given the Lagrangian function $L(x, \lambda)$:
- $\min_x \max_\lambda L(x, \lambda)$ is the same as solving the primal problem; let its optimal solution be $p^*$.
- It is not the same as solving the following dual problem: $\max_\lambda \min_x L(x, \lambda)$; let its optimal solution be $d^*$ (in general $d^* \le p^*$).

Solving the Optimization Problem

The primal is a quadratic program with linear constraints:
minimize $\dfrac{1}{2}\|w\|^2$ subject to $y_i(w^T x_i + b) \ge 1$.
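As a concrete check of the margin width $2/\|w\|$ and the constraints above, the sketch below uses a hand-picked separator on two toy points (illustrative values, not a trained SVM):

```python
import numpy as np

# Hand-picked separator for illustration: w = (1, 0), b = 0.
w = np.array([1.0, 0.0])
b = 0.0
X = np.array([[ 1.0, 0.5],   # y = +1, lies on w.x + b = +1
              [-1.0, 0.3]])  # y = -1, lies on w.x + b = -1
y = np.array([1, -1])

# Every training point must satisfy y_i (w.x_i + b) >= 1.
constraints_ok = bool(np.all(y * (X @ w + b) >= 1.0))

# Geometric margin: distance between w.x + b = +1 and w.x + b = -1.
margin_width = 2.0 / np.linalg.norm(w)   # here 2/1 = 2
```

Scaling $w$ and $b$ up keeps the same separating boundary but shrinks $2/\|w\|$, which is why fixing the constraint right-hand side at $\pm 1$ makes the margin a well-defined objective.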
Introduce a Lagrange multiplier $\alpha_i \ge 0$ for each constraint. Lagrangian function:
minimize $L_p(w, b, \alpha) = \dfrac{1}{2}\|w\|^2 - \sum_{i=1}^n \alpha_i \left[ y_i(w^T x_i + b) - 1 \right]$, s.t. $\alpha_i \ge 0$.

Solving the Optimization Problem

Setting the derivatives of $L_p$ to zero:
$\dfrac{\partial L_p}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^n \alpha_i y_i x_i$
$\dfrac{\partial L_p}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^n \alpha_i y_i = 0$

Lagrangian dual problem:
maximize $\sum_{i=1}^n \alpha_i - \dfrac{1}{2}\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j x_i^T x_j$
s.t. $\alpha_i \ge 0$ and $\sum_{i=1}^n \alpha_i y_i = 0$.

Solving the Optimization Problem

From these equations we can prove the KKT complementarity condition:
$\alpha_i \left[ y_i(w^T x_i + b) - 1 \right] = 0$.
Thus only the support vectors, the points with $y_i(w^T x_i + b) = 1$, have $\alpha_i > 0$.

The solution has the form
$w = \sum_{i \in SV} \alpha_i y_i x_i$,
and $b$ is obtained from $y_i(w^T x_i + b) - 1 = 0$, where $x_i$ is any support vector.

Solving the Optimization Problem

The linear discriminant function is
$g(x) = w^T x + b = \sum_{i \in SV} \alpha_i y_i \, x_i^T x + b$.
Notice that it relies on a dot product between the test point $x$ and the support vectors $x_i$. Also keep in mind that solving the optimization problem involved computing the dot products $x_i^T x_j$ between all pairs of training points.

Large Margin Linear Classifier

What if the data is not linearly separable (noisy data, outliers, etc.)? Slack variables $\xi_i$ can be added to allow misclassification of difficult or noisy data points.

[Figure: two points with slack $\xi_1$, $\xi_2$ lying inside the margin.]

Large Margin Linear Classifier

Formulation (soft margin):
minimize $\dfrac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i$
such that $y_i(w^T x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$.
(Without slack variables this reduces to the hard-margin problem: minimize $\frac{1}{2}\|w\|^2$ s.t. $y_i(w^T x_i + b) \ge 1$.)
The parameter $C$ can be viewed as a way to control overfitting.

Large Margin Linear Classifier

Lagrangian dual problem (soft margin):
maximize $\sum_{i=1}^n \alpha_i - \dfrac{1}{2}\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j x_i^T x_j$
such that $0 \le \alpha_i \le C$ and $\sum_{i=1}^n \alpha_i y_i = 0$.

Non-linear SVMs

Datasets that are linearly separable with noise work out great. But what are we going to do if the dataset is just too hard?

[Figure: 1-D data on the $x$ axis that no single threshold can separate.]

Kernel Trick

SVM = linear SVM + kernel trick.
(This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt)

Kernel Trick: Motivation

Linear classifiers are well understood, widely used, and efficient.
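The support-vector expansion $w = \sum_{i \in SV} \alpha_i y_i x_i$ derived above can be checked numerically, assuming scikit-learn is available (the attribute names below are scikit-learn's, not from the slides; its `dual_coef_` stores $\alpha_i y_i$ for the support vectors):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs -> linearly separable data.
X = np.vstack([rng.normal(-2.0, 0.5, (20, 2)),
               rng.normal(+2.0, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=10.0).fit(X, y)

# w = sum over support vectors of (alpha_i * y_i) * x_i
w_from_duals = clf.dual_coef_ @ clf.support_vectors_

# It should match the primal weight vector the solver reports.
match = bool(np.allclose(w_from_duals, clf.coef_))
```

Because of the KKT condition, only a handful of the 40 training points end up as support vectors; the rest have $\alpha_i = 0$ and play no role in $g(x)$.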
How can we use linear classifiers to build non-linear ones? Neural networks: construct …
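The kernel idea these slides are heading toward can be previewed numerically. For the feature map $\varphi(x_1, x_2) = (x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2)$, the dot product in feature space equals the squared dot product in input space, $K(u, v) = (u^T v)^2$, so the dual SVM never needs to form $\varphi$ explicitly. This is a standard textbook example, not taken from the slides:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D input."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def poly2_kernel(u, v):
    """Computes phi(u).phi(v) without ever forming phi."""
    return float(u @ v) ** 2

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])

explicit = float(phi(u) @ phi(v))  # dot product in 3-D feature space
via_kernel = poly2_kernel(u, v)    # same number from 2-D inputs only
# Both equal 1.0 for these inputs: (1*3 + 2*(-1))^2 = 1.
```

Since the dual objective and the discriminant $g(x)$ depend on the data only through dot products, replacing $x_i^T x_j$ with $K(x_i, x_j)$ trains a linear SVM in the higher-dimensional feature space at no extra cost.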