Support Vector Machines

Nicholas Ruozzi, University of Texas at Dallas
Slides adapted from David Sontag and Vibhav Gogate

Support Vector Machines

Given training data that many different linear classifiers separate perfectly, how can we decide between the perfect classifiers?

Support Vector Machines

Define the margin to be the distance of the closest data point to the classifier.

Support Vector Machines

Support vector machines (SVMs): choose the classifier with the largest margin. SVMs have good practical and theoretical performance.

Some Geometry

In $n$ dimensions, a hyperplane is the solution set of the equation $w^\top x + b = 0$, with $w \in \mathbb{R}^n$ and $b \in \mathbb{R}$. The vector $w$ is sometimes called the normal vector of the hyperplane.

Some Geometry

Note that this equation is scale invariant: for any scalar $c \neq 0$, $(cw)^\top x + cb = 0$ describes the same hyperplane.

Some Geometry

The distance between a point $y$ and a hyperplane $w^\top x + b = 0$ is the length of the segment from the point perpendicular to the hyperplane. If $x'$ is the point on the hyperplane closest to $y$, the vector from $x'$ to $y$ is parallel to $w$, which gives

    $d(y, \text{hyperplane}) = \dfrac{|w^\top y + b|}{\|w\|_2}$

Scale Invariance

By scale invariance, we can assume that $\min_i |w^\top x_i + b| = 1$. The maximum margin is always attained by a hyperplane that is equidistant from the closest data point classified as $+1$ and the closest data point classified as $-1$.

Scale Invariance

We want to maximize the margin subject to the constraints that $y_i(w^\top x_i + b) \ge 1$ for all $i$. But how do we compute the size of the margin?

Some Geometry

Putting it all together: the closest points on either side lie on the hyperplanes $w^\top x + b = 1$ and $w^\top x + b = -1$, and the distance from either of these to the separating hyperplane $w^\top x + b = 0$ is $\frac{1}{\|w\|_2}$, which gives a margin of $\frac{1}{\|w\|_2}$.

SVMs

This analysis yields the following optimization problem:

    $\max_{w,b} \dfrac{1}{\|w\|_2}$ such that $y_i(w^\top x_i + b) \ge 1$ for all $i$,

or equivalently,

    $\min_{w,b} \|w\|_2^2$ such that $y_i(w^\top x_i + b) \ge 1$ for all $i$.

SVMs

This is a standard quadratic programming problem. It falls into the class of convex optimization problems and can be solved with many specialized optimization tools, e.g., quadprog in MATLAB (a Python sketch appears at the end of these notes).

SVMs

Where does the name come from? The data points $x_i$ such that $y_i(w^\top x_i + b) = 1$ are called support vectors. The SVM classifier is completely determined by the support vectors: we can delete the rest of the data and get the same answer.

SVMs

What if the data isn't linearly separable? What if we want to do more than just binary classification, i.e., if $y \in \{1, 2, 3, \dots\}$?

SVMs

What if the data isn't linearly separable? Use feature vectors, or relax the constraints (coming soon). What if we want to do more than just binary classification, i.e., if $y \in \{1, 2, 3, \dots\}$? Two standard reductions follow.

Multiclass Classification

[Figure: training data with three labels, 1, 2, and 3]

One Versus All SVMs

[Figure: one-vs-all separators for the three-label data; regions correctly classified by exactly one classifier]

One Versus All SVMs

Compute a classifier for each label versus the remaining labels, i.e., an SVM with the selected label as plus and the remaining labels changed to minuses. Let $f_\ell$ be the classifier for label $\ell$. For a new data point $x$, classify it as $\arg\max_\ell f_\ell(x)$. Drawback: if there are $K$ possible labels, this requires learning $K$ classifiers, each over the entire data set. (A sketch of both multiclass strategies appears after the one-vs-one discussion below.)

One Versus All SVMs

[Figure: regions in which points are classified by the highest value of $f_\ell$]

One Versus One SVMs

An alternative strategy is to construct a classifier for each of the possible pairs of labels. Given a new data point, classify it by majority vote, i.e., find the most common label among the outputs of all of the pairwise classifiers. If there are $K$ labels, this requires computing $\binom{K}{2}$ different classifiers, each of which uses only a fraction of the data. Drawbacks: it can overfit if some pairs of labels do not have a significant amount of data, plus it can be computationally expensive.
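Both reductions are easy to implement on top of any binary classifier. Below is a minimal sketch, assuming scikit-learn's LinearSVC (a soft-margin linear SVM) as the base binary classifier; the function names are illustrative, not from the slides.

```python
# Sketch of the one-vs-all and one-vs-one reductions described above,
# using scikit-learn's LinearSVC as the binary max-margin classifier.
import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

def one_vs_all_fit(X, y, labels):
    # One classifier per label: the selected label as +1, all others as -1.
    return {k: LinearSVC().fit(X, np.where(y == k, 1, -1)) for k in labels}

def one_vs_all_predict(classifiers, X, labels):
    # Classify by the highest decision value f_l(x) = w_l^T x + b_l.
    scores = np.column_stack([classifiers[k].decision_function(X) for k in labels])
    return np.array(labels)[scores.argmax(axis=1)]

def one_vs_one_fit(X, y, labels):
    # One classifier per pair of labels, trained only on points with those labels.
    clfs = {}
    for a, b in combinations(labels, 2):
        mask = np.isin(y, [a, b])
        clfs[(a, b)] = LinearSVC().fit(X[mask], np.where(y[mask] == a, 1, -1))
    return clfs

def one_vs_one_predict(classifiers, X, labels):
    # Each pairwise classifier votes; predict the most common label.
    votes = np.zeros((X.shape[0], len(labels)), dtype=int)
    index = {k: i for i, k in enumerate(labels)}
    for (a, b), clf in classifiers.items():
        pred = clf.predict(X)  # +1 means label a, -1 means label b
        votes[pred == 1, index[a]] += 1
        votes[pred == -1, index[b]] += 1
    return np.array(labels)[votes.argmax(axis=1)]
```

Note the trade-off from the slides in the code itself: one-vs-all fits $K$ classifiers on the full data set, while one-vs-one fits $\binom{K}{2}$ classifiers, each on the (smaller) subset carrying the two labels in question.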
One Versus One SVMs

[Figure: decision regions determined by majority vote over the pairwise classifiers]
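In practice, common libraries already implement both strategies, so they are rarely coded by hand. A short sketch assuming scikit-learn (to the best of my knowledge, SVC trains one-vs-one pairwise classifiers internally for multiclass data, while LinearSVC defaults to one-vs-rest; the toy data is illustrative):

```python
import numpy as np
from sklearn.svm import SVC, LinearSVC

X = np.array([[0.0, 1.0], [1.0, 2.0], [5.0, 5.0], [6.0, 4.0], [-4.0, 0.0], [-5.0, 1.0]])
y = np.array([1, 1, 2, 2, 3, 3])

ovo = SVC(kernel="linear").fit(X, y)  # one-vs-one pairwise classifiers internally
ovr = LinearSVC().fit(X, y)           # one-vs-rest: one classifier per label
```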
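Finally, returning to the binary hard-margin problem: the slides mention quadprog in MATLAB for the quadratic program $\min_{w,b} \|w\|_2^2$ subject to $y_i(w^\top x_i + b) \ge 1$. As a rough Python equivalent, here is a minimal sketch assuming the cvxpy package and linearly separable data; hard_margin_svm is an illustrative name, not part of any library.

```python
import cvxpy as cp
import numpy as np

def hard_margin_svm(X, y):
    # Solve min (1/2)||w||_2^2 subject to y_i (w^T x_i + b) >= 1 for all i.
    n, d = X.shape
    w, b = cp.Variable(d), cp.Variable()
    constraints = [cp.multiply(y, X @ w + b) >= 1]
    cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints).solve()
    return w.value, b.value

# Toy separable data with labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = hard_margin_svm(X, y)

# Support vectors are exactly the points with y_i (w^T x_i + b) = 1
# (up to numerical tolerance); deleting the other points leaves the
# optimal hyperplane unchanged, as noted in the slides.
support = np.isclose(y * (X @ w + b), 1.0, atol=1e-4)
```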