Support Vector Machines and Margin Classifiers 1 Announcements The Autolab experiment phase 1 complete lets do some polls I swapped the SVM and logistic regression classifier lectures Logistic regression classifier HW going out today Readings for Mon and today finally posted Next week we start listening to Eric 2 Outline for today Review of perceptron Because SVMs are like perceptrons except different 3 REVIEW OF THE PERCEPTRON 4 History perceptrons and SVMs Perceptron c 1957 Rosenblatt finished high school in 1946 prototyped s w in 1957 h w in 1960 at Cornell A handsome bachelor he drove a classic MGA sports car and was often seen with his cat named Tobermory Wikipedia analysis I presented last week dates from 1962 hugely influential until limitations started to be charted in 1969 in Minksy and Papert s book Perceptrons then largely forgotten Machine learning from 80 s on Occam s razor curse of dimensionality we knew that learning was hard with few examples and many features Support vector machines mid 1990 s Vladimir Vapnik at AT T Holmdel not well understood or widely used till late 1990 s experimental results better explanation of margins robust implementations e g Cortez Joachims margin result for perceptrons revived again by Freund and Schapire late 1990 s after they worked out a margin based analysis of boosting margins and kernels had huge impact on machine learning since 1990s 5 6 7 The perceptron Rosenblatt 1957 A instance xi B yi Compute yi sign vk xi If mistake vk 1 vk yi xi yi On line setting Adversary A provides student B with an instance x Student B predicts a class 1 1 according to a simple linear classifier sign vk x Adversary gives student the answer 1 1 for that instance Will do a worst case analysis of the mistakes made by the student over any sequence of instances from the adversary that follow a few rules 8 The perceptron after two positive x i A instance xi B yi yi Key idea is margins Intuitively lots of u s correctly classify the data when the margin is large Precisely every x y pair pushes the classifier toward u by in terms of dot product with u Compute yi vk xi If mistake vk 1 vk yi xi Lemma 1 the dot product between vk and u increases with each mistake by at last i e k vk u kg Summary We showed that If exists a u with unit norm that has margin on examples in the seq x1 y1 x2 y2 Then the perceptron algorithm makes R2 2 mistakes on the sequence where R xi Independent of dimension of the data or classifier We don t know if this algorithm could be better 11 Optimization of the margin We showed that If exists a u with unit norm that has margin on examples in the seq x1 y1 x2 y2 Then the perceptron algorithm makes R 2 2 mistakes on the sequence where R xi Independent of dimension of the data or classifier We don t know if this algorithm could be better Idea margins lead to the results suppose we use optimization methods to find a w that has large margin in the data 12 FROM PERCEPTRON TO SVMS 13 Perceptrons vs SVMs A first cut at an optimization problem Given x1 y1 x2 y2 x3 y3 Find some w where w 2 1 and for all i w xi yi 14 Perceptrons vs SVMs Second cut don t ask the user to specify Given x1 y1 x2 y2 x3 y3 Find some w and such that w 1 and for all i w xi yi such that is as large as possible 15 Perceptrons vs SVMs Second cut continued Given x1 y1 x2 y2 x3 y3 We need this why Maximize under the constraint w 2 1 and Otherwise we could increase by multiplying w for all i w xi yi by c 1 Minimize w 2 under the constraint Turns out this is an equivalent optimization for all i w xi yi 1 problem 16 SVMs and optimization Third cut the SVM optimization problem Given x1 y1 x2 y2 x3 y3 Find arg min w 1 w 2 2 such that i yi w x i 1 objective function This is a constrained optimizatio n problem constraints Constrained optimization is tricky solving with off the shelf solvers in this form is doable but expansive 17 http cs229 stanford edu notes cs229 notes3 pd OPTIMIZATION FOR SVMS KEY IDEAS 18 Digression 1 Another implementation of a perceptron instance xi A B yi yi v0 0 Compute yi sign vk xi If mistake vk 1 vk yi xi vk x x P x N P N Compute yi sign x xi x P x xi x N If mistake and yi 0 P P xi If mistake and yi 0 N N xi x yi xi yi xi yi xi 1 1 2 2 k k 19 Digression 1 Another implementation of a perceptron A instance xi B yi yi This looks less efficient but remember P and N will be relatively small sets And it has some advantages everything is in P N Compute yi sign x xi x P x xi x N If mistake and yi 0 P P xi If mistake and yi 0 N N xi 20 Disgression 2 Constrained Optimization via Lagrange Multipliers Suppose we want to maximize f x y subject to a constraint g x y c At the maximal point moving along g in either direction should not increase f i e the contour line for f is tangent to g i e the gradients of y f and g are f x l g x parallel y x y x y 21 Disgression 2 Constrained Optimization via Lagrange Lagrange Multipliers multiplier Clever idea 1 define L x y a f x y a g x y c 2 look at points where x y a L x y a 0 Then a L x y a 0 g x y c x yL x y a 0 x y f x y a x yg x y 22 SVMs and optimization I m putting Can we use this idea for SVM optimization the Given x1 y1 x2 y2 x3 y3 Find 1 2 argmin w b 2 w such that i yi w xi b 1 objective function intercept term b back in This is a constrained optimizatio n problem constraints 1 2 Define a Lagrangian and find min w b L w b a1 a n w 2 a y w x b 1 i i i 2 max 0 of i Skipping a few steps here about 23 SVMs and optimization Define a Lagrangian find minw b max 0 of 1 2 L w b a1 a n w 2 a i yi w xi b 1 2 i Due to KKT conditions at minimum L w b a1 a n 0 and so a i yi 0 b i L w b a1 a n 0 and so w a i yi xi w i 24 What s a support …
View Full Document
Unlocking...