Machine Learning
CS6375 --- Spring 2015
Support Vector Machines (II)

The Dual Problem

Maximize
$$L = \sum_{i=1}^{R} \alpha_i \;-\; \frac{1}{2} \sum_{i=1}^{R} \sum_{j=1}^{R} \alpha_i \alpha_j\, y_i y_j\, x_i' x_j$$
subject to: $\alpha_i \ge 0$ for all $i$, and $\sum_{i=1}^{R} \alpha_i y_i = 0$.
The weight vector is recovered from the solution as
$$w = \sum_{i:\,\alpha_i \ne 0} \alpha_i y_i\, x_i$$
where the points with $\alpha_i \ne 0$ are the support vectors (SV).

But what if the data is not linearly separable?

Idea 1: Find the minimum w.w while also minimizing the number of training set errors.
Problematic: two things to minimize makes for an ill-defined optimization.

Idea 1.1: Minimize w.w + C (#train errors).
There is a practical problem with this approach: it doesn't distinguish between disastrous errors and near misses.

Idea 1.2: Minimize w.w + C (distance of error points to their correct place).

Learning the Maximum Margin Classifier

Given a guess of w, b we can
• Compute the sum of distances of points to their correct zones
• Compute the margin width
Assume R data points, each (x_k, y_k) where y_k = +/-1.
What should our quadratic optimization criterion be?
How many constraints will we have?
What should they be?

Learning the Soft Margin Classifier

Given a guess of w, b we can
• Compute the sum of distances of points to their correct zones
• Compute the margin width
Assume R data points, each (x_k, y_k) where y_k = +/-1.
What should our quadratic optimization criterion be? Minimize
$$\frac{1}{2}\, w \cdot w \;+\; C \sum_{k=1}^{R} \varepsilon_k$$
How many constraints will we have? 2R.
What should they be?
w . x_k + b >= 1 - ε_k   if y_k = 1
w . x_k + b <= -1 + ε_k  if y_k = -1
ε_k >= 0   for all k
Our original (noiseless data) QP had m+1 variables: w_1, w_2, ..., w_m, and b.
Our new (noisy data) QP has m+1+R variables: w_1, w_2, ..., w_m, b, ε_1, ..., ε_R.

Controlling Soft Margin Separation

C controls the trade-off between margin and error.

The Dual QP

The dual is the same maximization as before, except the constraints become 0 <= α_k <= C and Σ_k α_k y_k = 0. As before, w = Σ_k α_k y_k x_k; data points with α_k > 0 will be the support vectors, so this sum only needs to be over the support vectors.

Soft Margin Dual Problem

There is one factor α_i for each training example:
0 < α_i < C : support vector with ε_i = 0
α_i = C : support vector with ε_i > 0
α_i = 0 : otherwise

SVM Training and Testing

Remember: dot products of instances appear in both training and testing.
Training: maximize
$$L = \sum_{i=1}^{R} \alpha_i \;-\; \frac{1}{2} \sum_{i=1}^{R} \sum_{j=1}^{R} \alpha_i \alpha_j\, y_i y_j\, x_i' x_j$$
Testing: classify a new point x* using
$$w \cdot x^* + b \;=\; \sum_{x_i \in SV} \alpha_i y_i\, (x_i \cdot x^*) + b$$

Example: suppose we are in 1 dimension

What would SVMs do with this data? Not a big surprise.

Harder 1-dimensional Dataset

What can be done about this?
Idea 1: use the soft margin and allow some errors.
Alternatively, let's project the data points to a 2-D space using non-linear basis functions. Now the data is linearly separable.

Learning in the Feature Space

Map the data into a feature space where it is linearly separable: x -> Φ(x)

Non-Linear SVMs: Feature Spaces

General idea: the original feature space can be mapped to another feature space (likely higher dimensional) where the training set is separable: x -> Φ(x)
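As a concrete illustration of the basis-function idea on the preceding slides (this example is not part of the lecture), here is a minimal sketch showing that a 1-D dataset which no single threshold can separate becomes linearly separable after the projection x -> (x, x²). It assumes NumPy and scikit-learn are available; the data values are invented for illustration.

```python
# Minimal sketch (not from the slides): 1-D data that is not linearly
# separable becomes separable after the basis expansion x -> (x, x^2).
# Assumes NumPy and scikit-learn are installed; the data is made up.
import numpy as np
from sklearn.svm import SVC

# Class -1 sits in the middle, class +1 at both ends, so no single
# threshold on x can separate the two classes in the original 1-D space.
x = np.array([-3.0, -2.5, -2.0, -0.5, 0.0, 0.5, 2.0, 2.5, 3.0])
y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1])

# Project every point to 2-D with the basis functions Phi(x) = (x, x^2).
phi = np.column_stack([x, x ** 2])

# A *linear* SVM in the new feature space now separates the data.
clf = SVC(kernel="linear", C=1.0).fit(phi, y)
print("training accuracy:", clf.score(phi, y))   # 1.0
print("support vectors:", clf.support_)          # indices of the SVs
```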
QP with Basis Functions

$$w = \sum_{k:\,\alpha_k > 0} \alpha_k y_k\, \Phi(x_k)$$
$$b = y_K (1 - \varepsilon_K) - w \cdot \Phi(x_K) \quad \text{for any } K \text{ with } \alpha_K > 0 \text{ (i.e., } x_K \in SV)$$
Classify with: f(x, w, b) = sign(w · Φ(x) + b)
This is sensible. Is that the end of the story? No... there's one more trick!

Quadratic Basis Functions

$$\Phi(x) = \left(1,\; \sqrt{2}\,x_1, \ldots, \sqrt{2}\,x_m,\; x_1^2, \ldots, x_m^2,\; \sqrt{2}\,x_1 x_2,\; \sqrt{2}\,x_1 x_3, \ldots, \sqrt{2}\,x_{m-1} x_m\right)$$
Number of terms (assuming m input dimensions) ≈ m²/2.
What are those √2's doing?
• You should be happy they do no harm.
• You'll find out why they're there soon.

QP with Basis Functions

With this basis we must do R²/2 dot products to get the matrix ready, and each dot product requires m²/2 additions and multiplications. The whole thing costs R²m²/4. This is computationally expensive!

Quadratic Dot Products

Let's look at another function of a and b: (a · b + 1)². Expanding Φ(a) · Φ(b) term by term and comparing, they're the same!
What's the complexity of each? m vs. m²: the kernel (a · b + 1)² takes O(m) operations, while the explicit expansion takes O(m²).

QP with Basis Functions

We must do R²/2 dot products to get this matrix ready, but each dot product now only requires m additions and multiplications.
(The Kernel Trick)
$$\Phi(x_k) \cdot \Phi(x_l) = K(x_k, x_l)$$
The matrix of these values is called the kernel (Gram) matrix.

Higher Order Polynomials

The same trick extends to higher-degree polynomial basis functions: the kernel (a · b + 1)^d computes the dot product in the degree-d polynomial feature space with only O(m) work.

QP with Basis Functions

What about testing? Is it expensive in the new feature space? No: classification also needs only kernel evaluations,
f(x, w, b) = sign(w · Φ(x) + b) = sign(Σ_{k ∈ SV} α_k y_k K(x_k, x) + b).
Example: 5-degree polynomial.

Overfitting? Curse of dimensionality?

• Mapping to a high dimensional space x -> Φ(x)
• Are we overfitting the training data?
• The use of Maximum Margin magically makes this not a problem.
• No matter what the basis function is, there are only up to R parameters: α_1, α_2, ..., α_R, and usually most are set to zero by the Maximum Margin.

SVM Kernel Functions

• K(a, b) = (a · b + 1)^d is an example of an SVM kernel function.
• Beyond polynomials there are other very high dimensional basis functions that can be made practical by finding the right kernel function.
• Radial-basis-style kernel function: K(a, b) = exp(-||a - b||² / (2σ²))
• Neural-net-style kernel function: K(a, b) = tanh(κ a · b - δ)
• String kernels, tree kernels, ...

Kernels

A kernel is a function that returns the value of the dot product between the images of its two arguments.
Given a function K, it is possible to verify that it is a kernel.
We can construct new kernels from simple ones.

SVM Performance

• Anecdotally they work very, very well indeed.
• There are a lot of empirical and theoretical studies.
• Ensemble ideas can be used to do multi-class classification.

What you should know

• Linear SVMs
• The definition of a maximum margin classifier
• What QP can do for you (for this class, you don't need to know how it does it)
• How Maximum Margin can be turned into a QP problem
• How we deal with noisy (non-separable) data
• How we permit non-linear boundaries
• How SVM kernel functions permit us to pretend we're working with ultra-high-dimensional basis functions
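To close, a small numerical check (not part of the original slides) of the kernel-trick identity discussed above: for the quadratic basis with the √2 factors, Φ(a) · Φ(b) equals (a · b + 1)². The sketch assumes NumPy is available; the helper name quadratic_phi is made up for this illustration.

```python
# Numerical check (not from the slides) of the kernel-trick identity:
# for Phi(x) = (1, sqrt(2)x_1, ..., sqrt(2)x_m, x_1^2, ..., x_m^2,
#               sqrt(2)x_1x_2, ..., sqrt(2)x_{m-1}x_m),
# the dot product Phi(a).Phi(b) equals (a.b + 1)^2.
# Explicit expansion costs O(m^2); the kernel costs O(m).
import numpy as np

def quadratic_phi(x):
    """Explicit quadratic basis expansion, including the sqrt(2) factors."""
    m = len(x)
    feats = [1.0]
    feats += [np.sqrt(2) * xi for xi in x]            # linear terms
    feats += [xi ** 2 for xi in x]                    # pure quadratic terms
    feats += [np.sqrt(2) * x[i] * x[j]                # cross terms
              for i in range(m) for j in range(i + 1, m)]
    return np.array(feats)

rng = np.random.default_rng(0)
a, b = rng.normal(size=5), rng.normal(size=5)

explicit = quadratic_phi(a) @ quadratic_phi(b)   # O(m^2) work
kernel = (a @ b + 1) ** 2                        # O(m) work
print(explicit, kernel)                          # equal up to round-off
assert np.isclose(explicit, kernel)
```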