CS 559: Machine Learning Fundamentals and Applications, 9th Set of Notes


Instructor: Philippos Mordohai
Webpage: www.cs.stevens.edu/~mordohai
E-mail: [email protected]
Office: Lieb 215

Outline
• Support Vector Machines (SVM)
  – Introduction
  – Linear Discriminant
    • Linearly Separable Case
    • Linearly Non-Separable Case
  – Kernel Trick
    • Non-Linear Discriminant
• DHS Chapter 5
• Slides based on Olga Veksler's

SVMs
• One of the most important developments in pattern recognition in the last 10 years
• Elegant theory
  – Has good generalization properties
• Have been applied to diverse problems very successfully

Linear Discriminant Functions
• A discriminant function is linear if it can be written as g(x) = w^t x + w_0
• Which separating hyperplane should we choose?
• Training data is just a subset of all possible data
  – Suppose the hyperplane is close to sample x_i
  – If we see a new sample close to x_i, it may be on the wrong side of the hyperplane
  – Poor generalization (performance on unseen data)
• Instead, place the hyperplane as far as possible from any sample
  – New samples close to the old samples will then be classified correctly
  – Good generalization

SVM
• Idea: maximize the distance to the closest example
• For the optimal hyperplane, the distance to the closest negative example equals the distance to the closest positive example

SVM: Linearly Separable Case
• SVM: maximize the margin
• The margin is twice the absolute value of the distance b of the closest example to the separating hyperplane
• Better generalization (performance on test data)
  – in practice
  – and in theory
• Support vectors are the samples closest to the separating hyperplane
  – They are the most difficult patterns to classify
• The optimal hyperplane is completely defined by the support vectors
  – Of course, we do not know which samples are support vectors without finding the optimal hyperplane
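The snippet below is not part of the original slides. It is a minimal sketch, assuming NumPy and scikit-learn are available, of the claim above that the optimal hyperplane is completely defined by the support vectors: a linear SVC with a very large C stands in for the hard-margin classifier, and refitting on the support vectors alone should recover the same (w, w_0).

    # Sketch (not from the notes): the hyperplane depends only on the support vectors.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-2.0, 0.4, size=(20, 2)),   # class -1 cluster
                   rng.normal(+2.0, 0.4, size=(20, 2))])  # class +1 cluster
    z = np.array([-1] * 20 + [+1] * 20)

    # A very large C approximates the hard-margin (linearly separable) SVM.
    clf = SVC(kernel="linear", C=1e6).fit(X, z)
    print("support vectors (samples closest to the hyperplane):")
    print(clf.support_vectors_)

    # Refit using only the support vectors: the hyperplane (w, w_0) is unchanged.
    sv_idx = clf.support_
    clf_sv = SVC(kernel="linear", C=1e6).fit(X[sv_idx], z[sv_idx])
    print("w from all samples    :", clf.coef_[0],    " w_0:", clf.intercept_[0])
    print("w from support vectors:", clf_sv.coef_[0], " w_0:", clf_sv.intercept_[0])

Dropping any non-support-vector sample leaves the learned boundary untouched, which is exactly why only the support vectors matter.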
SVM: Formula for the Margin
• The absolute distance between x and the boundary g(x) = 0 is |g(x)| / ||w||
• This distance is unchanged if the hyperplane is rescaled: (c w, c w_0) defines the same boundary for any c > 0
• For uniqueness, let x_i be an example closest to the boundary and set |w^t x_i + w_0| = 1
• Now the largest-margin hyperplane is unique
• With this normalization, the distance from the closest sample x_i to g(x) = 0 is |w^t x_i + w_0| / ||w|| = 1 / ||w||
• Thus the margin is m = 2 / ||w||

SVM: Optimal Hyperplane
• Maximize the margin m = 2 / ||w||
• Subject to the constraints z_i (w^t x_i + w_0) >= 1 for all samples x_i, where z_i = +1 or -1 is the class label of x_i
• Equivalently, the problem can be converted to: minimize J(w) = (1/2) ||w||^2 subject to the same constraints
• J(w) is a quadratic function, thus there is a single global minimum
• Using the Kuhn-Tucker theorem, the problem can be converted to its dual: maximize
    L_D(a) = sum_i a_i - (1/2) sum_i sum_j a_i a_j z_i z_j x_j^t x_i
  subject to a_i >= 0 and sum_i a_i z_i = 0
• a = {a_1, ..., a_n} are new variables, one for each sample
• The dual is optimized by quadratic programming
• After finding the optimal a = {a_1, ..., a_n}, the final discriminant function is
    g(x) = sum_{i in S} a_i z_i x_i^t x + w_0
  where S is the set of support vectors (the samples with a_i > 0)
• L_D(a) depends on the number of samples, not on the dimension; samples appear only through the dot products x_j^t x_i
• This will become important when looking for a nonlinear discriminant function, as we will see soon

SVM: Non-Separable Case
• Data is most likely not linearly separable, but a linear classifier may still be appropriate
• SVM can be applied in the non-linearly-separable case
• Data should be "almost" linearly separable for good performance
• Use slack variables ξ_1, ..., ξ_n (one for each sample)
• Change the constraints from z_i (w^t x_i + w_0) >= 1 to z_i (w^t x_i + w_0) >= 1 - ξ_i
• ξ_i is a measure of deviation from the ideal for x_i
  – ξ_i > 1: x_i is on the wrong side of the separating hyperplane
  – 0 < ξ_i < 1: x_i is on the right side of the separating hyperplane but within the region of maximum margin
  – ξ_i < 0: the ideal case for x_i

Nonlinear Mapping
• Cover's theorem: "a pattern-classification problem cast in a high dimensional space non-linearly is more likely to be linearly separable than in a low-dimensional space"
• Example: a one-dimensional data set that is not linearly separable
• Lift it to a two-dimensional space with φ(x) = (x, x^2)
• To solve a nonlinear classification problem with a linear classifier:
  1. Project the data x to a high dimension using a function φ(x)
  2. Find a linear discriminant function for the transformed data φ(x)
  3. The final nonlinear discriminant function is g(x) = w^t φ(x) + w_0
• In 2D (the lifted space), the discriminant function is linear
• In 1D (the original space), the discriminant function is not linear
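The following sketch is also not part of the original slides. It illustrates the lift φ(x) = (x, x^2) described above on a toy 1D problem (the labeling rule |x| > 1 is an assumption made here for illustration), again using NumPy and scikit-learn: no threshold on x alone separates the classes, but a linear SVM in the lifted 2D space does.

    # Sketch (not from the notes): lifting 1D data with phi(x) = (x, x^2).
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)
    x = rng.uniform(-3.0, 3.0, size=200)       # 1D inputs
    z = np.where(np.abs(x) > 1.0, 1, -1)       # toy labels: +1 outside [-1, 1]

    # Not separable by any single threshold on x, but linearly separable
    # in the lifted space phi(x) = (x, x^2), since the classes split on x^2.
    phi = np.column_stack([x, x ** 2])

    clf = SVC(kernel="linear", C=1e6).fit(phi, z)
    print("training accuracy in lifted space:", clf.score(phi, z))  # expected 1.0
    w, w0 = clf.coef_[0], clf.intercept_[0]
    print("g(x) = %.3f*x + %.3f*x^2 + %.3f" % (w[0], w[1], w0))
    # Read back in the original 1D space, g is quadratic in x: a linear
    # discriminant in 2D, a nonlinear discriminant in 1D.

This mirrors the three-step recipe above: project with φ, fit a linear discriminant in the lifted space, and interpret the result as a nonlinear discriminant in the original space.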
Nonlinear SVM
• Any linear classifier can be used after lifting the data to a higher-dimensional space; however, we then have to deal with the curse of dimensionality
  – Poor generalization to test data
  – Computationally expensive
• SVM avoids the curse-of-dimensionality problems
  – Enforcing the largest margin permits good generalization
    • It can be shown that generalization in SVM is a function of the margin, independent of the dimensionality
  – Computation in the higher-dimensional case is performed only implicitly, through the use of kernel functions

Kernels
• SVM optimization: maximize L_D(a) = sum_i a_i - (1/2) sum_i sum_j a_i a_j z_i z_j x_i^t x_j
• Note that this optimization depends on the samples x_i only through the dot products x_i^t x_j
• If we lift x_i to a high dimension using φ(x), we need to compute the high-dimensional products φ(x_i)^t φ(x_j): maximize L_D(a) = sum_i a_i - (1/2) sum_i sum_j a_i a_j z_i z_j φ(x_i)^t φ(x_j)
• Idea: find a kernel function K(x_i, x_j) such that K(x_i, x_j) = φ(x_i)^t φ(x_j)

Kernel Trick
• Then we only need to compute K(x_i, x_j) instead of φ(x_i)^t φ(x_j)
• "Kernel trick": we do not need to perform operations in the high-dimensional space explicitly

Kernel Example
• Suppose we have two features and K(x, y) = (x^t y)^2
• Which mapping φ(x) does this correspond to?

Choice of Kernel
• How do we choose the kernel function K(x_i, x_j)?
  – K(x_i, x_j) should correspond to φ(x_i)^t φ(x_j) in some higher-dimensional space
  – Mercer's condition tells us which kernel functions can be expressed as a dot product in some space
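To make the kernel example above concrete (this check is not part of the preview): for two features, the standard mapping corresponding to K(x, y) = (x^t y)^2 is φ(x) = (x_1^2, sqrt(2) x_1 x_2, x_2^2). The NumPy sketch below verifies the identity K(x, y) = φ(x)^t φ(y) numerically.

    # Sketch (not from the notes): verify (x^t y)^2 == phi(x)^t phi(y) for 2D x, y.
    import numpy as np

    def phi(x):
        # Explicit 3D lift corresponding to the quadratic kernel K(x, y) = (x^t y)^2.
        return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

    rng = np.random.default_rng(2)
    for _ in range(5):
        x, y = rng.normal(size=2), rng.normal(size=2)
        k_trick = (x @ y) ** 2      # kernel value, computed in the original 2D space
        k_lift = phi(x) @ phi(y)    # same value via the explicit 3D lift
        assert np.isclose(k_trick, k_lift)
        print("K(x, y) = %.6f   phi(x)^t phi(y) = %.6f" % (k_trick, k_lift))

The kernel trick is exactly this identity used inside L_D(a): every dot product x_i^t x_j is replaced by K(x_i, x_j), so φ(x) is never formed explicitly.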

