A Simple Introduction to Support Vector Machines
Martin Law
Lecture for CSE 802
Department of Computer Science and Engineering
Michigan State University

Outline
- A brief history of SVM
- Large-margin linear classifiers
  - Linearly separable case
  - Non-linearly separable case
- Creating nonlinear classifiers: the kernel trick
- A simple example
- Discussion of SVM
- Conclusion

History of SVM
- SVM is related to statistical learning theory [3]
- SVM was first introduced in 1992 [1]
- SVM became popular because of its success in handwritten digit recognition
  - 1.1% test error rate for SVM, the same as the error rate of a carefully constructed neural network, LeNet-4
  - See Section 5.11 in [2] or the discussion in [3] for details
- SVM is now regarded as an important example of "kernel methods", one of the key areas in machine learning
  - Note: the meaning of "kernel" here is different from the "kernel" function used in Parzen windows

[1] B. E. Boser et al. A Training Algorithm for Optimal Margin Classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 144-152, Pittsburgh, 1992.
[2] L. Bottou et al. Comparison of classifier methods: a case study in handwritten digit recognition. Proceedings of the 12th IAPR International Conference on Pattern Recognition, vol. 2, pp. 77-82.
[3] V. Vapnik. The Nature of Statistical Learning Theory. 2nd edition, Springer, 1999.

What is a good Decision Boundary?
- Consider a two-class, linearly separable classification problem
- Many decision boundaries are possible!
  - The perceptron algorithm can be used to find such a boundary
  - Different algorithms have been proposed (DHS ch. 5)
- Are all decision boundaries equally good?
[Figure: Class 1 and Class 2 points separated by one of many possible decision boundaries]

Examples of Bad Decision Boundaries
[Figure: two boundaries that separate Class 1 from Class 2 but pass very close to the training points]

Large-margin Decision Boundary
- The decision boundary should be as far away from the data of both classes as possible
- We should maximize the margin, m
- The distance from the origin to the line w^T x = k is k / ||w||
[Figure: Class 1 and Class 2 separated by a hyperplane with margin m]

Finding the Decision Boundary
- Let {x_1, ..., x_n} be our data set and let y_i ∈ {1, -1} be the class label of x_i
- The decision boundary should classify all points correctly: y_i (w^T x_i + b) ≥ 1 for all i
- The decision boundary can be found by solving the following constrained optimization problem:
    minimize (1/2) ||w||^2
    subject to y_i (w^T x_i + b) ≥ 1, i = 1, ..., n
- This is a constrained optimization problem; solving it requires some new tools
- Feel free to ignore the following several slides; what is important is the constrained optimization problem above (a small numeric illustration is sketched below)
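To make the primal problem concrete, here is a minimal numeric sketch (not from the original slides; the toy data and the candidate w and b are made up for illustration). It checks the constraints y_i (w^T x_i + b) ≥ 1 for a hand-picked hyperplane and evaluates the resulting margin, which under this normalization is m = 2 / ||w||.

```python
import numpy as np

# Toy 2-D data: first two points are class 1 (y = +1), last two are class 2 (y = -1)
X = np.array([[2.0, 2.0], [3.0, 3.0],
              [0.0, 0.0], [1.0, 0.0]])
y = np.array([1, 1, -1, -1])

# A hand-picked candidate hyperplane w^T x + b = 0 (purely illustrative)
w = np.array([1.0, 1.0])
b = -2.5

# Primal feasibility: y_i (w^T x_i + b) >= 1 for every i
print("feasible:", np.all(y * (X @ w + b) >= 1))   # True for this (w, b)

# Margin of this hyperplane under the normalization above: m = 2 / ||w||
print("margin m =", 2.0 / np.linalg.norm(w))       # ~1.414
```

Minimizing (1/2) ||w||^2 over all feasible (w, b) is exactly what maximizes this margin.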
Recap of Constrained Optimization
- Suppose we want to minimize f(x) subject to g(x) = 0
- A necessary condition for x_0 to be a solution:
    ∇f(x_0) + λ ∇g(x_0) = 0
  where λ is the Lagrange multiplier
- For multiple constraints g_i(x) = 0, i = 1, ..., m, we need a Lagrange multiplier λ_i for each of the constraints

Recap of Constrained Optimization (cont.)
- The case of an inequality constraint g_i(x) ≤ 0 is similar, except that the Lagrange multiplier α_i should be nonnegative
- If x_0 is a solution to the constrained optimization problem
    minimize f(x) subject to g_i(x) ≤ 0, i = 1, ..., m
  there must exist α_i ≥ 0 for i = 1, ..., m such that x_0 satisfies
    ∇f(x_0) + Σ_i α_i ∇g_i(x_0) = 0
- The function f(x) + Σ_i α_i g_i(x) is also known as the Lagrangian; we want to set its gradient to 0

Back to the Original Problem
- The Lagrangian is
    L(w, b, α) = (1/2) w^T w − Σ_i α_i ( y_i (w^T x_i + b) − 1 )
  Note that ||w||^2 = w^T w
- Setting the gradient of L w.r.t. w and b to zero, we have
    w = Σ_i α_i y_i x_i   and   Σ_i α_i y_i = 0

The Dual Problem
- If we substitute w = Σ_i α_i y_i x_i into L, we have
    W(α) = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j
  (note that Σ_i α_i y_i = 0)
- This is a function of the α_i only

The Dual Problem (cont.)
- The new objective function is in terms of the α_i only
- It is known as the dual problem: if we know w, we know all α_i; if we know all α_i, we know w
- The original problem is known as the primal problem
- The objective function of the dual problem needs to be maximized!
- The dual problem is therefore:
    maximize W(α) = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j
    subject to α_i ≥ 0 and Σ_i α_i y_i = 0
  - α_i ≥ 0 comes from the properties of the Lagrange multipliers
  - Σ_i α_i y_i = 0 is the result of differentiating the original Lagrangian w.r.t. b

The Dual Problem (cont.)
- This is a quadratic programming (QP) problem
  - A global maximum of W(α) can always be found
- w can be recovered by w = Σ_i α_i y_i x_i

Characteristics of the Solution
- Many of the α_i are zero
  - w is a linear combination of a small number of data points
  - This "sparse" representation can be viewed as data compression, as in the construction of a kNN classifier
- The x_i with nonzero α_i are called support vectors (SV)
  - The decision boundary is determined only by the SVs
  - Let t_j (j = 1, ..., s) be the indices of the s support vectors. We can write
      w = Σ_j α_{t_j} y_{t_j} x_{t_j}
- For testing with a new datum z:
  - Compute w^T z + b = Σ_j α_{t_j} y_{t_j} (x_{t_j}^T z) + b and classify z as class 1 if the sum is positive, and class 2 otherwise
  - Note: w need not be formed explicitly

The Quadratic Programming Problem
- Many approaches have been proposed
  - LOQO, CPLEX, etc. (see http://www.numerical.rl.ac.uk/qp/qp.html)
- Most are "interior-point" methods
  - Start with an initial solution that can violate the constraints
  - Improve this solution by optimizing the objective function (a sketch using an off-the-shelf solver follows below)
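To illustrate the dual QP solution and the sparseness of the α_i, here is a minimal sketch (not part of the original slides; the toy data are made up). It uses scikit-learn's SVC rather than LOQO or CPLEX; SVC's dual_coef_ attribute stores α_i y_i for the support vectors only, and a very large C approximates the hard-margin problem on separable data.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data: class 1 (y = +1) vs. class 2 (y = -1)
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin QP on separable data
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# Most alpha_i are zero: the solver keeps only the support vectors
print("support vectors:\n", clf.support_vectors_)
print("alpha_i * y_i:", clf.dual_coef_)

# Recover w = sum_i alpha_i y_i x_i and b, as on the slides
w = (clf.dual_coef_ @ clf.support_vectors_).ravel()
b = clf.intercept_[0]
print("w =", w, "b =", b)

# Classify a new point z without forming w explicitly:
# sign( sum_j alpha_{t_j} y_{t_j} (x_{t_j}^T z) + b )
z = np.array([2.0, 3.0])
score = (clf.dual_coef_ @ (clf.support_vectors_ @ z)).item() + b
print("z is", "class 1" if score > 0 else "class 2")
```

Evaluating the classifier through the inner products x_{t_j}^T z, rather than through an explicit w, is exactly the form that the kernel trick generalizes later in the deck.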