Solving the optimal margin classifier

- Recall our optimization problem:

    max_{w,b}  1/‖w‖    s.t.  y_i (wᵀx_i + b) ≥ 1, ∀i

- This is equivalent to

    min_{w,b}  (1/2) wᵀw    s.t.  1 − y_i (wᵀx_i + b) ≤ 0, ∀i    (*)

- Write the Lagrangian:

    L(w, b, α) = (1/2) wᵀw − Σ_{i=1}^m α_i [ y_i (wᵀx_i + b) − 1 ]

- Recall that (*) can be reformulated as

    min_{w,b} max_{α_i ≥ 0} L(w, b, α)

  Now we solve its dual problem:

    max_{α_i ≥ 0} min_{w,b} L(w, b, α)

The Dual Problem

- We minimize L with respect to w and b first:

    ∇_w L(w, b, α) = w − Σ_{i=1}^m α_i y_i x_i = 0
    ∇_b L(w, b, α) = Σ_{i=1}^m α_i y_i = 0    (**)

  Note that the first condition implies:

    w = Σ_{i=1}^m α_i y_i x_i    (***)

- Plugging (***) back into L, and using (**), we have:

    L(w, b, α) = Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m y_i y_j α_i α_j (x_iᵀx_j)

The Dual Problem, cont.

- Now we have the following dual optimization problem:

    max_α  J(α) = Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m y_i y_j α_i α_j (x_iᵀx_j)
    s.t.  α_i ≥ 0,  i = 1, …, m,  and  Σ_{i=1}^m α_i y_i = 0

- This is, again, a quadratic programming problem, and a global maximum of the α_i can always be found.
- But what is the big deal? Note two things:
  1. w can be recovered from  w = Σ_{i=1}^m α_i y_i x_i  (see next)
  2. The data enter only through the "kernel"  x_iᵀx_j  (more later)

Support vectors

- Note the KKT complementarity condition

    α_i g_i(w) = 0,  i = 1, …, m

  It implies that only a few of the α_i can be nonzero!

  (Figure: training points from Class 1 and Class 2; in the example only α₁ = 0.8, α₆ = 1.4, and α₈ = 0.6 are nonzero, and α₂ = α₃ = α₄ = α₅ = α₇ = α₉ = α₁₀ = 0.)

- We call the training data points whose α_i are nonzero the support vectors (SV).

Support vector machines

- Once we have the Lagrange multipliers {α_i}, we can reconstruct the parameter vector w as a weighted combination of the training examples:

    w = Σ_{i∈SV} α_i y_i x_i

- For testing with a new data point z, compute

    wᵀz + b = Σ_{i∈SV} α_i y_i (x_iᵀz) + b

  and classify z as class 1 if the sum is positive, and class 2 otherwise.
- Note: w need not be formed explicitly.

Interpretation of support vector machines

- The optimal w is a linear combination of a small number of data points.
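The dual derivation above can be checked numerically. The sketch below is my own illustration, not from the lecture: for a toy two-point dataset (an assumed example) it maximizes the dual objective J(α) by brute-force grid search under the constraint Σ α_i y_i = 0, then recovers w from the stationarity condition w = Σ α_i y_i x_i. A real implementation would hand J(α) to a QP solver instead.

```python
import numpy as np

# Toy dataset (illustrative assumption): one positive and one negative example.
X = np.array([[1.0, 1.0],
              [-1.0, -1.0]])
y = np.array([1.0, -1.0])

# The constraint sum_i alpha_i y_i = 0 forces alpha_1 = alpha_2 = a here,
# so J(alpha) = sum_i alpha_i - (1/2) sum_ij y_i y_j alpha_i alpha_j x_i.x_j
# reduces to a function of a single scalar a >= 0.
G = (y[:, None] * y[None, :]) * (X @ X.T)  # matrix of y_i y_j x_i^T x_j

def J(a):
    alpha = np.array([a, a])
    return alpha.sum() - 0.5 * alpha @ G @ alpha

# Brute-force maximization over a grid (a QP solver would be used in practice).
grid = np.linspace(0.0, 1.0, 10001)
a_star = grid[np.argmax([J(a) for a in grid])]

# Recover w via the stationarity condition w = sum_i alpha_i y_i x_i.
alpha = np.array([a_star, a_star])
w = (alpha * y) @ X

print(a_star)  # ~0.25
print(w)       # ~[0.5, 0.5]; both points attain margin y_i w^T x_i = 1, so both are SVs
```

Here both α_i come out nonzero because both points sit on the margin; with more data, most α_i would vanish and only the support vectors would remain in the sum for w.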
- This "sparse" representation can be viewed as data compression, as in the construction of a kNN classifier.
- To compute the weights {α_i}, and to use support vector machines, we need to specify only the inner products (or kernel) x_iᵀx_j between the examples.
- We make decisions by comparing each new example z with only the support vectors:

    y* = sign( Σ_{i∈SV} α_i y_i (x_iᵀz) + b )

Non-linearly Separable Problems

- We allow "errors" ξ_i in classification; they are based on the output of the discriminant function wᵀx + b.
- Σ_i ξ_i approximates the number of misclassified samples.

  (Figure: points from Class 1 and Class 2 that cannot be separated by a hyperplane without error.)

Soft Margin Hyperplane

- Now we have a slightly different optimization problem:

    min_{w,b}  (1/2) wᵀw + C Σ_{i=1}^m ξ_i
    s.t.  y_i (wᵀx_i + b) ≥ 1 − ξ_i, ∀i
          ξ_i ≥ 0, ∀i

- The ξ_i are "slack variables" in the optimization.
- Note that ξ_i = 0 if there is no error for x_i.
- Σ_i ξ_i is an upper bound on the number of training errors.
- C is a tradeoff parameter between error and margin.

The Optimization Problem

- The dual of this new constrained optimization problem is

    max_α  J(α) = Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m y_i y_j α_i α_j (x_iᵀx_j)
    s.t.  0 ≤ α_i ≤ C,  i = 1, …, m,  and  Σ_{i=1}^m α_i y_i = 0

- This is very similar to the optimization problem in the linearly separable case, except that there is now an upper bound C on each α_i.
- Once again, a QP solver can be used to find the α_i.

Extension to a Non-linear Decision Boundary

- So far, we have only considered large-margin classifiers with a linear decision boundary.
- How do we generalize to a nonlinear boundary?
- Key idea: transform x_i to a higher-dimensional space to "make life easier":
  - Input space: the space where the points x_i live.
  - Feature space: the space of the φ(x_i) after transformation.
- Why transform?
  - A linear operation in the feature space is equivalent to a non-linear operation in the input space.
  - Classification can become easier with a proper transformation.
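The "why transform" point can be made concrete with the quadratic feature map used as the example in these notes. The sketch below (my own code) verifies numerically that the feature-space inner product φ(x)ᵀφ(x') collapses to the closed form (xᵀx' + 1)², so the 6-dimensional map never has to be carried out when only inner products are needed:

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map for x = [x1, x2] (the standard choice
    matching the degree-2 polynomial kernel)."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([x1**2, x2**2, s * x1 * x2, s * x1, s * x2, 1.0])

def K(x, xp):
    """Degree-2 polynomial kernel: computed entirely in the 2-D input space."""
    return (np.dot(x, xp) + 1.0) ** 2

rng = np.random.default_rng(0)
for _ in range(5):
    x, xp = rng.normal(size=2), rng.normal(size=2)
    # The explicit 6-D inner product equals the kernel evaluated in 2-D.
    assert np.isclose(phi(x) @ phi(xp), K(x, xp))
print("phi(x).phi(x') == (x.x' + 1)^2 for all tested pairs")
```

The computational point: K does O(d) work in the input space, while the explicit map costs O(d²) dimensions for degree 2 and grows combinatorially with the polynomial degree, which is exactly what the kernel trick avoids.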
Transforming the Data

- Computation in the feature space can be costly because it is high dimensional; the feature space is typically infinite-dimensional!
- The kernel trick comes to the rescue.

  (Figure: the map φ(·) sends points from the input space to the feature space; in practice the feature space has higher dimension than the input space.)

The Kernel Trick

- Recall the SVM optimization problem:

    max_α  J(α) = Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m y_i y_j α_i α_j (x_iᵀx_j)
    s.t.  0 ≤ α_i ≤ C,  i = 1, …, m,  and  Σ_{i=1}^m α_i y_i = 0

- The data points only appear as inner products.
- As long as we can calculate the inner product in the feature space, we do not need the mapping explicitly.
- Many common geometric operations (angles, distances) can be expressed by inner products.
- Define the kernel function K by

    K(x_i, x_j) = φ(x_i)ᵀφ(x_j)

An Example of Feature Mapping and Kernels

- Consider an input x = [x₁, x₂].
- Suppose φ(·) is given as follows:

    φ(x) = [ x₁², x₂², √2·x₁x₂, √2·x₁, √2·x₂, 1 ]ᵀ

- An inner product in the feature space is

    φ(x)ᵀφ(x') = (xᵀx' + 1)²

- So, if we define the kernel function as K(x, x') = (xᵀx' + 1)², there is no need to carry out φ(·) explicitly.

More Examples of Kernel Functions

- Linear kernel (we've seen it):

    K(x, x') = xᵀx'

- Polynomial kernel (we just saw an example):

    K(x, x') = (1 + xᵀx')ᵖ,  p = 2, 3, …

  To get the feature vectors, we concatenate all pth-order polynomial terms of the components of x (weighted appropriately).
- Radial basis kernel:

    K(x, x') = exp( −(1/2) ‖x − x'‖² )

  In this case the feature space consists of functions and results in a non-parametric classifier.

Kernelized SVM

- Training:

    max_α  J(α) = Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m y_i y_j α_i α_j K(x_i, x_j)
    s.t.  0 ≤ α_i ≤ C,  i = 1, …, m,  and  Σ_{i=1}^m α_i y_i = 0

- Using:

    y* = sign( Σ_{i∈SV} α_i y_i K(x_i, z) + b )

SVM Examples

(Figure: examples of SVM decision boundaries.)

Examples for Non-linear SVMs – Gaussian Kernel

(Figure: nonlinear decision boundaries obtained with a Gaussian kernel.)

Cross-validation Error

- The leave-one-out cross-validation error does not depend on the dimensionality of the feature space, but only on the number of support vectors:

    leave-one-out CV error ≤ (# support vectors) / (# training examples)

Machine Learning 10-701/15-781, Fall 2006
Boosting
Eric Xing
Lecture 9, October 10, 2006
Reading: Chap. 14.3, C.B. book

Rationale: Combination of Methods

- There is no algorithm that is always the most accurate.
- We can select simple "weak" classification or regression methods and combine them into a single "strong" method.
- Different learners use different:
  - algorithms
  - hyperparameters
  - representations (modalities)
  - training sets
  - subproblems
- The problem: how to combine them.

Some Early Algorithms

- Boosting by filtering (Schapire 1990):
  - Run the weak learner on differently filtered example sets.
  - Combine the weak hypotheses.
  - Requires knowledge of the performance of