Rutgers University CS 536 - Introduction to Support Vector Machines

Introduction to Support Vector Machines
CS 536: Machine Learning
Littman (Wu, TA)

Administration
• Slides borrowed from Martin Law (from the web).

Outline
• History of support vector machines (SVM)
• Two classes, linearly separable
  – What is a good decision boundary?
• Two classes, not linearly separable
• How to make SVM non-linear: the kernel trick
• Demo of SVM
• Epsilon support vector regression (ε-SVR)
• Conclusion

History of SVM
• SVM is a classifier derived from statistical learning theory by Vapnik and Chervonenkis.
• SVM was first introduced in COLT-92.
• SVM became famous when, using pixel maps as input, it gave accuracy comparable to neural networks with hand-designed features on a handwriting recognition task.
• Currently, SVM is closely related to:
  – kernel methods, large-margin classifiers, reproducing kernel Hilbert spaces, Gaussian processes, and boosting.

Linearly Separable Case
[Figure: two classes of points with several candidate separating lines]
• Many decision boundaries can separate these two classes.
• Which one should we choose?

Bad Decision Boundaries
[Figure: two boundaries that separate the classes but pass very close to the data]

Margin Should Be Large
• The decision boundary should be as far away from the data as possible.
  – We should maximize the margin, m.
[Figure: separating boundary between the two classes with the margin m marked]

The Optimization Problem
• Let {x_1, ..., x_n} be our data set and let y_i ∈ {1, −1} be the class label of x_i.
• The decision boundary should classify all points correctly:
    y_i (w^T x_i + b) ≥ 1  for all i
• This gives a constrained optimization problem:
    minimize (1/2) ||w||^2  subject to  y_i (w^T x_i + b) ≥ 1 for all i

The Optimization Problem
• We can transform the problem to its dual:
    maximize  W(α) = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i^T x_j)
    subject to  α_i ≥ 0  and  Σ_i α_i y_i = 0
• This is a quadratic programming (QP) problem.
  – The global maximum of the α_i can always be found.
• w can be recovered by  w = Σ_i α_i y_i x_i.

Characteristics of the Solution
• Many of the α_i are zero.
  – w is a linear combination of a small number of examples.
  – Sparse representation.
• The x_i with non-zero α_i are called support vectors (SV).
  – The decision boundary is determined only by the SVs.
  – Let t_j (j = 1, ..., s) be the indices of the s support vectors. We can write
      w = Σ_{j=1..s} α_{t_j} y_{t_j} x_{t_j}
• For testing with a new data point z:
  – Compute  w^T z + b = Σ_{j=1..s} α_{t_j} y_{t_j} (x_{t_j}^T z) + b
    and classify z as class 1 if the sum is positive, and class 2 otherwise.

A Geometric Interpretation
[Figure: ten training points; only the support vectors get non-zero multipliers (α_1 = 0.8, α_6 = 1.4, α_8 = 0.6), all other α_i = 0]

Some Notes
• There are PAC-type bounds on the error on unseen data for SVM, as a function of the margin.
  – The larger the margin, the tighter the bound.
  – The smaller the number of SVs, the tighter the bound.
• Note that in both training and testing, the data are referenced only through inner products, x^T y.
  – This is important for generalizing to the non-linear case.

Non-Linearly Separable Case
• We allow "errors" ξ_i in classification.
[Figure: overlapping classes; misclassified or in-margin points incur slack ξ_i]

Soft Margin Hyperplane
• Define ξ_i = 0 if there is no error for x_i.
  – The ξ_i are just "slack variables" in optimization theory.
  – The constraints become  y_i (w^T x_i + b) ≥ 1 − ξ_i  with  ξ_i ≥ 0.
• We want to minimize
    (1/2) ||w||^2 + C Σ_i ξ_i
  – C: tradeoff parameter between error and margin.
• The optimization problem becomes
    minimize (1/2) ||w||^2 + C Σ_i ξ_i  subject to  y_i (w^T x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0

The Optimization Problem
• The dual of the problem is
    maximize  W(α) = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i^T x_j)
    subject to  0 ≤ α_i ≤ C  and  Σ_i α_i y_i = 0
• w is also recovered as  w = Σ_i α_i y_i x_i.
• The only difference from the linearly separable case is the upper bound C on the α_i.
• Once again, a QP solver can be used to find the α_i (a short code sketch illustrating these quantities appears after the next slide).

Extension to Non-linear
• Key idea: transform x_i to a higher-dimensional space to "make life easier".
  – Input space: the space containing the x_i.
  – Feature space: the space of the φ(x_i) after transformation.
• Why transform?
  – A linear operation in the feature space is equivalent to a non-linear operation in the input space.
  – The classification task can be "easier" with a proper transformation. Example: XOR.
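Before moving to the non-linear case, here is a minimal code sketch (not part of the original slides) of the linear soft-margin machinery described above, using scikit-learn's SVC, which solves the same dual QP internally. The toy 2-D data and the value of C are made up for illustration.

```python
# Minimal sketch: fit a linear soft-margin SVM and inspect the dual quantities.
# Assumes numpy and scikit-learn are installed; the data below is hypothetical.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],   # class +1
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=100.0)  # C = trade-off between error and margin
clf.fit(X, y)

# The x_i with non-zero alpha_i are the support vectors
print("support vectors:\n", clf.support_vectors_)
# dual_coef_ stores alpha_i * y_i for the support vectors
print("alpha_i * y_i:", clf.dual_coef_)

# For a linear kernel, w = sum_i alpha_i y_i x_i and b are exposed directly
w, b = clf.coef_[0], clf.intercept_[0]

# Classify a new point z by the sign of w^T z + b
z = np.array([3.0, 3.0])
fz = float(w @ z + b)
print(f"f(z) = {fz:.3f} -> class {+1 if fz > 0 else -1}")
```

With a linear kernel, coef_ exposes w explicitly; with a non-linear kernel only the dual representation (support vectors and α_i y_i) is available, which is exactly why the testing rule below is written in terms of K(x_{t_j}, z) rather than w.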
Extension to Non-linear
• Possible problems with the transformation:
  – High computational burden, and it is hard to get a good estimate.
• SVM solves these two issues simultaneously:
  – The kernel trick gives efficient computation.
  – Minimizing ||w||^2 can lead to a "good" classifier.
[Figure: φ(·) maps the training points from the input space to the feature space]

Example Transformation
• Define the kernel function K(x, y) as
    K(x, y) = (x^T y + 1)^2
• Consider the corresponding transformation φ(·) into the space of degree-2 monomials of the input coordinates.
• The inner product in that space can be computed by K without going through the map φ(·):
    φ(x)^T φ(y) = (x^T y + 1)^2 = K(x, y)
  (a short numeric check of this identity appears at the end of these notes).

Kernel Trick
• The relationship between the kernel function K and the mapping φ(·) is
    K(x, y) = φ(x)^T φ(y)
  – This is known as the kernel trick.
• In practice, we specify K, thereby specifying φ(·) indirectly, instead of choosing φ(·).
• Intuitively, K(x, y) represents our desired notion of similarity between data x and y, and this comes from our prior knowledge.
• K(x, y) needs to satisfy a technical condition (the Mercer condition) in order for φ(·) to exist.

Example Kernel Functions
• Polynomial kernel with degree d:
    K(x, y) = (x^T y + 1)^d
• Radial basis function (RBF) kernel with width σ:
    K(x, y) = exp(−||x − y||^2 / (2σ^2))
  – Closely related to radial basis function neural networks.
• Sigmoid with parameters κ and θ:
    K(x, y) = tanh(κ x^T y + θ)
  – It does not satisfy the Mercer condition for all κ and θ.
• Research on different kernel functions for different applications is very active.

Handwriting Recognition
[Figure: handwritten digit recognition example]

Using Kernel Functions
• Change all inner products to kernel functions.
• For training:
  – Original:             maximize Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i^T x_j)
  – With kernel function: maximize Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j)

Using Kernel Functions
• For testing, the new data point z is classified as class 1 if f ≥ 0, and as class 2 if f < 0.
  – Original:             f(z) = Σ_{j=1..s} α_{t_j} y_{t_j} (x_{t_j}^T z) + b
  – With kernel function: f(z) = Σ_{j=1..s} α_{t_j} y_{t_j} K(x_{t_j}, z) + b

Example
• Suppose we have five 1-D data points:
  – x1 = 1, x2 = 2, x3 = 4, x4 = 5, x5 = 6, with 1, 2, 6 as class 1 and 4, 5 as class 2, so y1 = 1, y2 = 1, y3 = −1, y4 = −1, y5 = 1.
• We use the polynomial kernel of degree 2:
  – K(x, y) = (xy + 1)^2
  – C is set to 100.
• We first find the α_i (i = 1, ..., 5) by solving
    maximize Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i x_j + 1)^2
    subject to 0 ≤ α_i ≤ 100 and Σ_i α_i y_i = 0

Example
• Using a QP solver, we get
  – α_1 = 0, α_2 = 2.5, α_3 = 0, α_4 = 7.333, α_5 = 4.833.
  – Note that the constraints are indeed satisfied.
  – The support vectors are {x2 = 2, x4 = 5, x5 = 6}.
• The discriminant function is
    f(z) = 2.5 (2z + 1)^2 − 7.333 (5z + 1)^2 + 4.833 (6z + 1)^2 + b
         = 0.6667 z^2 − 5.333 z + b
• b is recovered by solving f(2) = 1, f(5) = −1, or f(6) = 1, since x2, x4, x5 lie on the margin; all give b = 9 (a short script reproducing this calculation appears after the last slide).

Example
[Figure: value of the discriminant function at z = 1, 2, 4, 5, 6; it is positive (class 1) at 1, 2, and 6 and negative (class 2) at 4 and 5]

Multi-class Classification
• SVM is basically a two-class classifier.
• One can change the QP formulation to allow multi-class classification.
• More commonly, the data set is divided into two parts "intelligently" in different ways, and a separate SVM is trained for each way of dividing it.
• Multi-class classification is then done by combining the outputs of all the SVM classifiers:
  – Majority rule
  – Error-correcting codes
  – Directed acyclic graph

Software
• A list of SVM implementations can be found at http://www.kernel-machines.org/software.html
• Some implementations (such as LIBSVM) can handle multi-class classification.
• SVMLight is among the earliest implementations of SVM.
• Several Matlab toolboxes for SVM are also available.

Steps for Classification
• Prepare the pattern matrix.
• Select the kernel function to use.
• Select …
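The 1-D worked example above can be checked with a few lines of code. This sketch (not from the slides) plugs the quoted α values into f(z) = Σ_j α_{t_j} y_{t_j} K(x_{t_j}, z) + b; because the αs are quoted to three decimals, the recovered b and the values of f only approximately match the slide's round numbers.

```python
# Reproduce the 1-D polynomial-kernel example: K(x, z) = (x*z + 1)^2, C = 100.
def K(x, z):
    return (x * z + 1) ** 2

# Support vectors (x_i, y_i, alpha_i) as quoted on the slide: x2=2, x4=5, x5=6
svs = [(2, +1, 2.5), (5, -1, 7.333), (6, +1, 4.833)]

def f_without_b(z):
    return sum(a * y * K(x, z) for x, y, a in svs)

# x2 lies on the margin, so f(2) = 1 determines b (approximately 9, as on the slide)
b = 1 - f_without_b(2)
print("b =", round(b, 3))

def f(z):
    # Discriminant, approximately 0.6667*z^2 - 5.333*z + 9
    return f_without_b(z) + b

for z in [1, 2, 4, 5, 6]:
    print(f"f({z}) = {f(z):7.3f} -> class {1 if f(z) >= 0 else 2}")
```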


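Finally, a small numeric check of the kernel-trick identity from the "Example Transformation" and "Kernel Trick" slides above. The explicit map φ below is one standard choice for the degree-2 polynomial kernel on 2-D inputs; the exact map shown on the original slide is not visible in this text version, so treat φ here as an assumption chosen so that φ(x)^T φ(y) = (x^T y + 1)^2.

```python
# Check that the degree-2 polynomial kernel equals an inner product in feature space,
# using one possible feature map phi (an assumption; the slide's exact map is not shown here).
import numpy as np

def K(x, y):
    # Kernel evaluated directly in the input space
    return (float(x @ y) + 1.0) ** 2

def phi(x):
    # Degree-<=2 monomials of a 2-D point, with sqrt(2) scaling so that
    # phi(x) . phi(y) = (x . y + 1)^2 exactly
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print("K(x, y)         =", K(x, y))                   # computed in input space
print("phi(x).phi(y)   =", float(phi(x) @ phi(y)))    # same value, via the feature space
```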