Machine Learning 10-701/15-781, Fall 2006
Support Vector Machines
Eric Xing
Lecture 8, October 5, 2006
Reading: Chap. 6 & 7, C.B. book

Outline
- Maximum margin classification
- Constrained optimization
- Lagrangian duality
- Kernel trick
- Non-separable cases

What is a good Decision Boundary?
- Consider a binary classification task with $y = \pm 1$ labels (not 0/1 as before).
- When the training examples are linearly separable, we can set the parameters of a linear classifier so that all the training examples are classified correctly.
- Many decision boundaries!
  - Generative classifiers
  - Logistic regression, ...
- Are all decision boundaries equally good?

Examples of Bad Decision Boundaries
- Why might we get such boundaries?
  - Irregular distribution
  - Imbalanced training sizes
  - Outliers

Classification and Margin
- Parameterizing the decision boundary: let $w$ denote a vector orthogonal to the decision boundary, and $b$ a scalar "offset" term; then we can write the decision boundary as
  $w^T x + b = 0$
- Margin:
  $w^T x + b > 0$ for all $x$ in class 2
  $w^T x + b < 0$ for all $x$ in class 1
  or, more compactly, $y_i (w^T x_i + b) > 0$.
- The margin between two points is
  $m = (w^T x_i + b) - (w^T x_j + b) = w^T (x_i - x_j)$

Maximum Margin Classification
- The margin is: $m = w^T (x_i^* - x_j^*)$
- It makes sense to set a constraint on $w$: $\|w\| = 1$
- Here is our maximum margin classification problem:
  $\max_{w,b} \; m$ s.t. $y_i (w^T x_i + b) \ge m \;\; \forall i$, $\|w\| = 1$
- Equivalently, we can instead work on this:
  $\max_{w,b} \; \frac{m}{\|w\|}$ s.t. $y_i (w^T x_i + b) \ge m \;\; \forall i$

Maximum Margin Classification, cont'd.
- The optimization problem:
  $\max_{w,b} \; \frac{m}{\|w\|}$ s.t. $y_i (w^T x_i + b) \ge m \;\; \forall i$
- But note that the magnitude of $m$ merely scales $w$ and $b$, and does not change the classification boundary at all!
- So we instead work on this cleaner problem:
  $\max_{w,b} \; \frac{1}{\|w\|}$ s.t. $y_i (w^T x_i + b) \ge 1 \;\; \forall i$
- The solution to this leads to the famous Support Vector Machines, believed by many to be the best "off-the-shelf" supervised learning algorithm.

Support Vector Machine
- A convex quadratic programming problem with linear constraints:
  $\max_{w,b} \; \frac{1}{\|w\|}$ s.t. $y_i (w^T x_i + b) \ge 1 \;\; \forall i$
- The
attained margin is now given by $m = \frac{1}{\|w\|}$.
- Only a few of the classification constraints are relevant ⇒ support vectors
- Constrained optimization:
  - We can directly solve this using commercial quadratic programming (QP) code.
  - But we want to investigate Lagrange duality more carefully, and solve the problem above via its dual form.
    ⇒ deeper insight: support vectors, kernels, ...
    ⇒ more efficient algorithms

Lagrangian Duality
- The Primal Problem:
  $\min_w \; f(w)$ s.t. $g_i(w) \le 0, \; i = 1, \dots, k$ and $h_i(w) = 0, \; i = 1, \dots, l$
- The generalized Lagrangian:
  $\mathcal{L}(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{i=1}^{l} \beta_i h_i(w)$
  where the $\alpha$'s ($\alpha_i \ge 0$) and $\beta$'s are called the Lagrange multipliers.
- Lemma:
  $\max_{\alpha, \beta : \alpha_i \ge 0} \mathcal{L}(w, \alpha, \beta) = \begin{cases} f(w) & \text{if } w \text{ satisfies the primal constraints} \\ \infty & \text{otherwise} \end{cases}$
- A re-written Primal:
  $\min_w \max_{\alpha, \beta : \alpha_i \ge 0} \mathcal{L}(w, \alpha, \beta)$

Lagrangian Duality, cont.
- Recall the Primal Problem: $\min_w \max_{\alpha, \beta : \alpha_i \ge 0} \mathcal{L}(w, \alpha, \beta)$
- The Dual Problem: $\max_{\alpha, \beta : \alpha_i \ge 0} \min_w \mathcal{L}(w, \alpha, \beta)$
- Theorem (weak duality):
  $d^* = \max_{\alpha, \beta : \alpha_i \ge 0} \min_w \mathcal{L}(w, \alpha, \beta) \;\le\; \min_w \max_{\alpha, \beta : \alpha_i \ge 0} \mathcal{L}(w, \alpha, \beta) = p^*$
- Theorem (strong duality):
  $d^* = p^*$ if and only if there exists a saddle point of $\mathcal{L}(w, \alpha, \beta)$.

A sketch of strong and weak duality
- Now, ignoring $h(x)$ for simplicity, let's look at what's happening graphically in the duality theorems:
  $d^* = \max_{\alpha \ge 0} \min_w \left( f(w) + \alpha^T g(w) \right) \;\le\; \min_w \max_{\alpha \ge 0} \left( f(w) + \alpha^T g(w) \right) = p^*$
  [Figure: plots of $f(w)$ and $g(w)$ illustrating the duality gap]

The KKT conditions
- If there exists some saddle point of $\mathcal{L}$, then it satisfies the following "Karush-Kuhn-Tucker" (KKT) conditions:
  $\frac{\partial}{\partial w_i} \mathcal{L}(w, \alpha, \beta) = 0, \;\; i = 1, \dots, n$
  $\frac{\partial}{\partial \beta_i} \mathcal{L}(w, \alpha, \beta) = 0, \;\; i = 1, \dots, l$
  $\alpha_i g_i(w) = 0, \;\; i = 1, \dots, k$
  $g_i(w) \le 0, \;\; i = 1, \dots, k$
  $\alpha_i \ge 0, \;\; i = 1, \dots, k$
- Theorem: if $w^*$, $\alpha^*$ and $\beta^*$ satisfy the KKT conditions, then $w^*$ is a solution to the primal problem and $(\alpha^*, \beta^*)$ is a solution to the dual problem.

Solving the optimal margin classifier
- Recall our optimization problem:
  $\max_{w,b} \; \frac{1}{\|w\|}$ s.t. $y_i (w^T x_i + b) \ge 1 \;\; \forall i$
- This is equivalent to:
  $\min_{w,b} \; \frac{1}{2} w^T w$ s.t. $1 - y_i (w^T x_i + b) \le 0 \;\; \forall i$
- Write the Lagrangian:
  $\mathcal{L}(w, b, \alpha) = \frac{1}{2} w^T w - \sum_{i=1}^{m} \alpha_i \left[ y_i (w^T x_i + b) - 1 \right]$   (*)
- Recall that (*) can be reformulated as $\min_{w,b} \max_{\alpha_i \ge 0} \mathcal{L}(w, b, \alpha)$.
  Now we solve its dual problem: $\max_{\alpha_i \ge 0} \min_{w,b} \mathcal{L}(w, b, \alpha)$

The Dual Problem
- $\max_{\alpha_i \ge 0} \min_{w,b} \mathcal{L}(w, b, \alpha)$
- We minimize $\mathcal{L}$ with respect to $w$ and $b$
first:
  $\nabla_w \mathcal{L}(w, b, \alpha) = w - \sum_{i=1}^{m} \alpha_i y_i x_i = 0$   (**)
  $\frac{\partial}{\partial b} \mathcal{L}(w, b, \alpha) = \sum_{i=1}^{m} \alpha_i y_i = 0$   (***)
- Note that (**) implies:
  $w = \sum_{i=1}^{m} \alpha_i y_i x_i$
- Plugging (**) back into $\mathcal{L}$, and using (***), we have:
  $\mathcal{L}(w, b, \alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y_i y_j \alpha_i \alpha_j (x_i^T x_j)$

The Dual Problem, cont.
- Now we have the following dual optimization problem:
  $\max_\alpha \; J(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y_i y_j \alpha_i \alpha_j (x_i^T x_j)$
  s.t. $\alpha_i \ge 0, \; i = 1, \dots, m$, and $\sum_{i=1}^{m} \alpha_i y_i = 0$
- This is, again, a quadratic programming problem.
- A global maximum of the $\alpha_i$ can always be found.
- But what's the big deal?? Note two things:
  1. $w$ can be recovered by $w = \sum_{i=1}^{m} \alpha_i y_i x_i$ (see next ...)
  2. The "kernel" $x_i^T x_j$ (more later ...)

Support vectors
- Note the KKT condition: only a few $\alpha_i$'s can be nonzero!
  $\alpha_i g_i(w) = 0, \;\; i = 1, \dots, k$
  [Figure: 10 training points from two classes, with $\alpha_1 = 0.8$, $\alpha_6 = 1.4$, $\alpha_8 = 0.6$, and all other $\alpha_i = 0$]
- Call the training data points whose $\alpha_i$'s are nonzero the support vectors (SV).

Support vector machines
- Once we have the Lagrange multipliers $\{\alpha_i\}$, we can reconstruct the parameter vector $w$ as a weighted combination of the training examples:
  $w = \sum_{i \in SV} \alpha_i y_i x_i$
- For testing with a new data point $z$:
  compute $w^T z + b = \sum_{i \in SV} \alpha_i y_i (x_i^T z) + b$ and classify $z$ as class 1 if the sum is positive, and class 2 otherwise.
- Note: $w$ need not be formed explicitly.

Interpretation of support vector machines
- The optimal $w$ is a linear combination of a small number of data points.
This "sparse" representation can be viewed as data compression, as in the construction of a kNN classifier.
- To compute the weights $\{\alpha_i\}$, and to use support vector machines, we need to specify only the inner products (or kernel) $x_i^T x_j$ between the examples.
- We make decisions by comparing each new example $z$ with only the support vectors:
  $y^* = \operatorname{sign}\left( \sum_{i \in SV} \alpha_i y_i (x_i^T z) + b \right)$

Non-linearly Separable Problems
- We allow "errors" $\xi_i$ in classification; $\xi_i$ is based on the output of the discriminant function $w^T x + b$.
- $\xi_i$ approximates the number of misclassified samples.

Soft Margin Hyperplane
- Now we have a slightly different optimization problem:
  $\min_{w,b} \; \frac{1}{2} w^T w + C \sum_{i=1}^{m} \xi_i$
  s.t. $y_i (w^T x_i + b) \ge 1 - \xi_i \;\; \forall i$, and $\xi_i \ge 0 \;\; \forall i$
- $\xi_i$ are "slack variables" in the optimization.
- Note that $\xi_i = 0$ if there is no error for $x_i$.
- $\xi_i$ is an upper bound on the number of errors.
- $C$: tradeoff parameter between error and margin.

The Optimization
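The soft-margin objective above can be explored numerically. The slides solve it as a quadratic program (typically via the dual); purely as an illustrative sketch, the same objective $\frac{1}{2}\|w\|^2 + C \sum_i \max(0,\, 1 - y_i(w^T x_i + b))$ can also be minimized by subgradient descent on its hinge-loss form. The toy data, learning rate, and epoch count below are made-up choices for illustration, not part of the lecture.

```python
# Sketch: subgradient descent on the soft-margin objective
#   (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b)).
# This is an alternative to the QP solution discussed in the slides;
# the toy data and hyperparameters are illustrative assumptions.

def soft_margin_svm(X, y, C=1.0, lr=0.01, epochs=500):
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        # Gradient of the regularizer (1/2)||w||^2 is w itself.
        gw, gb = list(w), 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # hinge active: subgradient is -C * y_i * x_i
                for j in range(d):
                    gw[j] -= C * yi * xi[j]
                gb -= C * yi
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b

# Toy linearly separable data (illustrative):
X = [(2.0, 2.0), (1.0, 2.0), (-1.0, -1.0), (-2.0, -1.0)]
y = [1, 1, -1, -1]
w, b = soft_margin_svm(X, y)
preds = [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else -1
         for xi in X]
```

On this separable toy set, all four training points end up on the correct side of the learned boundary, i.e. `preds` matches `y`; with a smaller `C`, the optimizer would tolerate slack $\xi_i > 0$ in exchange for a shorter $w$.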
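The test-time rule from the "Interpretation of support vector machines" slide, $y^* = \operatorname{sign}(\sum_{i \in SV} \alpha_i y_i (x_i^T z) + b)$, uses only inner products with the support vectors, which is what makes the kernel trick from the outline possible. Here is a minimal sketch of that computation; the support vectors, multipliers, offset, and the RBF kernel choice are made-up illustrative values, not taken from the lecture.

```python
import math

def linear_kernel(x, z):
    # Plain inner product x^T z, as in the dual formulation on the slides.
    return sum(a * c for a, c in zip(x, z))

def rbf_kernel(x, z, gamma=0.5):
    # K(x, z) = exp(-gamma * ||x - z||^2); a common nonlinear kernel,
    # shown here only to hint at the kernel trick (illustrative choice).
    return math.exp(-gamma * sum((a - c) ** 2 for a, c in zip(x, z)))

def svm_predict(svs, ys, alphas, b, z, kernel=linear_kernel):
    # f(z) = sum_{i in SV} alpha_i * y_i * K(x_i, z) + b; class = sign(f).
    f = b + sum(a * yi * kernel(x, z) for x, yi, a in zip(svs, ys, alphas))
    return 1 if f > 0 else -1

# Made-up support vectors and multipliers, purely for illustration:
svs = [(1.0, 1.0), (-1.0, -1.0)]
ys = [1, -1]
alphas = [0.5, 0.5]   # note sum_i alpha_i y_i = 0, as the dual requires
b = 0.0
```

For example, with the linear kernel the decision value at $z = (2, 1)$ is $0.5(z_1 + z_2) + 0.5(z_1 + z_2) = 3 > 0$, so `svm_predict` returns class 1; swapping in `rbf_kernel` changes only the inner-product computation, never the prediction rule itself.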