Solving the optimal margin classifier

- Recall our optimization problem:

    max_{w,b}  1/‖w‖    s.t.  y_i (wᵀx_i + b) ≥ 1, ∀i

- This is equivalent to

    min_{w,b}  (1/2) wᵀw    s.t.  1 − y_i (wᵀx_i + b) ≤ 0, ∀i    (*)

- Write the Lagrangian:

    L(w, b, α) = (1/2) wᵀw − Σ_{i=1}^m α_i [ y_i (wᵀx_i + b) − 1 ]

- Recall that (*) can be reformulated as

    min_{w,b} max_{α_i ≥ 0} L(w, b, α)

  Now we solve its dual problem:

    max_{α_i ≥ 0} min_{w,b} L(w, b, α)

The Dual Problem

- We minimize L with respect to w and b first:

    ∇_w L(w, b, α) = w − Σ_{i=1}^m α_i y_i x_i = 0
    ∇_b L(w, b, α) = Σ_{i=1}^m α_i y_i = 0    (**)

  Note that the first condition implies:

    w = Σ_{i=1}^m α_i y_i x_i    (***)

- Plugging (***) back into L, and using (**), we have:

    L(w, b, α) = Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m y_i y_j α_i α_j (x_iᵀx_j)

The Dual Problem, cont.

- Now we have the following dual optimization problem:

    max_α  J(α) = Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m y_i y_j α_i α_j (x_iᵀx_j)
    s.t.  α_i ≥ 0,  i = 1, …, m,  and  Σ_{i=1}^m α_i y_i = 0

- This is, again, a quadratic programming problem, and a global maximum of the α_i can always be found.
- But what is the big deal? Note two things:
  1. w can be recovered from  w = Σ_{i=1}^m α_i y_i x_i  (see next)
  2. The data enter only through the "kernel"  x_iᵀx_j  (more later)

Support vectors

- Note the KKT complementarity condition

    α_i g_i(w) = 0,  i = 1, …, m

  It implies that only a few of the α_i can be nonzero!

  (Figure: training points from Class 1 and Class 2; in the example only α₁ = 0.8, α₆ = 1.4, and α₈ = 0.6 are nonzero, and α₂ = α₃ = α₄ = α₅ = α₇ = α₉ = α₁₀ = 0.)

- We call the training data points whose α_i are nonzero the support vectors (SV).

Support vector machines

- Once we have the Lagrange multipliers {α_i}, we can reconstruct the parameter vector w as a weighted combination of the training examples:

    w = Σ_{i∈SV} α_i y_i x_i

- For testing with a new data point z, compute

    wᵀz + b = Σ_{i∈SV} α_i y_i (x_iᵀz) + b

  and classify z as class 1 if the sum is positive, and class 2 otherwise.
- Note: w need not be formed explicitly.

Interpretation of support vector machines

- The optimal w is a linear combination of a small number of data points.
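The dual derivation above can be checked numerically. The sketch below is my own illustration, not from the lecture: for a toy two-point dataset (an assumed example) it maximizes the dual objective J(α) by brute-force grid search under the constraint Σ α_i y_i = 0, then recovers w from the stationarity condition w = Σ α_i y_i x_i. A real implementation would hand J(α) to a QP solver instead.

```python
import numpy as np

# Toy dataset (illustrative assumption): one positive and one negative example.
X = np.array([[1.0, 1.0],
              [-1.0, -1.0]])
y = np.array([1.0, -1.0])

# The constraint sum_i alpha_i y_i = 0 forces alpha_1 = alpha_2 = a here,
# so J(alpha) = sum_i alpha_i - (1/2) sum_ij y_i y_j alpha_i alpha_j x_i.x_j
# reduces to a function of a single scalar a >= 0.
G = (y[:, None] * y[None, :]) * (X @ X.T)  # matrix of y_i y_j x_i^T x_j

def J(a):
    alpha = np.array([a, a])
    return alpha.sum() - 0.5 * alpha @ G @ alpha

# Brute-force maximization over a grid (a QP solver would be used in practice).
grid = np.linspace(0.0, 1.0, 10001)
a_star = grid[np.argmax([J(a) for a in grid])]

# Recover w via the stationarity condition w = sum_i alpha_i y_i x_i.
alpha = np.array([a_star, a_star])
w = (alpha * y) @ X

print(a_star)  # ~0.25
print(w)       # ~[0.5, 0.5]; both points attain margin y_i w^T x_i = 1, so both are SVs
```

Here both α_i come out nonzero because both points sit on the margin; with more data, most α_i would vanish and only the support vectors would remain in the sum for w.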
- This "sparse" representation can be viewed as data compression, as in the construction of a kNN classifier.
- To compute the weights {α_i}, and to use support vector machines, we need to specify only the inner products (or kernel) x_iᵀx_j between the examples.
- We make decisions by comparing each new example z with only the support vectors:

    y* = sign( Σ_{i∈SV} α_i y_i (x_iᵀz) + b )

Non-linearly Separable Problems

- We allow "errors" ξ_i in classification; they are based on the output of the discriminant function wᵀx + b.
- Σ_i ξ_i approximates the number of misclassified samples.

  (Figure: points from Class 1 and Class 2 that cannot be separated by a hyperplane without error.)

Soft Margin Hyperplane

- Now we have a slightly different optimization problem:

    min_{w,b}  (1/2) wᵀw + C Σ_{i=1}^m ξ_i
    s.t.  y_i (wᵀx_i + b) ≥ 1 − ξ_i, ∀i
          ξ_i ≥ 0, ∀i

- The ξ_i are "slack variables" in the optimization.
- Note that ξ_i = 0 if there is no error for x_i.
- Σ_i ξ_i is an upper bound on the number of training errors.
- C is a tradeoff parameter between error and margin.

The Optimization Problem

- The dual of this new constrained optimization problem is

    max_α  J(α) = Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m y_i y_j α_i α_j (x_iᵀx_j)
    s.t.  0 ≤ α_i ≤ C,  i = 1, …, m,  and  Σ_{i=1}^m α_i y_i = 0

- This is very similar to the optimization problem in the linearly separable case, except that there is now an upper bound C on each α_i.
- Once again, a QP solver can be used to find the α_i.

Extension to a Non-linear Decision Boundary

- So far, we have only considered large-margin classifiers with a linear decision boundary.
- How do we generalize to a nonlinear boundary?
- Key idea: transform x_i to a higher-dimensional space to "make life easier":
  - Input space: the space where the points x_i live.
  - Feature space: the space of the φ(x_i) after transformation.
- Why transform?
  - A linear operation in the feature space is equivalent to a non-linear operation in the input space.
  - Classification can become easier with a proper transformation.
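The "why transform" point can be made concrete with the quadratic feature map used as the example in these notes. The sketch below (my own code) verifies numerically that the feature-space inner product φ(x)ᵀφ(x') collapses to the closed form (xᵀx' + 1)², so the 6-dimensional map never has to be carried out when only inner products are needed:

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map for x = [x1, x2] (the standard choice
    matching the degree-2 polynomial kernel)."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([x1**2, x2**2, s * x1 * x2, s * x1, s * x2, 1.0])

def K(x, xp):
    """Degree-2 polynomial kernel: computed entirely in the 2-D input space."""
    return (np.dot(x, xp) + 1.0) ** 2

rng = np.random.default_rng(0)
for _ in range(5):
    x, xp = rng.normal(size=2), rng.normal(size=2)
    # The explicit 6-D inner product equals the kernel evaluated in 2-D.
    assert np.isclose(phi(x) @ phi(xp), K(x, xp))
print("phi(x).phi(x') == (x.x' + 1)^2 for all tested pairs")
```

The computational point: K does O(d) work in the input space, while the explicit map costs O(d²) dimensions for degree 2 and grows combinatorially with the polynomial degree, which is exactly what the kernel trick avoids.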
Transforming the Data

- Computation in the feature space can be costly because it is high dimensional; the feature space is typically infinite-dimensional!
- The kernel trick comes to the rescue.

  (Figure: the map φ(·) sends points from the input space to the feature space; in practice the feature space has higher dimension than the input space.)

The Kernel Trick

- Recall the SVM optimization problem:

    max_α  J(α) = Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m y_i y_j α_i α_j (x_iᵀx_j)
    s.t.  0 ≤ α_i ≤ C,  i = 1, …, m,  and  Σ_{i=1}^m α_i y_i = 0

- The data points only appear as inner products.
- As long as we can calculate the inner product in the feature space, we do not need the mapping explicitly.
- Many common geometric operations (angles, distances) can be expressed by inner products.
- Define the kernel function K by

    K(x_i, x_j) = φ(x_i)ᵀφ(x_j)

An Example of Feature Mapping and Kernels

- Consider an input x = [x₁, x₂].
- Suppose φ(·) is given as follows:

    φ(x) = [ x₁², x₂², √2·x₁x₂, √2·x₁, √2·x₂, 1 ]ᵀ

- An inner product in the feature space is

    φ(x)ᵀφ(x') = (xᵀx' + 1)²

- So, if we define the kernel function as K(x, x') = (xᵀx' + 1)², there is no need to carry out φ(·) explicitly.

More Examples of Kernel Functions

- Linear kernel (we've seen it):

    K(x, x') = xᵀx'

- Polynomial kernel (we just saw an example):

    K(x, x') = (1 + xᵀx')ᵖ,  p = 2, 3, …

  To get the feature vectors, we concatenate all pth-order polynomial terms of the components of x (weighted appropriately).
- Radial basis kernel:

    K(x, x') = exp( −(1/2) ‖x − x'‖² )

  In this case the feature space consists of functions and results in a non-parametric classifier.

Kernelized SVM

- Training:

    max_α  J(α) = Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m y_i y_j α_i α_j K(x_i, x_j)
    s.t.  0 ≤ α_i ≤ C,  i = 1, …, m,  and  Σ_{i=1}^m α_i y_i = 0

- Using:

    y* = sign( Σ_{i∈SV} α_i y_i K(x_i, z) + b )

SVM Examples

(Figure: examples of SVM decision boundaries.)

Examples for Non-linear SVMs – Gaussian Kernel

(Figure: nonlinear decision boundaries obtained with a Gaussian kernel.)

Cross-validation Error

- The leave-one-out cross-validation error does not depend on the dimensionality of the feature space, but only on the number of support vectors:

    leave-one-out CV error ≤ (# support vectors) / (# training examples)

Machine Learning 10-701/15-781, Fall 2006
Boosting
Eric Xing
Lecture 9, October 10, 2006
Reading: Chap. 14.3, C.B. book

Rationale: Combination of Methods

- There is no algorithm that is always the most accurate.
- We can select simple "weak" classification or regression methods and combine them into a single "strong" method.
- Different learners use different:
  - algorithms
  - hyperparameters
  - representations (modalities)
  - training sets
  - subproblems
- The problem: how to combine them.

Some Early Algorithms

- Boosting by filtering (Schapire 1990):
  - Run the weak learner on differently filtered example sets.
  - Combine the weak hypotheses.
  - Requires knowledge of the performance of