Support Vector Machines
Joseph Gonzalez

From a linear classifier to ...
*One of the most famous slides you will see, ever!

The Big Idea
[Figure: positive (X) and negative (O) training examples separated by a hyperplane.]
Maximum margin: the maximum possible separation between positive and negative training examples.
*One of the most famous slides you will see, ever!

Geometric Intuition
[Figure: the separating hyperplane with its margin; the X and O examples lying on the margin are the SUPPORT VECTORS.]

Primal Version
$$\min_{w,b,\xi}\ \|w\|^2 + C \sum_i \xi_i \qquad \text{s.t. } y_i(w \cdot x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0$$

DUAL Version
Where did this come from? Remember Lagrange multipliers:
• Let us "incorporate" the constraints into the objective
• Then solve the problem in the "dual" space of Lagrange multipliers
$$\max_\alpha\ \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \qquad \text{s.t. } \sum_i \alpha_i y_i = 0,\quad 0 \le \alpha_i \le C$$

Primal vs Dual
• Number of parameters? Primal: one weight per feature (plus one slack per example). Dual: one α_i per example.
• Large # features? The DUAL is preferred: its size does not grow with the number of features.
• Large # examples? Many α_i can go to zero!

DUAL: the "Support Vector" version
• How do we find α? Quadratic programming.
• How do we find C? Cross-validation!
• How do we find w? $w = \sum_i \alpha_i y_i x_i$
• How do we find b? The margin is tight at a support vector, so $b = y_i - w \cdot x_i$ there.
• Wait... how do we predict y for a new point x? $y = \mathrm{sign}(w \cdot x + b) = \mathrm{sign}\big(\sum_i \alpha_i y_i\,(x_i \cdot x) + b\big)$

"Support Vector"s?
[Figure: one O example at (0, 1) with multiplier α₁ and one X example at (2, 2) with multiplier α₂.]
Take x₁ = (0, 1) with y₁ = +1 and x₂ = (2, 2) with y₂ = −1. Then x₁·x₁ = 1, x₁·x₂ = 2, and x₂·x₂ = 8, so the dual becomes
$$\max_\alpha\ \alpha_1 + \alpha_2 + 2\alpha_1\alpha_2 - \frac{\alpha_1^2}{2} - 4\alpha_2^2 \qquad \text{s.t. } \alpha_1 - \alpha_2 = 0,\quad 0 \le \alpha_i \le C$$
The constraint forces α₁ = α₂ = α, so the objective is $2\alpha - \frac{5}{2}\alpha^2 = \frac{5}{2}\,\alpha\big(\frac{4}{5} - \alpha\big)$: a downward parabola with roots at 0 and 4/5, maximized at α₁ = α₂ = 2/5. Then
$$w = \sum_i \alpha_i y_i x_i = 0.4\,\big([0\ \ 1] - [2\ \ 2]\big) = 0.4\,[-2\ \ -1]$$
and, since y = w·x + b holds with equality at a support vector, plugging in x₁ gives
$$b = y_1 - w \cdot x_1 = 1 - 0.4\,[-2\ \ -1] \cdot [0\ \ 1] = 1 + 0.4 = 1.4$$
(A numeric check of this arithmetic appears in the sketch after the slides.)

"Support Vector"s?
[Figure: the same two points plus a third O example with multiplier α₃.]
What is α₃? Try this at home.

Playing With SVMs
• http://www.csie.ntu.edu.tw/~cjlin/libsvm/

More on Kernels
• Kernels represent inner products
– K(a,b) = a·b
– K(a,b) = φ(a)·φ(b)
• The kernel trick allows an extremely complex φ(·) while keeping K(a,b) simple
• Goal: avoid having to directly construct φ(·) at any point in the algorithm

Kernels
The complexity of the optimization problem depends only on the dimensionality of the input space, not of the feature space!

Can we use Kernels to Measure Distances?
• Can we measure the distance between φ(a) and φ(b) using K(a,b)?

Continued:
• Yes: expand the squared distance and every term is an inner product:
$$\|\varphi(a) - \varphi(b)\|^2 = \varphi(a)\cdot\varphi(a) - 2\,\varphi(a)\cdot\varphi(b) + \varphi(b)\cdot\varphi(b) = K(a,a) - 2K(a,b) + K(b,b)$$

Popular Kernel Methods
• Gaussian Processes
• Kernel Regression (Smoothing)
– Nadaraya-Watson Kernel
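As a check on the two-point worked example, here is a minimal NumPy sketch of the closed-form solution; the variable names are ours, not from the slides.

```python
# Check the slides' two-point example: x1 = (0, 1) labeled +1,
# x2 = (2, 2) labeled -1; the slides claim w = 0.4 * [-2, -1] and b = 1.4.
import numpy as np

X = np.array([[0.0, 1.0], [2.0, 2.0]])
y = np.array([1.0, -1.0])

# The equality constraint forces alpha1 = alpha2 = a, the objective
# reduces to 2a - (5/2)a^2, and the maximizer is a = 2/5.
a = 2.0 / 5.0
alpha = np.array([a, a])

# w = sum_i alpha_i y_i x_i
w = (alpha * y) @ X
# b from a support vector, where the margin is tight: b = y_1 - w . x_1
b = y[0] - w @ X[0]

print(w)  # [-0.8 -0.4]  (= 0.4 * [-2, -1])
print(b)  # 1.4

# The decision rule classifies both training points correctly:
print(np.sign(X @ w + b))  # [ 1. -1.]
```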
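The "Playing With SVMs" slide points at libsvm; scikit-learn's SVC class wraps that library, so here is a sketch of fitting and predicting with it. The toy data, kernel choice, and parameter values are our assumptions for illustration, not from the slides.

```python
# Soft-margin SVM via scikit-learn's SVC (a wrapper around libsvm).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, size=(20, 2)),
               rng.normal(2.0, 1.0, size=(20, 2))])
y = np.array([-1] * 20 + [1] * 20)

# alpha is found by quadratic programming inside libsvm; in practice C
# would be chosen by cross-validation (e.g., sklearn's GridSearchCV).
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Only the support vectors keep nonzero alpha_i; dual_coef_ stores alpha_i * y_i.
print(clf.support_vectors_.shape)
print(clf.dual_coef_)

# Dual prediction rule from the slides: y = sign(sum_i alpha_i y_i (x_i . x) + b)
x_new = np.array([[0.5, 0.5]])
score = clf.dual_coef_ @ (clf.support_vectors_ @ x_new.T) + clf.intercept_
print(np.sign(score), clf.predict(x_new))  # the two agree
```

Note how the manual score uses only the support vectors, not all 40 training points: the other examples have α_i = 0 and drop out of the sum, which is the sparsity the "Primal vs Dual" slide highlights.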
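Finally, a minimal sketch of the kernel-distance identity from the "Can we use Kernels to Measure Distances?" slides. The RBF kernel and the gamma value here are our choices for illustration; any valid kernel works the same way.

```python
# Distance between phi(a) and phi(b) computed purely through K, without
# ever constructing phi: ||phi(a) - phi(b)||^2 = K(a,a) - 2 K(a,b) + K(b,b).
import numpy as np

def rbf(a, b, gamma=0.5):
    """Gaussian (RBF) kernel K(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

a = np.array([1.0, 2.0])
b = np.array([3.0, 0.0])

dist_sq = rbf(a, a) - 2 * rbf(a, b) + rbf(b, b)
print(np.sqrt(dist_sq))  # feature-space distance, computed only via K
```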