Support Vector Machines
Aarti Singh
Machine Learning 10-601
Nov 22, 2011

SVMs: reminder

Soft-margin primal problem:

    min_{w,b,ξ}  ½ w·w + C Σ_j ξ_j
    s.t.  (w·x_j + b) y_j ≥ 1 − ξ_j   ∀j
          ξ_j ≥ 0                     ∀j

½ w·w is the regularization term, C Σ_j ξ_j is the hinge loss, and the slacks ξ_j give the soft-margin approach. Essentially a constrained optimization problem!

Constrained Optimization

Toy primal problem: min_x x² s.t. x ≥ b. With Lagrange multiplier α and Lagrangian L(x, α) = x² − α(x − b):

    f* = min_x max_{α≥0}  x² − α(x − b)

Dual problem: swap the min and the max:

    d* = max_{α≥0} min_x  x² − α(x − b)

In general d* ≤ f*. When is d* = f*?

d* = f* if f is convex and the primal solution x* and dual solution α* satisfy the KKT conditions:

    ∇L(x*, α*) = 0        zero gradient
    x* ≥ b                primal feasibility
    α* ≥ 0                dual feasibility
    α*(x* − b) = 0        complementary slackness

⇒ If α* > 0, then x* = b, i.e. the constraint is effective.

Example (b = 1): when the unconstrained minimizer x = 0 satisfies the constraint, the constraint is ineffective and α* = 0; when it does not, the constraint is effective, and complementary slackness α*(x* − 1) = 0 forces x* = 1 with α* > 0.

Dual SVM – linearly separable case

• Primal problem:

    min_{w,b}  ½ w·w
    s.t.  (w·x_j + b) y_j ≥ 1   ∀j

• Lagrangian (w: weights on features; α: weights on training points):

    L(w, b, α) = ½ w·w − Σ_j α_j [ (w·x_j + b) y_j − 1 ]

α_j > 0 ⇒ the j-th constraint is effective, (w·x_j + b) y_j = 1 ⇒ point j is a support vector!

• Dual problem derivation: set the gradients of L to zero:

    ∂L/∂w = 0  ⇒  w = Σ_j α_j y_j x_j
    ∂L/∂b = 0  ⇒  Σ_j α_j y_j = 0

If we can solve for the αs (dual problem), then we have a solution for w (primal problem).

• Dual problem (substituting back):

    max_α  Σ_j α_j − ½ Σ_{j,k} α_j α_k y_j y_k (x_j·x_k)
    s.t.   α_j ≥ 0 ∀j,   Σ_j α_j y_j = 0

The dual problem is also a QP; its solution gives the α_j s. Use any support vector x_k to compute b: (w·x_k + b) y_k = 1, so b = y_k − w·x_k.

Dual SVM interpretation: sparsity

Only a few α_j s can be non-zero: those where the constraint is tight, (w·x_j + b) y_j = 1. Support vectors are the training points j whose α_j s are non-zero; points with α_j = 0 do not enter the decision boundary w·x + b = 0.

So why solve the dual SVM?
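The toy constrained problem min x² s.t. x ≥ b can be checked mechanically. A minimal sketch (my own illustration, not code from the lecture; the function names are made up): solve it in closed form and verify the four KKT conditions.

```python
# Hypothetical helper (not from the slides): closed-form solution of
#   min_x x^2  s.t.  x >= b
# plus a check of the four KKT conditions for L(x, a) = x^2 - a(x - b).

def solve_primal(b):
    """Return (x*, alpha*) for min x^2 s.t. x >= b."""
    if b <= 0:
        return 0.0, 0.0   # unconstrained minimum x = 0 is feasible: constraint ineffective
    return b, 2.0 * b     # constraint effective: stationarity 2x - alpha = 0 gives alpha = 2b

def kkt_satisfied(x, alpha, b, tol=1e-9):
    """Check the four KKT conditions for this problem."""
    stationarity = abs(2.0 * x - alpha) < tol    # grad L(x*, alpha*) = 0
    primal_feas  = x >= b - tol                  # x* >= b
    dual_feas    = alpha >= -tol                 # alpha* >= 0
    comp_slack   = abs(alpha * (x - b)) < tol    # alpha* (x* - b) = 0
    return stationarity and primal_feas and dual_feas and comp_slack
```

For b = 1 the constraint is effective (x* = 1, α* = 2 > 0); for b = −2 it is ineffective (x* = 0, α* = 0).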
• There are some quadratic programming algorithms that can solve the dual faster than the primal, especially in high dimensions, d ≫ n.
  Recall: (w_1, w_2, …, w_d, b) are d+1 primal variables; α_1, α_2, …, α_n are n dual variables.

Dual SVM – non-separable case

• Primal problem (Lagrange multipliers α_j for the margin constraints, μ_j for ξ_j ≥ 0):

    min_{w,b,ξ}  ½ w·w + C Σ_j ξ_j
    s.t.  (w·x_j + b) y_j ≥ 1 − ξ_j,  ξ_j ≥ 0   ∀j

• Dual problem:

    max_α  Σ_j α_j − ½ Σ_{j,k} α_j α_k y_j y_k (x_j·x_k)
    s.t.   0 ≤ α_j ≤ C ∀j,   Σ_j α_j y_j = 0

The dual problem is again a QP; its solution gives the α_j s. The upper bound α_j ≤ C comes from eliminating the slack terms. Intuition: earlier (hard margin), if a constraint is violated, α_i → ∞; now, if a constraint is violated, α_i is capped at C.

So why solve the dual SVM?

• As before, some QP algorithms solve the dual faster than the primal, especially in high dimensions d ≫ n (d+1 primal variables vs. n dual variables).
• But, more importantly, the "kernel trick"!!!

What if data is not linearly separable?

• Use non-linear features to get linear separation. E.g. in 2-D, map (x_1, x_2) to radius r = √(x_1² + x_2²) and angle θ; in 1-D, map x to (x, x²).
• Use features of features of features of features…, e.g. Φ(x) = (x_1², x_2², x_1x_2, …, exp(x_1)). The feature space becomes really large very quickly! Can we get by without having to write out the features explicitly?

Higher-order polynomials

With d input features and a degree-m polynomial, the number of degree-m terms is

    C(m+d−1, m) = (m+d−1)! / (m! (d−1)!)  ≈  dᵐ/m!  for d ≫ m

which grows fast: for m = 6, d = 100, about 1.6 billion terms.

The dual formulation only depends on dot products, not on w!

Φ(x) lives in a high-dimensional feature space, but we never need it explicitly as long as we can compute the dot product fast using some kernel K.

Dot product of polynomials

• m = 1:  Φ(x) = x, so Φ(x)·Φ(z) = x·z
• m = 2:  Φ(x) = (x_1², √2 x_1x_2, x_2²), so Φ(x)·Φ(z) = (x·z)²
• general m:  Φ(x)·Φ(z) = (x·z)ᵐ = K(x, z)

Don't store high-dimensional features; only evaluate dot products with kernels.

Finally: the kernel trick!

• Never represent features explicitly; compute dot products in closed form.
• Constant-time high-dimensional dot products for many classes of features.

Common kernels

• Polynomials of degree d:  K(x, z) = (x·z)ᵈ
• Polynomials of degree up to d:  K(x, z) = (x·z + 1)ᵈ
• Gaussian/radial kernels (polynomials of all orders; recall the series expansion of exp):  K(x, z) = exp(−‖x − z‖² / (2σ²))
• Sigmoid:  K(x, z) = tanh(η x·z + ν)

Which functions can be kernels?

• Not all functions: for some definitions of K(x_1, x_2) there is no corresponding projection Φ(x).
• There is nice theory on this, including how to construct new kernels from existing ones.
• Initially kernels were defined over data points in Euclidean space, but more recently over strings, trees, graphs, …

Overfitting

• Huge feature space with kernels: what about overfitting?
  – Maximizing the margin leads to a sparse set of support vectors (the decision boundary is not too complicated).
  – Some interesting theory says that SVMs search for simple hypotheses with large margin.
  – Often robust to overfitting.

What about classification time?

• For a new input x, if we need to represent Φ(x), we are in trouble!
• Recall the classifier: sign(w·Φ(x) + b).
• Using kernels we are fine: w·Φ(x) = Σ_j α_j y_j K(x_j, x).

SVMs with kernels

• Choose a set of features and a kernel function.
• Solve the dual problem to obtain the support vectors and the α_j s.
• At classification time, for input x compute

    f(x) = Σ_j α_j y_j K(x_j, x) + b

  and classify as sign(f(x)).

SVM decision surface using Gaussian kernel [Bishop Fig. 7.2]

Circled points are the support vectors: training examples with non-zero α_j. Points are plotted in the original 2-D space; contour lines show constant values of the decision function.

SVM soft-margin decision surface using Gaussian kernel [Bishop Fig. 7.4]

Circled points are the support vectors: training examples with non-zero α_j. Points are plotted in the original 2-D space; contour lines show constant values of the decision function.

SVMs vs. kernel regression

Both predict with a weighted sum of kernel evaluations at training points. Differences:
• SVMs: learn the weights α_i (and bandwidth); often a sparse solution.
• KR: fixed "weights", learn the bandwidth; the solution may not be sparse; much simpler to implement.

SVMs vs. logistic regression

                                            SVMs          Logistic Regression
  Loss function                             Hinge loss    Log-loss
  High-dimensional features with kernels    Yes!          Yes!
  Solution sparse                           Often yes!    Almost always no!
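The classification rule above can be sketched in a few lines. This is an illustrative toy (hand-picked support vectors, α values, and kernel; not the lecture's code): prediction touches the support vectors only through K and never forms Φ(x).

```python
import math

def gaussian_kernel(x, z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def svm_predict(x, support_vectors, alphas, labels, b, kernel=gaussian_kernel):
    # f(x) = sum_j alpha_j y_j K(x_j, x) + b ; classify as sign(f(x)).
    # Only support vectors (alpha_j > 0) contribute to the sum.
    f = sum(a * y * kernel(sv, x)
            for sv, a, y in zip(support_vectors, alphas, labels)) + b
    return 1 if f >= 0 else -1
```

For example, with support vectors (0, 0) labeled −1 and (2, 2) labeled +1 (both with α = 1, b = 0), a query at (2, 2) is classified +1 and a query at (0, 0) is classified −1.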
  Semantics of output                       "Margin"      Real probabilities

Kernels in logistic regression

• Define the weights in terms of the features:

    w = Σ_j α_j Φ(x_j)   so that   w·Φ(x) = Σ_j α_j K(x_j, x)
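The substitution above is the whole idea of kernelizing logistic regression: with w = Σ_j α_j Φ(x_j), the logit w·Φ(x) + b becomes Σ_j α_j K(x_j, x) + b. A small sketch with a linear kernel and made-up numbers (my illustration, not the lecture's code), so the explicit-w and kernelized computations can be compared directly:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def prob_explicit(x, w, b):
    # Ordinary logistic regression: P(y = 1 | x) = sigmoid(w.x + b)
    return sigmoid(dot(w, x) + b)

def prob_kernelized(x, train_pts, alphas, b, kernel=dot):
    # Same model with w = sum_j alpha_j x_j, so w.x = sum_j alpha_j K(x_j, x)
    return sigmoid(sum(a * kernel(xj, x)
                       for a, xj in zip(alphas, train_pts)) + b)
```

With train_pts = [(1, 0), (0, 1)] and α = (2, −1), the implied weight vector is w = (2, −1), and both functions return the same probability for any x; swapping in a non-linear kernel gives logistic regression in the corresponding feature space without ever forming Φ.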