Support Vector Machines and Kernels
Adapted from slides by Tim Oates
Cognition, Robotics, and Learning (CORAL) Lab
University of Maryland Baltimore County

Doing Really Well with Linear Decision Surfaces

Outline
- Prediction
  - Why might predictions be wrong?
- Support vector machines
  - Doing really well with linear models
- Kernels
  - Making the non-linear linear

Supervised ML = Prediction
- Given training instances (x, y)
- Learn a model f such that f(x) = y
- Use f to predict y for new x
- Many variations on this basic theme

Why might predictions be wrong?
- True non-determinism
  - Flip a biased coin: p(heads) = θ
  - Estimate θ; if θ > 0.5 predict heads, else tails
- Lots of ML research on problems like this
  - Learn a model
  - Do the best you can in expectation (see the sketch below)
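A minimal sketch of "estimate θ, then predict the majority outcome" (plain NumPy; the numbers and variable names are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.7                       # true (unknown) bias of the coin
flips = rng.random(1000) < theta  # observed training flips: True means heads

theta_hat = flips.mean()          # estimate theta from the data
prediction = "heads" if theta_hat > 0.5 else "tails"

# Always predicting the majority outcome is the best you can do in
# expectation: it is correct with probability max(theta, 1 - theta).
print(f"theta_hat = {theta_hat:.3f}; always predict {prediction}")
```

Even a perfect model cannot beat max(θ, 1 − θ) here; the remaining error is irreducible noise, not a learning failure.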
Why might predictions be wrong? (cont.)
- Partial observability
  - Something needed to predict y is missing from observation x
  - N-bit parity problem:
    - x contains N-1 bits (hard PO)
    - x contains N bits but the learner ignores some of them (soft PO)

Why might predictions be wrong? (summary)
- True non-determinism
- Partial observability: hard, soft
- Representational bias
- Algorithmic bias
- Bounded resources

Representational Bias
- Having the right features (x) is crucial
[Figure: the same X/O points plotted under two feature choices; with the right features they become linearly separable]

Support Vector Machines
Doing Really Well with Linear Decision Surfaces

Strengths of SVMs
- Good generalization in theory
- Good generalization in practice
- Work well with few training instances
- Find globally best model
- Efficient algorithms
- Amenable to the kernel trick

Linear Separators
- Training instances: x ∈ ℝⁿ, y ∈ {-1, 1}
- Parameters: w ∈ ℝⁿ, b ∈ ℝ
- Hyperplane: <w, x> + b = 0, i.e. w₁x₁ + w₂x₂ + … + wₙxₙ + b = 0
- Decision function: f(x) = sign(<w, x> + b)

Math Review
- Inner (dot) product: <a, b> = a · b = Σᵢ aᵢbᵢ = a₁b₁ + a₂b₂ + … + aₙbₙ

Intuitions
[Figures, four slides: the same X/O training set with different candidate linear separators drawn through it]

A "Good" Separator
[Figure: a separator that leaves generous room between itself and both classes]

Noise in the Observations
[Figure: the X/O points jittered by observation noise]

Ruling Out Some Separators
[Figure: separators that pass too close to the data are eliminated by the noise]

Lots of Noise
[Figure: under heavy noise, only separators far from both classes remain viable]

Maximizing the Margin
[Figure: the separator that maximizes the distance to the nearest points of either class]

"Fat" Separators
[Figure: the separator drawn as a thick slab; the slab's width is the margin]

Why Maximize Margin?
- Increasing margin reduces capacity
- Must restrict capacity to generalize
  - m training instances
  - 2^m ways to label them
  - What if the function class can separate them all?
  - Then it shatters the training instances
- VC dimension: the largest m such that the function class can shatter some set of m points

VC Dimension Example
[Figure: the 2³ = 8 labelings of three points, each separable by a line, so linear separators in the plane shatter three points]

Bounding Generalization Error
- R[f] = risk (test error)
- R_emp[f] = empirical risk (training error)
- h = VC dimension
- m = number of training instances
- δ = probability that the bound does not hold
- With probability 1 − δ:
  R[f] ≤ R_emp[f] + sqrt( (h(ln(2m/h) + 1) + ln(4/δ)) / m )
  (evaluated numerically in the sketch below)
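To make the bound concrete, here is a small sketch (a helper of our own, not from the slides) that evaluates the capacity term for given h, m, and δ:

```python
import math

def vc_bound(emp_risk: float, h: int, m: int, delta: float) -> float:
    """Upper bound on the true risk R[f], holding with probability 1 - delta."""
    capacity = math.sqrt((h * (math.log(2 * m / h) + 1) + math.log(4 / delta)) / m)
    return emp_risk + capacity

# Larger m (more data) or smaller h (lower capacity) tightens the bound:
print(vc_bound(emp_risk=0.05, h=10, m=1_000, delta=0.05))    # ~0.31
print(vc_bound(emp_risk=0.05, h=10, m=100_000, delta=0.05))  # ~0.08
```

This is why maximizing the margin matters: restricting the margin lowers the effective capacity h, which shrinks the second term without touching the training error.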
Support Vectors
[Figure: the maximum-margin separator; the X's and O's lying on the margin boundary are the support vectors]

The Math
- Training instances: x ∈ ℝⁿ, y ∈ {-1, 1}
- Decision function: f(x) = sign(<w, x> + b), with w ∈ ℝⁿ, b ∈ ℝ
- Find w and b that
  - perfectly classify the training instances (assuming linear separability)
  - maximize the margin

The Math (cont.)
- For perfect classification, we want yᵢ(<w, xᵢ> + b) ≥ 0 for all i
  (see the training sketch below)
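A minimal end-to-end sketch of this optimization (assuming scikit-learn is available; kernel="linear" fits f(x) = sign(<w, x> + b), and a large C approximates the hard-margin, perfect-classification setting on the slide):

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data: class -1 in one cluster, class +1 in another.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, (20, 2)), rng.normal(2.0, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

# A large C penalizes any margin violation, approximating the hard margin.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]             # hyperplane <w, x> + b = 0
print("w =", w, ", b =", b)
print("support vectors:\n", clf.support_vectors_)  # points on the margin boundary
print("margin width =", 2 / np.linalg.norm(w))     # the quantity being maximized
```

Note that only the support vectors determine w and b: moving any other training point, as long as it stays outside the margin, leaves the separator unchanged.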