SVMs, Duality and the Kernel Trick (cont.)
Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University
March 1st, 2006

Two SVM tutorials are linked on the class website (please read both):
- A high-level presentation with applications (Hearst 1998)
- A detailed tutorial (Burges 1998)

SVMs reminder

Today's lecture
- Learn one of the most interesting and exciting recent advancements in machine learning: the "kernel trick"
  - High-dimensional feature spaces at no extra cost!
- But first, a detour: constrained optimization!

Dual SVM interpretation
(figure: the separating hyperplane, w·x + b = 0)

Dual SVM formulation – the linearly separable case
- Maximize over α:  Σi αi − ½ Σi,j αi αj yi yj (xi · xj)
- Subject to:  Σi αi yi = 0 and αi ≥ 0 for all i
- The primal solution is recovered as w = Σi αi yi xi

Reminder from last time: What if the data is not linearly separable?
- Use features of features of features of features…
- The feature space can get really large really quickly!

Higher-order polynomials
(figure: number of monomial terms vs. number of input dimensions, curves for d = 2, 3, 4)
- m – number of input features; d – degree of polynomial
- The number of monomial terms grows fast! For d = 6 and m = 100: about 1.6 billion terms

Dual formulation only depends on dot products, not on w!
- The training data enter the dual objective only through the dot products xi · xj

Finally: the "kernel trick"!
- Never represent features explicitly: compute the dot products in closed form, K(u, v) = Φ(u) · Φ(v)
- Constant-time high-dimensional dot products for many classes of features (see the first sketch below)
- Very interesting theory – Reproducing Kernel Hilbert Spaces
  - Not covered in detail in 10701/15781; more in 10702

Common kernels
- Polynomials of degree d:  K(u, v) = (u · v)^d
- Polynomials of degree up to d:  K(u, v) = (u · v + 1)^d
- Gaussian kernels:  K(u, v) = exp(−‖u − v‖² / 2σ²)
- Sigmoid:  K(u, v) = tanh(η u · v + ν)

Overfitting?
- Huge feature space with kernels – what about overfitting???
- Maximizing the margin leads to a sparse set of support vectors
- Some interesting theory says that SVMs search for simple hypotheses with large margin
- Often robust to overfitting

What about classification time?
- For a new input x, if we need to represent Φ(x) explicitly, we are in trouble!
- Recall the classifier: sign(w · Φ(x) + b)
- Using kernels we are cool!

SVMs with kernels
- Choose a set of features and a kernel function
- Solve the dual problem to obtain the support vector weights αi
- At classification time, compute f(x) = Σi αi yi K(xi, x) + b and classify as sign(f(x)) (see the second sketch below)

Remember kernel regression???
1. wi = exp(−D(xi, query)² / Kw²)
2. How to fit with the local points? Predict the weighted average of the outputs: predict = Σi wi yi / Σi wi

SVMs v. Kernel Regression
- SVMs predict:  sign(Σi αi yi K(xi, x) + b)
- Kernel regression predicts:  sign(Σi K(xi, x) yi / Σi K(xi, x)), or equivalently (for positive kernels) sign(Σi K(xi, x) yi)
- Differences:
  - SVMs: learn the weights αi (and the bandwidth); often a sparse solution
  - KR: fixed "weights", learn the bandwidth; the solution may not be sparse; much simpler to implement

What's the difference between SVMs and Logistic Regression?

                                           SVMs          Logistic Regression
  Loss function                            Hinge loss    Log-loss
  High-dimensional features with kernels   Yes!          No

Kernels in logistic regression
- Define the weights in terms of the support vectors:  w = Σi αi Φ(xi)
- Derive a simple gradient descent rule on αi (see the third sketch below)
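The three sketches below are illustrative Python, not part of the original slides; all variable names and toy data are assumptions made for illustration. First, the kernel trick itself: for the degree-2 polynomial kernel, the closed-form dot product matches the explicit feature map exactly, while counting monomials shows why the explicit map is hopeless.

```python
import numpy as np
from math import comb

def phi_degree2(x):
    """Explicit degree-2 feature map: all pairwise products x_i * x_j."""
    return np.outer(x, x).ravel()

rng = np.random.default_rng(0)
u = rng.standard_normal(100)
v = rng.standard_normal(100)

# Kernel trick: a 100-dim dot product replaces a 10,000-dim one.
explicit = phi_degree2(u) @ phi_degree2(v)  # works in feature space
kernel = (u @ v) ** 2                       # closed form, O(m) work
assert np.isclose(explicit, kernel)

# Number of degree-d monomials in m variables: C(m + d - 1, d).
m, d = 100, 6
print(f"{comb(m + d - 1, d):,}")  # 1,609,344,100 -- the ~1.6 billion on the slide
```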
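Second, the two prediction rules just compared, as a hedged sketch: the kernelized SVM classifier sign(Σi αi yi K(xi, x) + b) next to Nadaraya-Watson kernel regression. The αi and b would normally come from a dual SVM solver; the values below are hypothetical.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def svm_predict(x, support_x, support_y, alpha, b, kernel):
    """Kernelized SVM rule: sign(sum_i alpha_i y_i K(x_i, x) + b).
    Only support vectors (alpha_i > 0) actually need to be stored."""
    s = sum(a * y * kernel(xi, x)
            for a, y, xi in zip(alpha, support_y, support_x))
    return np.sign(s + b)

def kernel_regression(x, train_x, train_y, kernel):
    """Nadaraya-Watson: predict = sum_i w_i y_i / sum_i w_i with
    w_i = K(x_i, x). No weights are learned, only the bandwidth."""
    w = np.array([kernel(xi, x) for xi in train_x])
    return w @ train_y / w.sum()

# Tiny hypothetical example.
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([-1.0, 1.0, 1.0])
alpha = np.array([0.5, 0.5, 0.0])  # made-up dual weights
print(svm_predict(np.array([1.5]), X, y, alpha, b=0.0,
                  kernel=gaussian_kernel))
print(kernel_regression(np.array([1.5]), X, y, gaussian_kernel))
```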
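Third, kernelized logistic regression: a minimal sketch of the α-space gradient rule, assuming labels y ∈ {0, 1} and plain batch gradient descent (the learning rate and iteration count are arbitrary choices, not from the slides).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def kernel_logistic_regression(K, y, lr=0.1, iters=1000):
    """Logistic regression with w = sum_j alpha_j Phi(x_j).
    K is the n x n Gram matrix K[i, j] = K(x_i, x_j); y is in {0, 1}.
    Predicted probabilities: p_i = sigmoid(sum_j alpha_j K[i, j]).
    The log-loss gradient w.r.t. alpha works out to K^T (p - y)."""
    alpha = np.zeros(len(y))
    for _ in range(iters):
        p = sigmoid(K @ alpha)              # current probabilities
        alpha -= lr * (K.T @ (p - y)) / len(y)
    return alpha
```

At test time, P(y = 1 | x) = sigmoid(Σj αj K(xj, x)), so only kernel evaluations against the training points are needed. Unlike the SVM's αi, though, these weights are generally all nonzero, which is the sparsity difference in the table that follows.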
What's the difference between SVMs and Logistic Regression? (Revisited)

                                           SVMs          Logistic Regression
  Loss function                            Hinge loss    Log-loss
  High-dimensional features with kernels   Yes!          Yes!
  Solution sparse?                         Often yes!    Almost always no!
  Semantics of output                      "Margin"      Real probabilities

What you need to know
- The dual SVM formulation, and how it's derived
- The kernel trick
- How to derive the polynomial kernel
- Common kernels
- Kernelized logistic regression
- Differences between SVMs and logistic regression

Acknowledgment
- SVM applet: http://www.site.uottawa.ca/~gcaron/applets.htm

PAC-learning, VC Dimension and Margin-based Bounds
Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University
March 1st, 2006

More details:
- General: http://www.learning-with-kernels.org/
- Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz

What now…
- We have explored many ways of learning from data
- But…
  - How good is our classifier, really?
  - How much data do I need to make it "good enough"?

A simple setting…
- Classification
- m data points
- Finite number of possible hypotheses (e.g., decision trees of depth d)
- A learner finds a hypothesis h that is consistent with the training data
  - Gets zero error in training: error_train(h) = 0
- What is the probability that h has more than ε true error? error_true(h) ≥ ε

How likely is a bad hypothesis to get m data points right?
- A hypothesis h that is consistent with the training data got m i.i.d. points right
- Prob. that an h with error_true(h) ≥ ε gets one data point right:  ≤ 1 − ε
- Prob. that an h with error_true(h) ≥ ε gets m data points right:  ≤ (1 − ε)^m

But there are many possible hypotheses that are consistent with the training data

How likely is the learner to pick a bad hypothesis?
- Prob. that an h with error_true(h) ≥ ε gets m data points right:  ≤ (1 − ε)^m
- There are k hypotheses consistent with the data
- How likely is the learner to pick a bad one?

Union bound
- P(A or B or C or D or …) ≤ P(A) + P(B) + P(C) + P(D) + …

How likely is the learner to pick a bad hypothesis?
- Prob. that an h with error_true(h) ≥ ε gets m data points right:  ≤ (1 − ε)^m
- There are k hypotheses consistent with the data
- By the union bound, the prob. that the learner picks a bad one is  ≤ k (1 − ε)^m ≤ |H| (1 − ε)^m

Review: Generalization error in finite hypothesis spaces [Haussler '88]
- Theorem: For a finite hypothesis space H, a dataset D with m i.i.d. samples, and 0 < ε < 1, for any learned hypothesis h that is consistent with the training data:
  P(error_true(h) > ε) ≤ |H| (1 − ε)^m ≤ |H| e^(−mε)
  (the last step uses 1 − ε ≤ e^(−ε))

Using a PAC bound
- Typically, two use cases (both invert |H| e^(−mε) ≤ δ; a numeric sketch appears at the end of these notes):
  1. Pick ε and δ, and the bound gives you m:  m ≥ (ln |H| + ln(1/δ)) / ε
  2. Pick m and δ, and the bound gives you ε:  ε ≥ (ln |H| + ln(1/δ)) / m

Review: Generalization error in finite hypothesis spaces [Haussler '88]
- Theorem (repeated from above):  P(error_true(h) > ε) ≤ |H| e^(−mε)
- Even if h makes zero errors on the training data, it may still make errors at test time

Limitations of the Haussler '88 bound
- Requires a consistent classifier
- Depends on the size of the hypothesis space

What if our classifier does not have zero error on the training data?
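A numeric sketch of the two PAC-bound use cases above; the hypothesis-space size and the confidence level are made-up example values, not numbers from the lecture.

```python
from math import ceil, log

def sample_complexity(H_size, eps, delta):
    """Use case 1: pick epsilon and delta, get m.
    Smallest m with |H| * exp(-m * eps) <= delta."""
    return ceil((log(H_size) + log(1 / delta)) / eps)

def error_bound(H_size, m, delta):
    """Use case 2: pick m and delta, get epsilon."""
    return (log(H_size) + log(1 / delta)) / m

# Example: |H| = 2**100 hypotheses, 95% confidence (delta = 0.05).
print(sample_complexity(2**100, eps=0.1, delta=0.05))  # 724 samples
print(error_bound(2**100, m=10000, delta=0.05))        # ~0.0072
```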