SVMs, Duality and the Kernel Trick
Machine Learning 10-701/15-781
Carlos Guestrin
Carnegie Mellon University
October 21st, 2009

SVMs reminder

Today's lecture
- Learn one of the most interesting and exciting recent advancements in machine learning: the kernel trick
  - High-dimensional feature spaces at no extra cost!
- But first, a detour: constrained optimization

Constrained optimization

Lagrange multipliers - dual variables
- Move the constraint into the objective function via the Lagrangian, then solve

Lagrange multipliers - dual variables (solving)

Dual SVM derivation (1) - the linearly separable case

Dual SVM derivation (2) - the linearly separable case

Dual SVM interpretation: sparsity
- Most α_i are zero; the points with α_i > 0 are the support vectors

Dual SVM formulation - the linearly separable case
- Maximize over α_i ≥ 0: Σ_i α_i - ½ Σ_{i,j} α_i α_j y_i y_j (x_i · x_j), subject to Σ_i α_i y_i = 0
- Recover w = Σ_i α_i y_i x_i

Dual SVM derivation - the non-separable case

Dual SVM formulation - the non-separable case
- Same as above, but with the box constraint 0 ≤ α_i ≤ C

Why did we learn about the dual SVM?
- There are some quadratic programming algorithms that can solve the dual faster than the primal
- But, more importantly: the kernel trick!
- Another little detour...

Announcements: Midterm
- When: Thursday 10/29, 5pm - 6:30pm
- Where: Doherty 2210
- What: you, your pencil, your textbook, your notes, course slides, your calculator, your good mood
- What NOT: no computers, iPhones, or anything else that has an internet connection
- Material: everything from the beginning of the semester up to and including SVMs and the kernel trick

Reminder from last time: What if the data is not linearly separable?
- Use features of features of features of features...
- Feature space can get really large really quickly!

Higher order polynomials
[Plot: number of monomial terms vs. number of input dimensions, for polynomial degrees d = 2, 3, 4; m = number of input features, d = degree of polynomial]
- Grows fast! For d = 6, m = 100: about 1.6 billion terms

Dual formulation only depends on dot products, not on w!

Dot product of polynomials

Finally: the kernel trick!
- Never represent features explicitly
  - Compute dot products in closed form
- Constant-time high-dimensional dot products for many classes of features
- Very interesting theory: Reproducing Kernel Hilbert Spaces
  - Not covered in detail in 10-701/15-781; more in 10-702

Polynomial kernels
- All monomials of degree d in O(d) operations
- How about all monomials of degree up to d?
  - Solution 0: ...
  - Better solution: ...

Common kernels
- Polynomials of degree d: K(u, v) = (u·v)^d
- Polynomials of degree up to d: K(u, v) = (u·v + 1)^d
- Gaussian kernels: K(u, v) = exp(-||u - v||² / (2σ²))
- Sigmoid: K(u, v) = tanh(η u·v + ν)

Overfitting?
- Huge feature space with kernels: what about overfitting?
  - Maximizing margin leads to a sparse set of support vectors
  - Some interesting theory says that SVMs search for simple hypotheses with large margin
  - Often robust to overfitting

What about at classification time?
- For a new input x, if we need to represent the feature vector φ(x) explicitly, we are in trouble!
- Recall the classifier: sign(w·φ(x) + b)
- Using kernels, we are cool!

SVMs with kernels
- Choose a set of features and a kernel function
- Solve the dual problem to obtain the support vectors α_i
- At classification time, compute the kernel values between the new input x and the support vectors, and classify as sign(Σ_i α_i y_i K(x_i, x) + b) (see the sketches below)
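To make the "Dot product of polynomials" / "Polynomial kernels" idea concrete, here is a minimal NumPy sketch (not from the original slides; the function names are made up for illustration). It checks numerically that an explicit degree-2 monomial feature map gives the same dot product as the kernel (u·v)², without ever building the features on the kernel side.

```python
import numpy as np
from itertools import combinations

def phi_degree2(x):
    """Explicit degree-2 monomial features: x_i^2 and sqrt(2)*x_i*x_j for i < j."""
    squares = x ** 2
    crosses = np.array([np.sqrt(2) * x[i] * x[j]
                        for i, j in combinations(range(len(x)), 2)])
    return np.concatenate([squares, crosses])

def poly_kernel(u, v, d=2):
    """Polynomial kernel: dot product of all degree-d monomial features,
    computed without ever building those features."""
    return (u @ v) ** d

rng = np.random.default_rng(0)
u, v = rng.normal(size=5), rng.normal(size=5)

explicit = phi_degree2(u) @ phi_degree2(v)   # O(m^2) features built explicitly
implicit = poly_kernel(u, v, d=2)            # just one dot product and a square
print(explicit, implicit)                    # agree up to floating-point error
```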
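Similarly, both the dual objective and the final classifier touch the data only through inner products, so any kernel can be swapped in. A rough sketch follows; the Gaussian-kernel choice and the function names are my own, and the α_i and b are assumed to come from some dual QP solver that is not shown here.

```python
import numpy as np

def gaussian_kernel(u, v, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||u - v||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

def dual_objective(alpha, X, y, kernel=gaussian_kernel):
    """Dual SVM objective:
    sum_i alpha_i - 1/2 sum_{i,j} alpha_i alpha_j y_i y_j K(x_i, x_j).
    It depends on the data only through kernel values."""
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    ay = alpha * y
    return alpha.sum() - 0.5 * ay @ K @ ay

def svm_predict(x_new, X_sv, y_sv, alpha_sv, b, kernel=gaussian_kernel):
    """Kernelized SVM classifier: sign(sum_i alpha_i y_i K(x_i, x_new) + b).
    Only the support vectors (alpha_i > 0) are needed; x_new is never mapped
    into the high-dimensional feature space explicitly."""
    score = sum(a * yi * kernel(xi, x_new)
                for a, yi, xi in zip(alpha_sv, y_sv, X_sv)) + b
    return np.sign(score)
```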
What's the difference between SVMs and Logistic Regression?

                                            SVMs          Logistic Regression
  Loss function                             Hinge loss    Log loss
  High-dimensional features with kernels    Yes           No

Kernels in logistic regression
- Define the weights in terms of the support vectors: w = Σ_j α_j φ(x_j)
- Derive a simple gradient descent rule on the α_i (see the sketch at the end of these notes)

What's the difference between SVMs and Logistic Regression? (Revisited)

                                            SVMs          Logistic Regression
  Loss function                             Hinge loss    Log loss
  High-dimensional features with kernels    Yes           Yes
  Solution sparse                           Often yes     Almost always no
  Semantics of output                       Margin        Real probabilities

What you need to know
- Dual SVM formulation
  - How it's derived
- The kernel trick
- Derive the polynomial kernel
- Common kernels
- Kernelized logistic regression
- Differences between SVMs and logistic regression
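As a rough sketch of the "kernels in logistic regression" idea above (write the weights as w = Σ_j α_j φ(x_j) and do gradient descent on the α_j), here is one possible NumPy implementation. The RBF kernel, 0/1 labels, step size, and function names are my own choices, not from the lecture.

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-d2 / (2 * sigma ** 2))

def kernel_logistic_regression(X, y, sigma=1.0, lr=0.1, n_iters=500):
    """Gradient descent on the alpha_j that define w = sum_j alpha_j phi(x_j).

    The score on example i is then (K alpha)_i, so phi is never built
    explicitly. Labels y are assumed to be 0/1.
    """
    K = rbf_gram(X, sigma)
    alpha = np.zeros(len(y))
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(K @ alpha)))   # predicted P(y=1 | x_i)
        alpha -= lr * (K @ (p - y))              # gradient of the log loss w.r.t. alpha
    return alpha

def predict_proba(X_train, alpha, x_new, sigma=1.0):
    """P(y=1 | x_new), using kernel values against the training points."""
    k = np.exp(-np.sum((X_train - x_new) ** 2, axis=1) / (2 * sigma ** 2))
    return 1.0 / (1.0 + np.exp(-(k @ alpha)))
```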