Lecture notes for Stat 231: Pattern Recognition and Machine Learning

Outline:
1. Stat 231. A.L. Yuille. Fall 2004
2. Beyond Linear Classifiers
3. Which Feature Vectors?
4. The Kernel Trick
5. The Kernel Trick (continued)
6. Learning with Kernels
7. Example Kernels
8. Kernels, Mercer's Theorem
9. Kernels, Mercer's Theorem (continued)
10. Kernel Examples
11. Kernel PCA
12. Summary

1. Stat 231. A.L. Yuille. Fall 2004

Hyperplanes with features.
The "kernel trick".
Mercer's theorem.
Kernels for discrimination, PCA, support vectors.
Read 5.11 of Duda, Hart, and Stork, or, better, 12.3 of Hastie, Tibshirani, and Friedman.

2. Beyond Linear Classifiers

Increase the dimension of the data using feature vectors, then search for a linear hyperplane between the features.

Logical XOR. XOR requires the decision rule "output 1 if exactly one of x1, x2 equals 1." This is impossible with a linear classifier: no hyperplane in the (x1, x2) plane separates {(0,1), (1,0)} from {(0,0), (1,1)}.

Define the feature vector
  y(x) = (x1, x2, x1 x2).
Then XOR is solved by a hyperplane in feature space, e.g.
  x1 + x2 - 2 x1 x2 = 1/2.

3. Which Feature Vectors?

With sufficiently many feature vectors we can perform any classification by applying the linear separation algorithms in feature space.

Two problems:
1. How do we select the features?
2. How do we achieve generalization and prevent overlearning?

The kernel trick simplifies both problems. (But we won't address (2) for a few lectures.)

4. The Kernel Trick

Define the kernel
  K(x, y) = phi(x) . phi(y),
where phi(x) is the feature vector of x.

Claim: the linear separation algorithms in feature space depend on the data only through the dot products phi(x_a) . phi(x_b).

Claim: we can therefore use all the results on linear separation (previous two lectures) by replacing all dot products x_a . x_b with kernel evaluations K(x_a, x_b).
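As a concrete check of the claims above — a minimal sketch of my own, not from the lecture — the degree-2 homogeneous polynomial kernel K(x, y) = (x . y)^2 on 2-D inputs corresponds exactly to the explicit feature map phi(x) = (x1^2, sqrt(2) x1 x2, x2^2), so the kernel value and the feature-space dot product agree:

```python
# Sketch: verify K(x, y) = phi(x) . phi(y) for the degree-2
# polynomial kernel K(x, y) = (x . y)^2 in two dimensions.
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def poly_kernel(x, y):
    # Kernel evaluated directly, without touching feature space.
    return dot(x, y) ** 2

def phi(x):
    # Explicit feature map for this kernel.
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

x, y = (1.0, 2.0), (3.0, -1.0)
k_direct = poly_kernel(x, y)        # (1*3 + 2*(-1))^2 = 1.0
k_features = dot(phi(x), phi(y))    # same value via explicit features
```

The point of the trick is the left-hand side: `poly_kernel` costs a dot product in the original (here 2-D) space, while `phi` lives in a higher-dimensional space we never need to construct.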
5. The Kernel Trick (continued)

Hyperplanes in feature space are the surfaces w . phi(x) + b = 0, with associated classifier sign(w . phi(x) + b).

Determine the classifier that maximizes the margin, as in the previous lecture, replacing each data point x_a by its feature vector phi(x_a). The dual problem depends on the data only through the dot products phi(x_a) . phi(x_b); replace them by K(x_a, x_b).

Solve the dual to get the multipliers {alpha_a}, which depend only on K. Then the solution is
  w = sum_a alpha_a z_a phi(x_a),
so the classifier is
  sign( sum_a alpha_a z_a K(x_a, x) + b ),
where z_a in {-1, +1} are the class labels.

6. Learning with Kernels

All the material in the previous lecture — margins, support vectors, the primal and dual problems — can be adapted directly by replacing the dot product with the kernel. Just specify the kernel; don't bother with the features.

The kernel trick depends on the quadratic nature of the learning problem. It can be applied to other quadratic problems, e.g. PCA.

7. Example Kernels

Popular kernels are the polynomial kernel
  K(x, y) = (x . y + c)^d,
and the Gaussian kernel
  K(x, y) = exp( -|x - y|^2 / (2 sigma^2) ),
where c, d, and sigma are constants.

What conditions, if any, need we put on kernels to ensure that they can be derived from features?

8. Kernels, Mercer's Theorem

For a finite dataset {x_1, ..., x_N}, express the kernel as a matrix with components K_ab = K(x_a, x_b). The matrix is symmetric and positive semidefinite, with eigenvalues lambda_mu >= 0 and eigenvectors e^mu. Then
  K_ab = sum_mu lambda_mu e^mu_a e^mu_b.
Feature vectors: phi_mu(x_a) = sqrt(lambda_mu) e^mu_a, so that phi(x_a) . phi(x_b) = K_ab.

9. Kernels, Mercer's Theorem (continued)

Mercer's theorem extends this result to functional analysis (F.A.). Most results in linear algebra can be extended to F.A. ("matrices with infinite dimensions"). E.g., we define eigenfunctions of K by
  integral K(x, y) psi_mu(y) dy = lambda_mu psi_mu(x),
requiring finite integral integral K(x, y)^2 dx dy. Provided K is positive definite, the features are phi_mu(x) = sqrt(lambda_mu) psi_mu(x). Almost any kernel is okay.

10. Kernel Examples

[Figure of kernel discrimination — not reproduced.]
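The finite-dataset construction above can be checked numerically. The sketch below (my illustration, not the lecture's code; the 5-point dataset and sigma = 1 are arbitrary choices) builds a Gaussian kernel matrix, confirms it is positive semidefinite, and recovers feature vectors phi_mu(x_a) = sqrt(lambda_mu) e^mu_a whose dot products reproduce K:

```python
# Sketch: the finite-sample version of Mercer's theorem.
# K = E diag(lam) E^T  =>  features Phi = E sqrt(diag(lam))
# satisfy Phi Phi^T = K.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))          # 5 arbitrary points in 2-D

# Gaussian kernel matrix with sigma = 1.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-0.5 * sq_dists)

# K is symmetric, so eigh gives a real spectrum and orthonormal eigenvectors.
lam, E = np.linalg.eigh(K)

# Row a of Phi is the feature vector of x_a; clip guards against
# tiny negative eigenvalues from floating-point round-off.
Phi = E * np.sqrt(np.clip(lam, 0.0, None))
```

After running this, `lam` is non-negative (up to round-off) and `Phi @ Phi.T` matches `K`, which is exactly the claim that any positive semidefinite kernel matrix can be derived from features.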
11. Kernel PCA

The kernel trick can be applied to any quadratic problem.

PCA: seek the eigenvectors and eigenvalues of the covariance
  C = (1/N) sum_a x_a x_a^T,
where, w.l.o.g., sum_a x_a = 0 (centered data).

In feature space, replace x_a by phi(x_a):
  C = (1/N) sum_a phi(x_a) phi(x_a)^T.
All eigenvectors of C with non-zero eigenvalue are of the form
  v = sum_a alpha_a phi(x_a).
Substituting reduces the eigenvalue problem to solving
  K alpha = N lambda alpha.
Then the projection of phi(x) onto v is
  v . phi(x) = sum_a alpha_a K(x_a, x).

12. Summary

The kernel trick allows us to do linear separation in feature space.
Just specify the kernel; there is no need to explicitly specify the features.
Replace the dot product with the kernel.
This allows classifications that are impossible using linear separation on the original data.
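The kernel-PCA recipe from slide 11 can be sketched in a few lines. This is my own illustration under two stated assumptions: the kernel matrix passed in is already centered in feature space, and the top eigenvalues are non-zero (so the 1/sqrt normalization is valid); the eigenvector normalization uses v . v = alpha^T K alpha = mu |alpha|^2:

```python
# Sketch: kernel PCA via the dual eigenproblem K alpha = mu alpha
# (mu = N lambda in the slide's notation).
import numpy as np

def kernel_pca_projections(K, n_components=2):
    """Project the training points onto the top kernel principal components.

    K : (N, N) centered kernel matrix, K_ab = phi(x_a) . phi(x_b).
    Returns an (N, n_components) array of projections.
    """
    lam, alpha = np.linalg.eigh(K)              # ascending eigenvalues
    lam, alpha = lam[::-1], alpha[:, ::-1]      # sort descending
    # Each eigenvector v = sum_a alpha_a phi(x_a) has squared length
    # alpha^T K alpha = lam, so rescale alpha to make v a unit vector.
    alpha = alpha[:, :n_components] / np.sqrt(lam[:n_components])
    # Projection of x_b onto v: v . phi(x_b) = sum_a alpha_a K_ab.
    return K @ alpha
```

As a sanity check, with the linear kernel K = X X^T on centered data this reproduces ordinary PCA projections up to a sign flip of each component.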