CS 59000 Statistical Machine Learning
Lecture 12
Yuan (Alan) Qi

Outline
• Review of Laplace approximation, BIC, Bayesian logistic regression
• Kernel methods
• Kernel ridge regression
• Kernel construction
• Kernel principal component analysis

Laplace Approximation for Posterior
Gaussian approximation around the mode $z_0$ of $p(z)$:
$q(z) = \mathcal{N}(z \mid z_0, A^{-1})$, where $A = -\nabla\nabla \ln p(z)\big|_{z = z_0}$.

Evidence Approximation
Applying the Laplace approximation to the log marginal likelihood of a model with $M$ parameters:
$\ln p(D) \simeq \ln p(D \mid \theta_{MAP}) + \ln p(\theta_{MAP}) + \frac{M}{2}\ln 2\pi - \frac{1}{2}\ln|A|$.

Bayesian Information Criterion
A cruder approximation of the Laplace evidence approximation:
$\ln p(D) \simeq \ln p(D \mid \theta_{MAP}) - \frac{1}{2} M \ln N$.
BIC penalizes model complexity heavily; when this is too crude, a more accurate evidence approximation is needed.

Bayesian Logistic Regression
Exact Bayesian inference for logistic regression is intractable; the Laplace approximation gives a Gaussian approximation to the posterior over the weights. (A one-dimensional Laplace sketch appears in the code examples after these notes.)

Kernel Methods
Predictions are linear combinations of a kernel function evaluated at the training data points.
A kernel function corresponds to an inner product in a feature space: $k(x, x') = \phi(x)^T \phi(x')$.
Linear kernel: $k(x, x') = x^T x'$.
Stationary kernels depend only on the difference of their arguments: $k(x, x') = k(x - x')$.

Fast Evaluation of Inner Products of Feature Mappings by Kernel Functions
Example (2-D input): $k(x, z) = (x^T z)^2 = \phi(x)^T \phi(z)$ with $\phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)^T$.
Computing the explicit inner product requires evaluating six feature values and $3 \times 3 = 9$ multiplications; evaluating the kernel function requires only two multiplications and a squaring.

Kernel Trick
1. Reformulate an algorithm so that the input vector enters only in the form of inner products $x^T x'$.
2. Replace the input $x$ by its feature mapping: $x \rightarrow \phi(x)$.
3. Replace the inner product by a kernel function: $\phi(x)^T \phi(x') \rightarrow k(x, x')$.
Examples: kernel PCA, kernel Fisher discriminant, support vector machines. (See the quadratic-kernel sketch after these notes.)

Dual Representation for Ridge Regression
Ridge regression cost:
$J(w) = \frac{1}{2} \sum_{n=1}^{N} (w^T \phi(x_n) - t_n)^2 + \frac{\lambda}{2} w^T w$.
Setting the gradient to zero gives $w = \Phi^T a$, with dual variables
$a_n = -\frac{1}{\lambda} (w^T \phi(x_n) - t_n)$.

Kernel Ridge Regression
Using the kernel trick, define the Gram matrix $K = \Phi \Phi^T$, $K_{nm} = k(x_n, x_m)$. Now the cost function depends on the input only through the Gram matrix. Equivalent cost function over the dual variables:
$J(a) = \frac{1}{2} a^T K K a - a^T K t + \frac{1}{2} t^T t + \frac{\lambda}{2} a^T K a$.
Minimizing over the dual variables gives
$a = (K + \lambda I_N)^{-1} t$,
so the prediction for a new input is $y(x) = k(x)^T (K + \lambda I_N)^{-1} t$, where $k(x)$ has elements $k(x_n, x)$. (A NumPy sketch follows these notes.)

Constructing Kernel Functions
Example: consider the Gaussian kernel
$k(x, x') = \exp(-\|x - x'\|^2 / 2\sigma^2)$.
Why is it a valid kernel? Expand $\|x - x'\|^2 = x^T x - 2 x^T x' + x'^T x'$, so
$k(x, x') = \exp(-x^T x / 2\sigma^2) \exp(x^T x' / \sigma^2) \exp(-x'^T x' / 2\sigma^2)$,
which is valid by the standard construction rules: the exponential of a valid kernel is valid, and multiplying by $f(x) f(x')$ preserves validity.
Generalization: replace the inner product by any valid kernel $\kappa$:
$k(x, x') = \exp\{-\frac{1}{2\sigma^2} (\kappa(x, x) - 2\kappa(x, x') + \kappa(x', x'))\}$.

Combining Generative & Discriminative Models by Kernels
Since each modeling approach has distinct advantages, how can we combine them?
• Use generative models to construct kernels.
• Use these kernels in discriminative approaches.

Measuring Probability Similarity by Kernels
Simple inner product: $k(x, x') = p(x)\, p(x')$.
For a mixture distribution: $k(x, x') = \sum_i p(x \mid i)\, p(x' \mid i)\, p(i)$.
For infinite mixture models: $k(x, x') = \int p(x \mid z)\, p(x' \mid z)\, p(z)\, dz$.
For models with latent variables (e.g., hidden Markov models): $k(X, X') = \sum_Z p(X \mid Z)\, p(X' \mid Z)\, p(Z)$, where $X$ and $X'$ are observation sequences. (A mixture-kernel sketch follows these notes.)

Fisher Kernels
Fisher score: $g(\theta, x) = \nabla_\theta \ln p(x \mid \theta)$.
Fisher information matrix: $F = E_x[\, g(\theta, x)\, g(\theta, x)^T \,]$.
Fisher kernel: $k(x, x') = g(\theta, x)^T F^{-1} g(\theta, x')$.
Sample average: $F \simeq \frac{1}{N} \sum_{n=1}^{N} g(\theta, x_n)\, g(\theta, x_n)^T$.
(Sketched after these notes.)

Principal Component Analysis (PCA)
Assume the data has zero mean, $\sum_n x_n = 0$, and define the sample covariance $S = \frac{1}{N} \sum_n x_n x_n^T$. Each $u_i$ is a normalized eigenvector:
$S u_i = \lambda_i u_i$, with $u_i^T u_i = 1$.

Feature Mapping
Map each point into feature space, $x_n \rightarrow \phi(x_n)$, and consider the eigenproblem in feature space:
$C v_i = \lambda_i v_i$, where $C = \frac{1}{N} \sum_n \phi(x_n)\, \phi(x_n)^T$.

Dual Variables
Suppose $\lambda_i > 0$ (why can it not be smaller than 0? $C$ is positive semi-definite). Then we have
$v_i = \sum_n a_{in}\, \phi(x_n)$,
i.e., each eigenvector lies in the span of the mapped data points.

Eigenproblem in Feature Space (1)
Multiplying both sides by $\phi(x_l)^T$, we obtain
$K^2 a_i = \lambda_i N K a_i$,
which for nonzero eigenvalues reduces to
$K a_i = \lambda_i N a_i$.

Eigenproblem in Feature Space (2)
Normalization condition:
$1 = v_i^T v_i = a_i^T K a_i = \lambda_i N\, a_i^T a_i$.
Projection coefficient:
$y_i(x) = \phi(x)^T v_i = \sum_n a_{in}\, k(x, x_n)$.

General Case for Non-zero Mean
Kernel matrix for centered features:
$\tilde{K} = K - 1_N K - K 1_N + 1_N K 1_N$,
where $1_N$ is the $N \times N$ matrix with every element equal to $1/N$. (A kernel PCA sketch using this centering follows these notes.)

Kernel PCA on Synthetic Data
[Figure: contour plots of projection coefficients in feature space.]

Limitations of Kernel PCA
Discussion…
• If $N$ is big, kernel PCA is computationally expensive, since $K$ is $N \times N$ while $S$ is only $D \times D$.
• Not easy for low-rank …
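Code sketches. First, a minimal one-dimensional sketch of the Laplace approximation reviewed above: it finds the mode $z_0$ of an assumed unnormalized log density numerically and sets the Gaussian precision to the negative second derivative at the mode. The target density, the use of SciPy's minimize_scalar, and the finite-difference curvature are illustrative assumptions, not material from the slides.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def log_p(z):
    # Assumed unnormalized log density: Gaussian factor times a sigmoid factor.
    return -0.5 * z**2 + np.log(1.0 / (1.0 + np.exp(-3.0 * z)))

# Mode z0 of p(z): maximize log_p by minimizing its negative.
z0 = minimize_scalar(lambda z: -log_p(z)).x

# Precision A = -(d^2/dz^2) ln p(z) at z0, via a central finite difference.
eps = 1e-4
A = -(log_p(z0 + eps) - 2.0 * log_p(z0) + log_p(z0 - eps)) / eps**2

# Laplace approximation: q(z) = N(z | z0, A^{-1}).
print(f"q(z) = N(z | {z0:.3f}, {1.0 / A:.3f})")
```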
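As a quick check of the feature-map/kernel correspondence from the kernel-trick slide, this sketch verifies numerically that the quadratic kernel $k(x, z) = (x^T z)^2$ equals the inner product of the explicit feature maps $\phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)^T$; the specific test vectors are arbitrary.

```python
import numpy as np

def phi(x):
    # Explicit feature map for 2-D input: phi(x) = (x1^2, sqrt(2) x1 x2, x2^2).
    return np.array([x[0]**2, np.sqrt(2.0) * x[0] * x[1], x[1]**2])

def k(x, z):
    # Kernel evaluation: two multiplications (x^T z in 2-D) and one squaring.
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
assert np.isclose(phi(x) @ phi(z), k(x, z))  # same value, no explicit features needed
```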
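Next, a minimal NumPy sketch of kernel ridge regression as derived above, using the dual solution $a = (K + \lambda I_N)^{-1} t$ with a Gaussian kernel. The bandwidth sigma, regularizer lam, and toy sine data are assumptions for illustration.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    # Gram matrix K[n, m] = exp(-||x_n - x_m||^2 / (2 sigma^2)).
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-sq / (2.0 * sigma**2))

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(50, 1))                 # training inputs
t = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)      # noisy targets

lam = 0.1
K = gaussian_kernel(X, X)
a = np.linalg.solve(K + lam * np.eye(len(X)), t)         # dual variables

X_new = np.linspace(-3.0, 3.0, 5).reshape(-1, 1)
y_new = gaussian_kernel(X_new, X) @ a                    # y(x) = sum_n a_n k(x, x_n)
print(y_new)
```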
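A sketch of the mixture-based probability-similarity kernel, $k(x, x') = \sum_i p(x \mid i)\, p(x' \mid i)\, p(i)$, instantiated with an assumed two-component univariate Gaussian mixture.

```python
import numpy as np

# Assumed two-component Gaussian mixture: p(i), component means and stddevs.
weights = np.array([0.3, 0.7])
means = np.array([-1.0, 2.0])
stds = np.array([0.5, 1.0])

def gauss_pdf(x, mu, s):
    return np.exp(-(x - mu)**2 / (2.0 * s**2)) / (np.sqrt(2.0 * np.pi) * s)

def mixture_kernel(x, x_prime):
    # k(x, x') = sum_i p(x | i) p(x' | i) p(i): points that are likely under
    # the same component get a large kernel value.
    return np.sum(gauss_pdf(x, means, stds) * gauss_pdf(x_prime, means, stds) * weights)

print(mixture_kernel(2.1, 1.8), mixture_kernel(2.1, -1.0))
```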
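A Fisher kernel sketch for the simplest possible generative model, a univariate Gaussian $\mathcal{N}(x \mid \mu, \sigma^2)$ with $\theta = \mu$, applying the score, sample-average information, and kernel definitions from the slide; the model choice and parameter values are assumptions.

```python
import numpy as np

mu, sigma = 0.0, 1.0   # assumed fitted model parameters; theta = mu

def fisher_score(x):
    # g(theta, x) = d/d mu  ln N(x | mu, sigma^2) = (x - mu) / sigma^2.
    return (x - mu) / sigma**2

# Sample-average Fisher information (a scalar for this one-parameter model).
x_data = np.random.default_rng(1).normal(mu, sigma, size=1000)
F = np.mean(fisher_score(x_data) ** 2)

def fisher_kernel(x, x_prime):
    # k(x, x') = g(x) F^{-1} g(x').
    return fisher_score(x) * fisher_score(x_prime) / F

print(fisher_kernel(0.5, -1.2))
```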
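Finally, a kernel PCA sketch implementing the dual eigenproblem $\tilde{K} a_i = (\lambda_i N)\, a_i$ together with the Gram-matrix centering from the non-zero-mean slide; the RBF kernel and random data are assumptions.

```python
import numpy as np

def rbf(X1, X2, sigma=1.0):
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-sq / (2.0 * sigma**2))

X = np.random.default_rng(2).normal(size=(100, 2))
N = len(X)

# Center the Gram matrix: K~ = K - 1_N K - K 1_N + 1_N K 1_N.
K = rbf(X, X)
one_N = np.full((N, N), 1.0 / N)
K_tilde = K - one_N @ K - K @ one_N + one_N @ K @ one_N

# Solve K~ a_i = (lambda_i N) a_i; eigh returns ascending eigenvalues,
# so reverse to put the largest components first.
eigvals, eigvecs = np.linalg.eigh(K_tilde)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Enforce 1 = a_i^T K~ a_i: eigh gives unit-norm a_i, so rescale each by
# 1/sqrt(eigenvalue). Top eigenvalues are positive since K~ is PSD.
n_comp = 2
alphas = eigvecs[:, :n_comp] / np.sqrt(eigvals[:n_comp])

# Projection coefficients y_i(x_n) = sum_m a_{i m} k~(x_n, x_m) for the training set.
Y = K_tilde @ alphas
print(Y.shape)   # (100, 2)
```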