0 0 12 views

**Unformatted text preview:**

chapters 1 2 3 4 Introduction to Kernels Max Welling October 1 2004 1 Introduction What is the goal of pick your favorite name Machine Learning Data Mining Pattern Recognition Data Analysis Statistics Automatic detection of non coincidental structure in data Desiderata Robust algorithms insensitive to outliers and wrong model assumptions Stable algorithms generalize well to unseen data Computationally efficient algorithms large datasets 2 Let s Learn Something Find the common characteristic structure among the following statistical methods 1 Principal Components Analysis 2 Ridge regression 3 Fisher discriminant analysis 4 Canonical correlation analysis Answer We consider linear combinations of input vector f x wT x Linear algorithm are very well understood and enjoy strong guarantees convexity generalization bounds Can we carry these guarantees over to non linear algorithms 3 Feature Spaces d F x F x R F non linear mapping to F 1 high D space L2 2 infinite D countable space 3 function space Hilbert space example 2 F 2 x y x y 2 xy 4 Ridge Regression duality l problem min w yi wT xi 2 l w 2 i 1 target solution input regularization w X T X l I d 1 X T y dxd inverse X T XX T l I l 1 y l l inverse X T G l I l 1 y Gij xi x j l xia i i 1 linear comb data Dual Representation Gram matrix 5 Kernel Trick Note In the dual representation we used the Gram matrix to express the solution Kernel Trick Replace x F x kernel Gij xi x j GijF F xi F x j K xi x j If we use algorithms that only depend on the Gram matrix G then we never have to know compute the actual features F This is the crucial point of kernel methods 6 Modularity Kernel methods consist of two modules 1 The choice of kernel this is non trivial 2 The algorithm which takes kernels as input Modularity Any kernel can be used with any kernel algorithm some kernels 2 k x y e x y c k x y x y q d k x y tanh a x y q k x y 1 2 x y c 2 some kernel algorithms support vector machine Fisher discriminant analysis kernel regression kernel PCA kernel CCA 7 What is a proper kernel Definition A finitely positive semi definite function k x y R is a symmetric function of its arguments for which matrices formed by restriction on any finite subset of points is positive semi definite a T Ka 0 a Theorem A function k x y R can be written as k x y F x F y where F x is a feature map x F x F iff k x y satisfies the semi definiteness property Relevance We can now check if k x y is a proper kernel using only properties of k x y itself 8 i e without the need to know the feature map Reproducing Kernel Hilbert Spaces The proof of the above theorem proceeds by constructing a very special feature map note that more feature maps may give rise to a kernel F x F x k x i e we map to a function space definition function space reproducing property m f a i k xi any m xi i 1 m l f g a i b j k xi x j f F x f k x k a i k xi k x i 1 i 1 j 1 m l k f f a ia j k xi x j 0 a k x x f x finite positive semi definite F x F y k x y 9 i 1 j 1 i i i 1 Mercer s Theorem Theorem X is compact k x y is symmetric continuous function s t Tk f B k x f x dx is a positive semi definite operator Tk 0 i e k x y f x f y dxdy 0 f L2 X then there exists an orthonormal feature basis of eigen functions such that k x y F i x F j y i 1 Hence k x y is a proper kernel Note Here we construct feature vectors in L2 where the RKHS construction was in a function space 10 Learning Kernels All information is tunneled through the Gram matrix information bottleneck The real art is to pick an appropriate kernel 2 e g take the RBF kernel k x y e x y c if c is very small G I all data are dissimilar over fitting if c is very large G 1 all data are very similar under fitting We need to learn the kernel Here is some ways to combine kernels to improve them k1 cone a k1 x y b k2 x y k x y a b 0 k2 k x y k x y k x y 1 2 k1 F x F y k x y any positive polynomial 11 Stability of Kernel Algorithms Our objective for learning is to improve generalize performance cross validation Bayesian methods generalization bounds Call ES f x 0 a pattern a sample S Is this pattern also likely to be present in new data EP f x 0 We can use concentration inequalities McDiamid s theorem to prove that Theorem Let S x1 xl be a IID sample from P and define l the sample mean of f x as f 1 f xi then it follows that l i 1 R 1 P f EP f 2 2 ln 1 d d l R sup x f x 12 prob that sample mean and population mean differ less than is more than independent of P Rademacher Complexity Prolem we only checked the generalization performance for a single fixed pattern f x What is we want to search over a function class F Intuition we need to incorporate the complexity of this function class Rademacher complexity captures the ability of the function class to s i 1 fit random noise s i 1 uniform distributed f1 empirical RC 2 l Rl F Es sup s i f xi x1 xl f F l i 1 2 l Rl F ES Es sup s i f xi f F l i 1 xi 13 f2 Generalization Bound Theorem Let f be a function in F which maps to 0 1 e g loss functions Then with probability at least 1 d over random draws of size l every f satisfies 2 ln d E p f x Edata f x Rl F 2l 2 ln d Edata f x Rl F 3 2l Relevance The expected pattern E f 0 will also be present in a new data set if the last 2 terms are small Complexity function class F small 14 number of training data large Linear Functions in feature space FB f x w F x w B Consider the function class with and a sample S x1 xl Then the empirical RC of FB is bounded by l k x y F x F …