
# Introduction to Kernels


Max Welling, October 1, 2004 (chapters 1, 2, 3, 4)

## 1. Introduction

What is the goal of (pick your favorite name) Machine Learning, Data Mining, Pattern Recognition, Data Analysis, Statistics? Automatic detection of non-coincidental structure in data.

Desiderata:

- Robust algorithms: insensitive to outliers and wrong model assumptions.
- Stable algorithms: generalize well to unseen data.
- Computationally efficient algorithms: scale to large datasets.

## 2. Let's Learn Something

Find the common characteristic (structure) among the following statistical methods:

1. Principal Components Analysis
2. Ridge regression
3. Fisher discriminant analysis
4. Canonical correlation analysis

Answer: we consider linear combinations of the input vector, $f(x) = w^T x$. Linear algorithms are very well understood and enjoy strong guarantees (convexity, generalization bounds). Can we carry these guarantees over to non-linear algorithms?

## 3. Feature Spaces

$$\Phi : x \to \Phi(x), \qquad \mathbb{R}^d \to \mathcal{F}$$

$\Phi$ is a non-linear mapping to a feature space $\mathcal{F}$, which may be:

1. a high-dimensional space $\ell_2$;
2. an infinite-dimensional countable space;
3. a function space (Hilbert space).

Example: $\Phi\big((x, y)\big) = (x^2,\; y^2,\; \sqrt{2}\,xy)$.

## 4. Ridge Regression (Duality)

Problem (target, input, regularization):

$$\min_w \sum_{i=1}^{\ell} (y_i - w^T x_i)^2 + \lambda \|w\|^2$$

Solution:

$$w = (X^T X + \lambda I_d)^{-1} X^T y \qquad (d \times d \text{ inverse})$$
$$\phantom{w} = X^T (X X^T + \lambda I_\ell)^{-1} y \qquad (\ell \times \ell \text{ inverse})$$
$$\phantom{w} = X^T (G + \lambda I_\ell)^{-1} y, \qquad G_{ij} = x_i^T x_j \;\;\text{(Gram matrix)}$$

Hence $w = \sum_{i=1}^{\ell} \alpha_i x_i$: a linear combination of the data. This is the dual representation.

## 5. Kernel Trick

Note: in the dual representation we used the Gram matrix to express the solution.

Kernel trick: replace $x \to \Phi(x)$, so that

$$G_{ij} = x_i^T x_j \;\to\; G_{ij}^{\Phi} = \Phi(x_i)^T \Phi(x_j) = K(x_i, x_j).$$

If we use algorithms that only depend on the Gram matrix $G$, then we never have to know or compute the actual features $\Phi$. This is the crucial point of kernel methods.

## 6. Modularity

Kernel methods consist of two modules:

1. the choice of kernel (this is non-trivial);
2. the algorithm, which takes kernels as input.

Modularity: any kernel can be used with any kernel algorithm.

Some kernels:

$$k(x, y) = e^{-\|x - y\|^2 / c}, \qquad k(x, y) = \big(\langle x, y \rangle + \theta\big)^d,$$
$$k(x, y) = \tanh\big(a \langle x, y \rangle + \theta\big), \qquad k(x, y) = \frac{1}{\sqrt{\|x - y\|^2 + c^2}}.$$

Some kernel algorithms: support vector machines, Fisher discriminant analysis, kernel regression,
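The primal/dual identity of slide 4 and the kernel substitution of slide 5 can be checked numerically. A minimal NumPy sketch (the synthetic data, the ridge parameter `lam`, and the helper `rbf` are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                       # l = 50 samples, d = 3 features
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
lam = 0.1

# Primal solution: w = (X^T X + lam I_d)^{-1} X^T y   (d x d inverse)
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Dual solution: w = X^T (G + lam I_l)^{-1} y, with G_ij = <x_i, x_j>
G = X @ X.T
alpha = np.linalg.solve(G + lam * np.eye(50), y)
w_dual = X.T @ alpha                               # w is a linear combination of the data

print(np.allclose(w_primal, w_dual))               # both routes give the same w

# Kernel trick: the dual only touches inner products, so swap G for a
# kernel matrix, e.g. the RBF kernel k(x, y) = exp(-||x - y||^2 / c).
def rbf(A, B, c=2.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / c)

K = rbf(X, X)
alpha_k = np.linalg.solve(K + lam * np.eye(50), y)
y_hat = rbf(X, X) @ alpha_k                        # nonlinear fit; features never computed
```

Note that the kernelized version never forms $\Phi(x)$ explicitly: predictions are sums $\sum_i \alpha_i k(x_i, x)$, exactly the dual representation above.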
kernel PCA, kernel CCA.

## 7. What Is a Proper Kernel?

Definition: a finitely positive semi-definite function $k(x, y) \in \mathbb{R}$ is a symmetric function of its arguments for which every matrix formed by restriction to any finite subset of points is positive semi-definite:

$$\alpha^T K \alpha \geq 0 \quad \forall \alpha.$$

Theorem: a function $k(x, y) \in \mathbb{R}$ can be written as $k(x, y) = \langle \Phi(x), \Phi(y) \rangle$, where $\Phi$ is a feature map $x \to \Phi(x) \in \mathcal{F}$, if and only if $k(x, y)$ satisfies the semi-definiteness property.

Relevance: we can now check whether $k(x, y)$ is a proper kernel using only properties of $k(x, y)$ itself, i.e. without the need to know the feature map.

## 8. Reproducing Kernel Hilbert Spaces

The proof of the above theorem proceeds by constructing a very special feature map (note that more feature maps may give rise to the same kernel):

$$\Phi : x \to \Phi(x) = k(x, \cdot),$$

i.e. we map to a function space.

Definition of the function space and its inner product:

$$f(\cdot) = \sum_{i=1}^{m} \alpha_i k(x_i, \cdot) \quad (\text{any } m, x_i), \qquad \langle f, g \rangle = \sum_{i=1}^{m} \sum_{j=1}^{l} \alpha_i \beta_j k(x_i, x_j).$$

Reproducing property:

$$\langle f, \Phi(x) \rangle = \langle f, k(x, \cdot) \rangle = \sum_{i=1}^{m} \alpha_i k(x_i, x) = f(x).$$

In particular $\langle k(x, \cdot), k(y, \cdot) \rangle = k(x, y)$, so $\langle \Phi(x), \Phi(y) \rangle = k(x, y)$, and by finite positive semi-definiteness

$$\langle f, f \rangle = \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j k(x_i, x_j) \geq 0.$$

## 9. Mercer's Theorem

Theorem: let $X$ be compact and $k(x, y)$ a symmetric continuous function such that

$$T_k f = \int k(\cdot, x) f(x)\, dx$$

is a positive semi-definite operator ($T_k \geq 0$), i.e.

$$\iint k(x, y)\, f(x)\, f(y)\, dx\, dy \geq 0 \quad \forall f \in L_2(X).$$

Then there exists an orthonormal feature basis of eigenfunctions such that

$$k(x, y) = \sum_{i=1}^{\infty} \Phi_i(x)\, \Phi_i(y).$$

Hence $k(x, y)$ is a proper kernel.

Note: here we construct feature vectors in $L_2$, whereas the RKHS construction was in a function space.

## 10. Learning Kernels

All information is tunneled through the Gram matrix: an information bottleneck. The real art is to pick an appropriate kernel. E.g. take the RBF kernel $k(x, y) = e^{-\|x - y\|^2 / c}$:

- if $c$ is very small, $G \approx I$: all data are dissimilar (over-fitting);
- if $c$ is very large, $G \approx \mathbf{1}$ (all ones): all data are very similar (under-fitting).

We need to learn the kernel. Here are some ways to combine kernels to improve them:

$$k(x, y) = a\, k_1(x, y) + b\, k_2(x, y), \quad a, b \geq 0 \quad \text{(cone of kernels)},$$
$$k(x, y) = k_1(x, y)\, k_2(x, y),$$
$$k(x, y) = p\big(k_1(x, y)\big) \;\text{ for any polynomial } p \text{ with positive coefficients}.$$

## 11. Stability of Kernel Algorithms

Our objective for learning is to improve generalization performance: cross-validation, Bayesian methods, generalization bounds.

Call $E_S[f(x)] = 0$ a pattern in a sample $S$. Is this pattern also likely to be present in new data, i.e. $E_P[f(x)] = 0$? We can use concentration inequalities (McDiarmid's theorem) to prove:

Theorem: let $S = \{x_1, \ldots, x_\ell\}$ be an IID sample from $P$, and define the sample mean of $f(x)$ as $\bar{f} = \frac{1}{\ell} \sum_{i=1}^{\ell} f(x_i)$. Then it follows that

$$P\left( \big\| \bar{f} - E_P[f] \big\| \leq \frac{R}{\sqrt{\ell}} \left( 2 + \sqrt{2 \ln \tfrac{1}{\delta}} \right) \right) \geq 1 - \delta, \qquad R = \sup_x \|f(x)\|.$$

That is, the probability that the sample mean and the population mean differ by less than this bound is at least $1 - \delta$, independent of $P$.

## 12. Rademacher Complexity

Problem: we only checked the generalization performance for a single fixed pattern $f(x)$. What if we want to search over a function class $\mathcal{F}$? Intuition: we need to incorporate the complexity of this function class. Rademacher complexity captures the ability of the function class to fit random noise, with $\sigma_i \in \{-1, +1\}$ uniformly distributed:

$$\hat{R}_\ell(\mathcal{F}) = E_\sigma \left[ \sup_{f \in \mathcal{F}} \left| \frac{2}{\ell} \sum_{i=1}^{\ell} \sigma_i f(x_i) \right| \;\middle|\; x_1, \ldots, x_\ell \right] \quad \text{(empirical RC)},$$
$$R_\ell(\mathcal{F}) = E_S\, E_\sigma \left[ \sup_{f \in \mathcal{F}} \left| \frac{2}{\ell} \sum_{i=1}^{\ell} \sigma_i f(x_i) \right| \right].$$

## 13. Generalization Bound

Theorem: let $f$ be a function in $\mathcal{F}$ which maps to $[0, 1]$ (e.g. loss functions). Then, with probability at least $1 - \delta$ over random draws of size $\ell$, every $f$ satisfies

$$E_P[f(x)] \;\leq\; \hat{E}_{\text{data}}[f(x)] + R_\ell(\mathcal{F}) + \sqrt{\frac{\ln \tfrac{2}{\delta}}{2\ell}} \;\leq\; \hat{E}_{\text{data}}[f(x)] + \hat{R}_\ell(\mathcal{F}) + 3 \sqrt{\frac{\ln \tfrac{2}{\delta}}{2\ell}}.$$

Relevance: the expected pattern $E[f] \approx 0$ will also be present in a new data set if the last two terms are small: the complexity of the function class $\mathcal{F}$ is small, and the number of training data is large.

## 14. Linear Functions in Feature Space

Consider the function class

$$\mathcal{F}_B = \big\{ f : x \to \langle w, \Phi(x) \rangle \;\big|\; \|w\| \leq B \big\}$$

with a sample $S = \{x_1, \ldots, x_\ell\}$. Then the empirical RC of $\mathcal{F}_B$ is bounded by …
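The kernel-combination rules of slide 10 and the finite positive semi-definiteness of slide 7 can be verified numerically on any finite point set, and the over-/under-fitting behavior of the RBF bandwidth $c$ is easy to see in the Gram matrix. A minimal NumPy sketch (the sample `X`, the bandwidth values, and the helper functions are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))        # a small finite point set

def rbf(A, c):
    """RBF Gram matrix: K_ij = exp(-||a_i - a_j||^2 / c)."""
    sq = ((A[:, None, :] - A[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / c)

def poly(A, d=2):
    """Polynomial Gram matrix: K_ij = (<a_i, a_j> + 1)^d."""
    return (A @ A.T + 1.0) ** d

def min_eig(K):
    return np.linalg.eigvalsh(K).min()

K1, K2 = rbf(X, c=2.0), poly(X)

# Closure rules: a*k1 + b*k2 (a, b >= 0) and the product k1*k2 stay
# finitely positive semi-definite, so the smallest eigenvalue of the
# restricted matrix is >= 0 (up to floating-point error).
print(min_eig(3.0 * K1 + 0.5 * K2) >= -1e-9)   # conic combination
print(min_eig(K1 * K2) >= -1e-9)               # elementwise (Schur) product

# The RBF bandwidth c controls how much information survives in G:
print(np.allclose(rbf(X, c=1e-6), np.eye(6)))                   # c tiny: G ~ I
print(np.allclose(rbf(X, c=1e6), np.ones((6, 6)), atol=1e-4))   # c huge: G ~ all ones
```

The last two checks are exactly the over-fitting ($G \approx I$, all data dissimilar) and under-fitting ($G \approx \mathbf{1}$, all data alike) regimes described above.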
