
# Introduction to Kernels


Max Welling, October 1, 2004 (chapters 1, 2, 3, 4)

## 1. Introduction

What is the goal of (pick your favorite name) Machine Learning, Data Mining, Pattern Recognition, Data Analysis, Statistics? Automatic detection of non-coincidental structure in data.

Desiderata:

- Robust algorithms: insensitive to outliers and wrong model assumptions.
- Stable algorithms: generalize well to unseen data.
- Computationally efficient algorithms: scale to large datasets.

## 2. Let's Learn Something

Find the common characteristic (structure) among the following statistical methods:

1. Principal Components Analysis
2. Ridge regression
3. Fisher discriminant analysis
4. Canonical correlation analysis

Answer: we consider linear combinations of the input vector, $f(x) = w^T x$. Linear algorithms are very well understood and enjoy strong guarantees (convexity, generalization bounds). Can we carry these guarantees over to non-linear algorithms?

## 3. Feature Spaces

$$\Phi : x \to \Phi(x), \qquad \mathbb{R}^d \to \mathcal{F}$$

$\Phi$ is a non-linear mapping to a feature space $\mathcal{F}$, which may be:

1. a high-dimensional space $\ell_2$;
2. an infinite-dimensional countable space;
3. a function space (Hilbert space).

Example: $\Phi\big((x, y)\big) = (x^2,\; y^2,\; \sqrt{2}\,xy)$.

## 4. Ridge Regression (Duality)

Problem (target, input, regularization):

$$\min_w \sum_{i=1}^{\ell} (y_i - w^T x_i)^2 + \lambda \|w\|^2$$

Solution:

$$w = (X^T X + \lambda I_d)^{-1} X^T y \qquad (d \times d \text{ inverse})$$
$$\phantom{w} = X^T (X X^T + \lambda I_\ell)^{-1} y \qquad (\ell \times \ell \text{ inverse})$$
$$\phantom{w} = X^T (G + \lambda I_\ell)^{-1} y, \qquad G_{ij} = x_i^T x_j \;\;\text{(Gram matrix)}$$

Hence $w = \sum_{i=1}^{\ell} \alpha_i x_i$: a linear combination of the data. This is the dual representation.

## 5. Kernel Trick

Note: in the dual representation we used the Gram matrix to express the solution.

Kernel trick: replace $x \to \Phi(x)$, so that

$$G_{ij} = x_i^T x_j \;\to\; G_{ij}^{\Phi} = \Phi(x_i)^T \Phi(x_j) = K(x_i, x_j).$$

If we use algorithms that only depend on the Gram matrix $G$, then we never have to know or compute the actual features $\Phi$. This is the crucial point of kernel methods.

## 6. Modularity

Kernel methods consist of two modules:

1. the choice of kernel (this is non-trivial);
2. the algorithm, which takes kernels as input.

Modularity: any kernel can be used with any kernel algorithm.

Some kernels:

$$k(x, y) = e^{-\|x - y\|^2 / c}, \qquad k(x, y) = \big(\langle x, y \rangle + \theta\big)^d,$$
$$k(x, y) = \tanh\big(a \langle x, y \rangle + \theta\big), \qquad k(x, y) = \frac{1}{\sqrt{\|x - y\|^2 + c^2}}.$$

Some kernel algorithms: support vector machines, Fisher discriminant analysis, kernel regression,
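The primal/dual identity of slide 4 and the kernel substitution of slide 5 can be checked numerically. A minimal NumPy sketch (the synthetic data, the ridge parameter `lam`, and the helper `rbf` are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                       # l = 50 samples, d = 3 features
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
lam = 0.1

# Primal solution: w = (X^T X + lam I_d)^{-1} X^T y   (d x d inverse)
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Dual solution: w = X^T (G + lam I_l)^{-1} y, with G_ij = <x_i, x_j>
G = X @ X.T
alpha = np.linalg.solve(G + lam * np.eye(50), y)
w_dual = X.T @ alpha                               # w is a linear combination of the data

print(np.allclose(w_primal, w_dual))               # both routes give the same w

# Kernel trick: the dual only touches inner products, so swap G for a
# kernel matrix, e.g. the RBF kernel k(x, y) = exp(-||x - y||^2 / c).
def rbf(A, B, c=2.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / c)

K = rbf(X, X)
alpha_k = np.linalg.solve(K + lam * np.eye(50), y)
y_hat = rbf(X, X) @ alpha_k                        # nonlinear fit; features never computed
```

Note that the kernelized version never forms $\Phi(x)$ explicitly: predictions are sums $\sum_i \alpha_i k(x_i, x)$, exactly the dual representation above.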
kernel PCA, kernel CCA.

## 7. What Is a Proper Kernel?

Definition: a finitely positive semi-definite function $k(x, y) \in \mathbb{R}$ is a symmetric function of its arguments for which every matrix formed by restriction to any finite subset of points is positive semi-definite:

$$\alpha^T K \alpha \geq 0 \quad \forall \alpha.$$

Theorem: a function $k(x, y) \in \mathbb{R}$ can be written as $k(x, y) = \langle \Phi(x), \Phi(y) \rangle$, where $\Phi$ is a feature map $x \to \Phi(x) \in \mathcal{F}$, if and only if $k(x, y)$ satisfies the semi-definiteness property.

Relevance: we can now check whether $k(x, y)$ is a proper kernel using only properties of $k(x, y)$ itself, i.e. without the need to know the feature map.

## 8. Reproducing Kernel Hilbert Spaces

The proof of the above theorem proceeds by constructing a very special feature map (note that more feature maps may give rise to the same kernel):

$$\Phi : x \to \Phi(x) = k(x, \cdot),$$

i.e. we map to a function space.

Definition of the function space and its inner product:

$$f(\cdot) = \sum_{i=1}^{m} \alpha_i k(x_i, \cdot) \quad (\text{any } m, x_i), \qquad \langle f, g \rangle = \sum_{i=1}^{m} \sum_{j=1}^{l} \alpha_i \beta_j k(x_i, x_j).$$

Reproducing property:

$$\langle f, \Phi(x) \rangle = \langle f, k(x, \cdot) \rangle = \sum_{i=1}^{m} \alpha_i k(x_i, x) = f(x).$$

In particular $\langle k(x, \cdot), k(y, \cdot) \rangle = k(x, y)$, so $\langle \Phi(x), \Phi(y) \rangle = k(x, y)$, and by finite positive semi-definiteness

$$\langle f, f \rangle = \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j k(x_i, x_j) \geq 0.$$

## 9. Mercer's Theorem

Theorem: let $X$ be compact and $k(x, y)$ a symmetric continuous function such that

$$T_k f = \int k(\cdot, x) f(x)\, dx$$

is a positive semi-definite operator ($T_k \geq 0$), i.e.

$$\iint k(x, y)\, f(x)\, f(y)\, dx\, dy \geq 0 \quad \forall f \in L_2(X).$$

Then there exists an orthonormal feature basis of eigenfunctions such that

$$k(x, y) = \sum_{i=1}^{\infty} \Phi_i(x)\, \Phi_i(y).$$

Hence $k(x, y)$ is a proper kernel.

Note: here we construct feature vectors in $L_2$, whereas the RKHS construction was in a function space.

## 10. Learning Kernels

All information is tunneled through the Gram matrix: an information bottleneck. The real art is to pick an appropriate kernel. E.g. take the RBF kernel $k(x, y) = e^{-\|x - y\|^2 / c}$:

- if $c$ is very small, $G \approx I$: all data are dissimilar (over-fitting);
- if $c$ is very large, $G \approx \mathbf{1}$ (all ones): all data are very similar (under-fitting).

We need to learn the kernel. Here are some ways to combine kernels to improve them:

$$k(x, y) = a\, k_1(x, y) + b\, k_2(x, y), \quad a, b \geq 0 \quad \text{(cone of kernels)},$$
$$k(x, y) = k_1(x, y)\, k_2(x, y),$$
$$k(x, y) = p\big(k_1(x, y)\big) \;\text{ for any polynomial } p \text{ with positive coefficients}.$$

## 11. Stability of Kernel Algorithms

Our objective for learning is to improve generalization performance: cross-validation, Bayesian methods, generalization bounds.

Call $E_S[f(x)] = 0$ a pattern in a sample $S$. Is this pattern also likely to be present in new data, i.e. $E_P[f(x)] = 0$? We can use concentration inequalities (McDiarmid's theorem) to prove:

Theorem: let $S = \{x_1, \ldots, x_\ell\}$ be an IID sample from $P$, and define the sample mean of $f(x)$ as $\bar{f} = \frac{1}{\ell} \sum_{i=1}^{\ell} f(x_i)$. Then it follows that

$$P\left( \big\| \bar{f} - E_P[f] \big\| \leq \frac{R}{\sqrt{\ell}} \left( 2 + \sqrt{2 \ln \tfrac{1}{\delta}} \right) \right) \geq 1 - \delta, \qquad R = \sup_x \|f(x)\|.$$

That is, the probability that the sample mean and the population mean differ by less than this bound is at least $1 - \delta$, independent of $P$.

## 12. Rademacher Complexity

Problem: we only checked the generalization performance for a single fixed pattern $f(x)$. What if we want to search over a function class $\mathcal{F}$? Intuition: we need to incorporate the complexity of this function class. Rademacher complexity captures the ability of the function class to fit random noise, with $\sigma_i \in \{-1, +1\}$ uniformly distributed:

$$\hat{R}_\ell(\mathcal{F}) = E_\sigma \left[ \sup_{f \in \mathcal{F}} \left| \frac{2}{\ell} \sum_{i=1}^{\ell} \sigma_i f(x_i) \right| \;\middle|\; x_1, \ldots, x_\ell \right] \quad \text{(empirical RC)},$$
$$R_\ell(\mathcal{F}) = E_S\, E_\sigma \left[ \sup_{f \in \mathcal{F}} \left| \frac{2}{\ell} \sum_{i=1}^{\ell} \sigma_i f(x_i) \right| \right].$$

## 13. Generalization Bound

Theorem: let $f$ be a function in $\mathcal{F}$ which maps to $[0, 1]$ (e.g. loss functions). Then, with probability at least $1 - \delta$ over random draws of size $\ell$, every $f$ satisfies

$$E_P[f(x)] \;\leq\; \hat{E}_{\text{data}}[f(x)] + R_\ell(\mathcal{F}) + \sqrt{\frac{\ln \tfrac{2}{\delta}}{2\ell}} \;\leq\; \hat{E}_{\text{data}}[f(x)] + \hat{R}_\ell(\mathcal{F}) + 3 \sqrt{\frac{\ln \tfrac{2}{\delta}}{2\ell}}.$$

Relevance: the expected pattern $E[f] \approx 0$ will also be present in a new data set if the last two terms are small: the complexity of the function class $\mathcal{F}$ is small, and the number of training data is large.

## 14. Linear Functions in Feature Space

Consider the function class

$$\mathcal{F}_B = \big\{ f : x \to \langle w, \Phi(x) \rangle \;\big|\; \|w\| \leq B \big\}$$

with a sample $S = \{x_1, \ldots, x_\ell\}$. Then the empirical RC of $\mathcal{F}_B$ is bounded by …
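The kernel-combination rules of slide 10 and the finite positive semi-definiteness of slide 7 can be verified numerically on any finite point set, and the over-/under-fitting behavior of the RBF bandwidth $c$ is easy to see in the Gram matrix. A minimal NumPy sketch (the sample `X`, the bandwidth values, and the helper functions are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))        # a small finite point set

def rbf(A, c):
    """RBF Gram matrix: K_ij = exp(-||a_i - a_j||^2 / c)."""
    sq = ((A[:, None, :] - A[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / c)

def poly(A, d=2):
    """Polynomial Gram matrix: K_ij = (<a_i, a_j> + 1)^d."""
    return (A @ A.T + 1.0) ** d

def min_eig(K):
    return np.linalg.eigvalsh(K).min()

K1, K2 = rbf(X, c=2.0), poly(X)

# Closure rules: a*k1 + b*k2 (a, b >= 0) and the product k1*k2 stay
# finitely positive semi-definite, so the smallest eigenvalue of the
# restricted matrix is >= 0 (up to floating-point error).
print(min_eig(3.0 * K1 + 0.5 * K2) >= -1e-9)   # conic combination
print(min_eig(K1 * K2) >= -1e-9)               # elementwise (Schur) product

# The RBF bandwidth c controls how much information survives in G:
print(np.allclose(rbf(X, c=1e-6), np.eye(6)))                   # c tiny: G ~ I
print(np.allclose(rbf(X, c=1e6), np.ones((6, 6)), atol=1e-4))   # c huge: G ~ all ones
```

The last two checks are exactly the over-fitting ($G \approx I$, all data dissimilar) and under-fitting ($G \approx \mathbf{1}$, all data alike) regimes described above.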
