1Foundations of Artificial IntelligenceSupport Vector Machines and KernelsCS472 – Fall 2007Thorsten JoachimsOutline• Transform a linear learner into a non-linear learner• Kernels can make high-dimensional spaces tractable• Kernels can make non-vectorial data tractableNon-Linear ProblemsProblem:• some tasks have non-linear structure• no hyperplane is sufficiently accurateHow can SVMs learn non-linear classification rules?ÎExtending the Hypothesis SpaceIdea: add more featuresÎ Learn linear rule in feature space.Example:Î The separating hyperplane in feature space is degreetwo polynomial in input space.Example• Input Space: (2 attributes)• Feature Space:(6 attributes) Dual (Batch) Perceptron Algorithm2Dual SVM Optimization Problem• Primal Optimization Problem• Dual Optimization Problem• Theorem: If w* is the solution of the Primal and α* is the solution of the Dual, thenKernelsProblem: Very many Parameters! Polynomials of degree p over N attributes in input space lead to attributes in feature space!Solution: [Boser et al.] The dual OP depends only on inner products => Kernel FunctionsExample: For calculating computes inner product in feature space.Î no need to represent feature space explicitly.SVM with KernelTraining:Classification:New hypotheses spaces through new Kernels:• Linear:• Polynomial:• Radial Basis Function:• Sigmoid:Examples of KernelsPolynomial Radial Basis FunctionKernels for Non-Vectorial Data• Applications with Non-Vectorial Input Data Æ classify non-vectorial objects– Protein classification (x is string of amino acids)– Drug activity prediction (x is molecule structure)– Information extraction (x is sentence of words)–Etc.• Applications with Non-Vectorial Output DataÆ predict non-vectorial objects– Natural Language Parsing (y is parse tree)– Noun-Phrase Co-reference Resolution (y is clustering)– Search engines (y is ranking)Î Kernels can compute inner products efficiently!Properties of SVMs with Kernels• Expressiveness– Can represent any boolean function (for appropriate choice of kernel)– Can represent any sufficiently “smooth” function to arbitrary accuracy (for appropriate choice of kernel)• Computational– Objective function has no local optima (only one global)– Independent of dimensionality of feature space• Design decisions– Kernel type and parameters– Value of
View Full Document