WSU CSE 6363 - On Kernel Methods for Relational Learning


On Kernel Methods for Relational Learning

Chad Cumby [email protected]
Dan Roth [email protected]
Department of Computer Science, University of Illinois, Urbana, IL 61801 USA

Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), Washington DC, 2003.

Abstract

Kernel methods have gained a great deal of popularity in the machine learning community as a method to learn indirectly in high-dimensional feature spaces. Those interested in relational learning have recently begun to cast learning from structured and relational data in terms of kernel operations.

We describe a general family of kernel functions built up from a description language of limited expressivity and use it to study the benefits and drawbacks of kernel learning in relational domains. Learning with kernels in this family directly models learning over an expanded feature space constructed using the same description language. This allows us to examine issues of time complexity in learning with these and other relational kernels, and how these relate to generalization ability. The tradeoffs between using kernels in a very high-dimensional implicit space versus a restricted feature space are highlighted through two experiments, in bioinformatics and in natural language processing.

1. Introduction

Recently, much interest has been generated in the machine learning community on the subject of learning from relational and structured data via propositional learners (Kramer et al., 2001; Cumby & Roth, 2003a). Examples of relational learning problems include learning to identify functional phrases and named entities from structured parse trees in natural language processing (NLP), learning to classify molecules for mutagenicity from atom-bond data in drug design, and learning a policy to map goals to actions in planning domains. At the same time, work on SVMs and Perceptron-type algorithms has generated interest in kernel methods that simulate learning in high-dimensional feature spaces while working with the original low-dimensional input data.

Haussler's work on convolution kernels (Haussler, 1999) introduced the idea that kernels for discrete data structures could be built up iteratively from kernels for their smaller composite parts. These kernels follow the form of a generalized sum over products, a generalized convolution. Kernels were shown for several discrete datatypes, including strings and rooted trees, and more recently (Collins & Duffy, 2002) developed kernels for datatypes useful in many NLP tasks, demonstrating their usefulness with the Voted Perceptron algorithm (Freund & Schapire, 1998).

While these past examples of relational kernels were formulated separately to meet each problem at hand, we seek a flexible mechanism for building kernel functions for many structured learning problems, based on a unified knowledge representation. At the heart of our approach is a definition of a relational kernel that is specified in a "syntax-driven" manner through the use of a description language. (Cumby & Roth, 2002) introduced a feature description language and showed how to use propositional classifiers to learn successfully over structured data and produce relational representations, in the sense that different data instantiations yield the same features and carry the same weights in the learned linear classifier. There, as in (Roth & Yih, 2001), this was done by significantly blowing up the relational feature space.

Building on this description-language-based approach, this paper develops a corresponding family of parameterized kernel functions for structured data.
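As a background sketch (following the standard formulation, e.g. Haussler, 1999, rather than any formula in this excerpt): a kernel computes an inner product in an implicitly expanded feature space, and convolution kernels build that computation as a generalized sum over products of kernels on parts. Here \Phi denotes the (possibly very high-dimensional) feature map, R^{-1}(x) the set of decompositions of x into parts x_1, ..., x_D, and K_d a kernel on the d-th kind of part:

    K(x, z) = \langle \Phi(x), \Phi(z) \rangle

    K(x, z) = \sum_{\vec{x} \in R^{-1}(x)} \; \sum_{\vec{z} \in R^{-1}(z)} \; \prod_{d=1}^{D} K_d(x_d, z_d)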
In conjunction with an SVM or a Perceptron-like learning algorithm, our parameterized kernels can simulate the exact features generated in the blown-up space to learn a classifier directly from the original structured data. From among several ways to define the distance between structured domain elements, we follow (Khardon et al., 2001) in choosing a definition that provides exactly the same classifiers Perceptron would have produced had we run it on the blown-up discrete feature space rather than directly on the structured data. The parameterized kernel allows us to flexibly define features over structures (or, equivalently, a metric over structures). At the same time, it allows us to choose the degree to which we restrict the size of the expanded space, which affects the efficiency gained or lost, as well as the generalization performance of the classifier, as a result of using it.

Along these lines, we then study time complexity and generalization tradeoffs between working in the expanded feature space and using the corresponding kernels, and between different kernels, in terms of the expansions they correspond to. We show that, while kernel methods provide an interesting and often more comprehensible way to view the feature space, computationally it is often not recommended to use kernels over structured data. This is especially clear when the number of examples is fairly large relative to the data dimensionality.

The remainder of the paper is organized as follows. We first introduce the use of kernel functions in linear classification and the kernel Perceptron algorithm (Khardon et al., 2001). Sec. 3 presents our approach to relational features, which form the higher-dimensional space we seek to simulate. Sec. 4 introduces our kernel function for relational data. Sec. 5 discusses complexity tradeoffs between the kernel Perceptron and the standard feature-based Perceptron, and the issue of generalization. In Sec. 6 we validate our claims by applying the kernel Perceptron algorithm with our enhanced kernel to two problems where a structured feature space is essential.

2. Kernels & Kernel Perceptron

Most machine learning algorithms make use of feature-based representations. A domain element is transformed into a collection of Boolean features, and a labeled example is given by $\langle x, l \rangle \in \{0, 1\}^n \times \{-1, 1\}$. In recent years it has become very common, especially in large-scale applications in NLP and computer vision, to use learning algorithms such as variations of Perceptron and Winnow (Novikoff, 1963; Littlestone, 1988) whose representations are linear over their feature space (Roth, 1998). In these cases, working with features that directly represent domain elements may not be expressive enough, and there are standard ways of enhancing the capabilities of such algorithms. A typical way which has been used in the NLP
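A minimal sketch of the kernel Perceptron referenced above (Khardon et al., 2001) follows; it is an illustrative Python reconstruction of the standard formulation, not code from the paper, and the kernel k and the example encoding are placeholders. Instead of maintaining an explicit weight vector over the expanded feature space, the algorithm stores its past mistakes and predicts with a sum of kernel evaluations against them.

def kernel_perceptron(examples, k, epochs=1):
    """Train on (x, l) pairs with l in {-1, +1}; k(x, z) is a kernel function."""
    mistakes = []  # examples the algorithm misclassified, with their labels
    for _ in range(epochs):
        for x, l in examples:
            # Implicit weight vector: w = sum_i l_i * Phi(x_i) over mistakes,
            # so <w, Phi(x)> = sum_i l_i * k(x_i, x) without ever computing Phi.
            score = sum(l_i * k(x_i, x) for x_i, l_i in mistakes)
            if (1 if score >= 0 else -1) != l:
                mistakes.append((x, l))
    return mistakes

def kernel_predict(mistakes, k, x):
    """Classify x using the stored mistakes and the same kernel."""
    score = sum(l_i * k(x_i, x) for x_i, l_i in mistakes)
    return 1 if score >= 0 else -1

def linear_kernel(x, z):
    # Plain dot product over Boolean feature vectors; with this kernel the
    # procedure reduces to the ordinary Perceptron over {0, 1}^n.
    return sum(a * b for a, b in zip(x, z))

With linear_kernel the sketch behaves like the ordinary Perceptron; substituting a relational kernel corresponds to running Perceptron over the expanded relational feature space, which is the equivalence the paper builds on.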

