Columbia COMS W4115 - Spaniel – Span-based Information Extraction Language

Spaniel – Span-based Information Extraction Language

Adam Lally [[email protected]]
COMS W4115 Programming Languages and Translators
September 28, 2004

Introduction

Spaniel is a new programming language designed to support programming tasks related to information extraction. In general terms, Information Extraction is the task of building structured databases from unstructured, natural-language text. One example is identifying named entities such as persons, places, and organizations and determining relations between them, such as which persons are employed by which organizations. Spaniel is meant to let average programmers write simple information extraction programs while also being a useful tool for the experts who practice in that field.

Background

It is common for an Information Extraction application to begin by annotating its raw input documents. That is, one or more components called annotators scan through the input document and identify spans of text, which are labeled and assigned attributes. A labeled span is referred to as an annotation. A simple annotator might take the input document:

    John Smith works for IBM.

and annotate it as follows:

    <Person gender="male">John Smith</Person> works for <Organization type="corporation">IBM</Organization>.

Note that the use of XML syntax is just a convenient notation for representing annotations on spans of text, and there is no fundamental requirement to use XML for this task.

Annotators often build on the results of other annotators. For example, a second annotator might take the annotated text shown above and infer a WorksFor relation between John Smith and IBM. This could be recorded as an annotation over the entire sentence.

Hence the annotation task can be defined as:

    Given a text document and some (possibly empty) set of annotations over spans of that document, produce a new set of annotations that represent additional information inferred from that document.
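The task definition can be made concrete with a small Java sketch. All names here (`AnnotationSketch`, `Annotation`, `Annotator`, `WorksForAnnotator`) are hypothetical and are not part of Spaniel or of any framework mentioned in this document. The offsets are character positions in the example sentence, and the rule used to infer the relation (a Person followed later by an Organization) is a deliberately naive assumption chosen only to show one annotator building on the output of another.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class AnnotationSketch {

    /** A labeled span [begin, end) over the document text, with attributes. */
    static final class Annotation {
        final String label;
        final int begin; // inclusive character offset
        final int end;   // exclusive character offset
        final Map<String, String> attributes;

        Annotation(String label, int begin, int end, Map<String, String> attributes) {
            this.label = label;
            this.begin = begin;
            this.end = end;
            this.attributes = attributes;
        }

        /** Returns the text this annotation covers. */
        String coveredText(String document) {
            return document.substring(begin, end);
        }
    }

    /** The annotation task: text plus existing annotations in, new annotations out. */
    interface Annotator {
        List<Annotation> annotate(String document, List<Annotation> existing);
    }

    /** Naive rule (an assumption for illustration): a Person followed by an Organization. */
    static final class WorksForAnnotator implements Annotator {
        @Override
        public List<Annotation> annotate(String document, List<Annotation> existing) {
            List<Annotation> inferred = new ArrayList<>();
            for (Annotation person : existing) {
                if (!person.label.equals("Person")) {
                    continue;
                }
                for (Annotation org : existing) {
                    if (org.label.equals("Organization") && org.begin >= person.end) {
                        // The document is a single sentence here, so the relation
                        // is recorded as an annotation over the whole text.
                        inferred.add(new Annotation("WorksFor", 0, document.length(),
                                Map.of("person", person.coveredText(document),
                                       "organization", org.coveredText(document))));
                    }
                }
            }
            return inferred;
        }
    }

    public static void main(String[] args) {
        String document = "John Smith works for IBM.";
        List<Annotation> existing = List.of(
                new Annotation("Person", 0, 10, Map.of("gender", "male")),
                new Annotation("Organization", 21, 24, Map.of("type", "corporation")));
        for (Annotation a : new WorksForAnnotator().annotate(document, existing)) {
            System.out.println(a.label + " " + a.attributes.get("person")
                    + " -> " + a.attributes.get("organization"));
        }
    }
}
```

Run on the example sentence, `main` prints `WorksFor John Smith -> IBM`. Even in this small sketch, much of the code concerns how annotations are represented rather than the inference rule itself.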
A software component that performs this task is called an annotator.

There are several approaches to tackling the annotation problem. Statistical annotators employ machine learning algorithms trained on human-annotated text, while rule-based annotators let their users declaratively specify rules for each type of annotation, which a rule engine then executes against each input document. Both are very active areas of current research, and both have their merits. However, a currently underused approach is the procedural approach: writing an annotator directly in a procedural programming language.

Implementing an annotator directly in code is certainly possible; however, one often ends up writing similar code in each annotator, for example to decide how annotations should be represented and efficiently accessed. These issues are less of a concern for statistical and rule-based annotators, since a single piece of software, once written, can be reapplied in many situations by training it on new data or by supplying a new set of rules.

One way to assist in the development of procedural annotators is to build a software framework that abstracts away some of these issues. In fact, this author is working on just such a project [1]. However, the Java code one writes to implement an annotator can still be somewhat repetitive: many patterns reappear in each annotator one writes. This situation can be improved by building these patterns directly into the language.

Goals

Spaniel is Domain-Specific, Integrated with Java, Intuitive, and Compact, yet Readable.

Domain Specific

Spaniel is specifically designed to support the annotation task described above. The concept of a span, meaning a contiguous section of text, is central to the language, and spans can be manipulated with ease. Arithmetic operators can be applied to compute unions and intersections of spans.
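As a rough illustration of what union and intersection of spans could mean, here is a minimal Java sketch. Spaniel's own operator syntax is not shown in this section, so nothing below should be read as Spaniel code. The class and method names (`SpanOps`, `Span`, `union`, `intersection`) are hypothetical. Because a span is contiguous, the sketch assumes that the union of two spans is the smallest single span covering both.

```java
import java.util.Optional;

class SpanOps {

    /** A contiguous section of text, as character offsets [begin, end). */
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            if (begin > end) {
                throw new IllegalArgumentException("begin must not exceed end");
            }
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + begin + ", " + end + ")";
        }
    }

    /** Assumed semantics: the smallest single span covering both operands. */
    static Span union(Span a, Span b) {
        return new Span(Math.min(a.begin, b.begin), Math.max(a.end, b.end));
    }

    /** The overlapping part of the two spans, or empty if they do not overlap. */
    static Optional<Span> intersection(Span a, Span b) {
        int begin = Math.max(a.begin, b.begin);
        int end = Math.min(a.end, b.end);
        return begin < end ? Optional.of(new Span(begin, end)) : Optional.empty();
    }

    public static void main(String[] args) {
        String document = "John Smith works for IBM.";
        Span name = new Span(0, 10);   // covers "John Smith"
        Span phrase = new Span(5, 16); // covers "Smith works"
        Span both = union(name, phrase);
        System.out.println(both + " " + document.substring(both.begin, both.end));
        intersection(name, phrase).ifPresent(s ->
                System.out.println(s + " " + document.substring(s.begin, s.end)));
    }
}
```

With these inputs the union is `[0, 16)`, covering "John Smith works", and the intersection is `[5, 10)`, covering "Smith". A language with spans as a built-in type could express the same two operations with operators instead of method calls.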
Spans can be assigned labels and attributes, and it is easy to get an iterator over spans meeting certain criteria.

While Spaniel does provide a basic core of programming language capability, it is not intended to be a general-purpose programming language that supplants Java or C++. Developers are expected to code annotation algorithms in Spaniel and use Java for other aspects of their program.

Integrated with Java

Spaniel is an interpreted language that runs within a Java Virtual Machine. As such, it is very easy for a Java program to execute an annotation algorithm written in Spaniel as part of a larger Java application. What's more, the Spaniel language includes a way for a Spaniel program to call a Java method. This makes the power of Java and its extensive class libraries accessible to Spaniel programs, enabling the core annotation algorithm to be written in Spaniel while any complex computations are done in Java, where they belong.

Intuitive

It was decided that Spaniel should be a procedural language, because that is what average programmers know and can do well. While there are undoubtedly advantages to declarative and functional languages, in this author's experience average programmers are not comfortable or effective thinking in this way.

Anyone familiar with Java can easily learn to write programs in Spaniel. The Spaniel syntax is very similar to Java, and the new syntax relates only to the central concepts of spans and iterators over spans, which are easy to learn.

[1] D. Ferrucci and A. Lally. “Building an example application with the Unstructured Information Management Architecture.” IBM Systems Journal, August 2004. http://www.research.ibm.com/journal/sj/433/ferrucci.pdf

Compact, Yet Readable

Part of the reason for creating the Spaniel language was to reduce some of the boilerplate code that is necessary when implementing annotation algorithms in Java.
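To give a sense of the boilerplate in question, the following Java sketch shows a hand-written iterator that yields only the spans meeting some criterion. It is an illustration under stated assumptions, not code from the author's framework: `SpanIteration`, `LabeledSpan`, and `FilteringIterator` are hypothetical names, and the offsets in `main` are made up.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

class SpanIteration {

    /** A span [begin, end) carrying a label. */
    static final class LabeledSpan {
        final String label;
        final int begin;
        final int end;

        LabeledSpan(String label, int begin, int end) {
            this.label = label;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Yields only the spans from the source iterator that satisfy the criteria. */
    static final class FilteringIterator implements Iterator<LabeledSpan> {
        private final Iterator<LabeledSpan> source;
        private final Predicate<LabeledSpan> criteria;
        private LabeledSpan lookahead; // next matching span, or null when exhausted

        FilteringIterator(Iterator<LabeledSpan> source, Predicate<LabeledSpan> criteria) {
            this.source = source;
            this.criteria = criteria;
            advance();
        }

        /** Moves the look-ahead to the next matching span, if there is one. */
        private void advance() {
            lookahead = null;
            while (source.hasNext()) {
                LabeledSpan candidate = source.next();
                if (criteria.test(candidate)) {
                    lookahead = candidate;
                    return;
                }
            }
        }

        @Override
        public boolean hasNext() {
            return lookahead != null;
        }

        @Override
        public LabeledSpan next() {
            if (lookahead == null) {
                throw new NoSuchElementException();
            }
            LabeledSpan result = lookahead;
            advance();
            return result;
        }
    }

    public static void main(String[] args) {
        List<LabeledSpan> spans = List.of(
                new LabeledSpan("Person", 0, 10),
                new LabeledSpan("Organization", 21, 24),
                new LabeledSpan("Person", 30, 38));
        Iterator<LabeledSpan> people =
                new FilteringIterator(spans.iterator(), s -> s.label.equals("Person"));
        while (people.hasNext()) {
            LabeledSpan s = people.next();
            System.out.println(s.label + " [" + s.begin + ", " + s.end + ")");
        }
    }
}
```

Only the predicate passed to the constructor is specific to the annotation algorithm; the rest is scaffolding of the kind a span-aware language could supply itself.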
The amount of code needed to obtain iterators over spans matching certain criteria is greatly reduced. Indeed, since spans are a built-in type in the language, the amount of code for many span-based operations is reduced. This leads to more compact programs and also contributes to readability, since a reader does not have to sift through repetitive boilerplate code to find the important parts.

If compactness is made an end in itself, however, this gain in readability is soon dramatically reversed. Spaniel attempts to achieve compactness only where it increases readability. In particular, it was decided not

