UMD LBSC 796 - Information Extraction Supported Question Answering - D1019153

School name University of Maryland, College Park

Course LBSC 796 - Information Retrieval Systems

Information Extraction Supported Question Answering*

Rohini Srihari and Wei Li
Cymfony Inc.
5500 Main Street
Williamsville, NY 14221
[email protected], [email protected]
(716) 565-9114, fax: (716) 565-0308
15 October, 1999

Abstract

This paper discusses the use of our information extraction (IE) system, Textract, in the question-answering (QA) track of the recently held TREC-8 tests. One of our major objectives is to examine how IE can help IR (Information Retrieval) in applications like QA. Our study shows: (i) IE can provide solid support for QA; (ii) low-level IE like Named Entity tagging is often a necessary component in handling most types of questions; (iii) a robust natural language shallow parser provides a structural basis for handling questions; (iv) high-level domain independent IE, i.e. extraction of multiple relationships and general events, is expected to bring about a breakthrough in QA.

1 Introduction

Natural language QA (Question Answering) is an ideal test bed for demonstrating the power of IE (Information Extraction). In our vision, there is a natural co-operation between IE and IR (Information Retrieval).

An important question then is, what type of IE can support IR in QA and how well does it support it? This forms the major topic of this paper. We structure the remaining part of the paper as follows. In Section 2, we first give an overview of the underlying IE technology that Cymfony has been developing. We then present in Section 3 the use of this technology to implement the prototype for the QA Track. In Section 4, we examine question types and discuss their relationship with IE tasks. Finally, in Section 5, we propose a more sophisticated QA system supported by 3 levels of IE.

* This work was supported in part by the following grants from the Air Force, Rome Laboratories: AFRL/IFKRD Phase 2 (Contract No. F30602-98-C-0043) and AFRL/IFKRD Phase 1 (Contract No. F30602-99-C-0102).

2 Overview of Textract IE

The last decade has seen great advances and interest in the area of IE. In the US, the DARPA sponsored Tipster Text Program [Grishman 1997] and the Message Understanding Conferences (MUC) [MUC-7 1998] have been the driving force for developing this technology. In fact, the MUC specifications for various IE tasks have become de facto standards in the IE research community. It is therefore necessary to present our IE effort in the context of the MUC program.

MUC divides IE into distinct tasks, namely, NE (Named Entity), TE (Template Element), TR (Template Relation), CO (Co-reference), and ST (Scenario Templates) [Chinchor & Marsh 1998]. Our proposal for three levels of IE is modeled after the MUC standards using MUC-style representation. However, we have modified the MUC IE task definitions in order to make them more useful and more practical.

More precisely, we propose a hierarchical, three-level architecture for developing a kernel IE system which is domain-independent throughout.

In fact, for level-1 IE, Cymfony has already developed Textract 1.0, a state-of-the-art NE tagger [Srihari 1998]. Textract 1.0 has obtained a score of 91.24% in combined precision and recall (i.e. F-measure), when tested on the MUC-7 dry run data using the MUC-provided scorer. Our tagging speed, approximately 100 MB/hour on a Pentium system, is also comparable to that of the few deployed NE systems, like NetOwl [Krupka & Hausman 1998] and Nymble [Bikel et al 1997].

It is to be noted that, in our definition of NE, we significantly expanded the types of information to be extracted. In addition to all the MUC defined NE types (person, organization, location, time, date, money and percent), the following entities are also identified by our existing NE tagger:

• duration, frequency, age
• number, fraction, decimal, ordinal, math equation
• weight, length, temperature, angle, area, capacity, speed, rate
• product, software
• address, email, phone, fax, telex, www
• name (default, i.e. proper name which does not belong to any of the above categories)

Sub-type information like company, government agency, school (belonging to the type organization) and military person, religious person (belonging to person) is also identified. These new types or sub-types of named entities provide a better foundation for defining multiple relationships between the identified entities and for supporting question answering functionality. For example, the key to a question processor is to identify the asking point (who, what, when, where, etc.). In many cases, the asking point corresponds to an NE beyond the MUC definition, e.g. the how-type questions: how long (duration or length depending on the question context), how far (length), how often (frequency), how old (age), etc. Therefore, an extended NE tagset is helpful for sophisticated IE and QA.

Level-2 IE, or CE (Correlated Entity), is concerned with extracting pre-defined multiple relationships between the entities. This represents a giant step forward from existing deployed IE systems such as NetOwl, IdentiFinder [MUC-7 1998] as well as Cymfony Textract 1.0, which only output isolated named entities. With CE extraction, salient information is made available to a user since individual, isolated named entities are inter-related. Cymfony has recently implemented a CE prototype. Consider the person entity for example; our CE prototype Textract 2.0 is capable of extracting the following key relationships:

• name: including aliases
• title: e.g. Mr.; Prof; etc.
• subtype: e.g. MILITARY; RELIGIOUS; etc
• age:
• gender: e.g. MALE; FEMALE
• affiliation:
• position:
• birth_time:
• birth_place:
• spouse:
• parents:
• children:
• where_from:
• address:
• phone:
• fax:
• email:
• descriptors:

As shown, the information in the CE represents a mini-CV of the person. In general, our CE template integrates and greatly enriches the information contained in MUC TE and TR. In terms of relationships, there are only a couple of relationships (employee_of, location_of) defined in MUC TR.

The final goal of our IE effort is to further extract open-ended general events (GE, or level 3 IE) for information like WHO did WHAT (to WHOM) WHEN and WHERE. By general events, we refer to argument structures centering around verb notions plus the associated information of time and location. GE is dramatically different from the MUC ST task because it is open-ended and domain independent, while ST is pre-defined and highly domain dependent.

Currently, Cymfony is in the
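
The asking-point idea from Section 2 (how long, how far, how often, how old mapping onto extended NE types) can be illustrated with a small lookup. The following is a minimal sketch, not Textract's implementation: the function name asking_point, the cue table ASKING_POINT_CUES, and the spelling of the tag labels are all assumptions made for illustration. The how-type rows follow the examples in the text; the who/when/where rows are an assumed, simplified mapping onto MUC-style types.

```python
import re

# Hypothetical cue table: question phrase -> candidate NE types.
# The how-type rows mirror the examples in the text; the who/when/where
# rows and all tag spellings are assumptions, not Textract's tagset.
ASKING_POINT_CUES = [
    ("how long", ["DURATION", "LENGTH"]),  # ambiguity left to question context
    ("how far", ["LENGTH"]),
    ("how often", ["FREQUENCY"]),
    ("how old", ["AGE"]),
    ("who", ["PERSON"]),
    ("when", ["TIME", "DATE"]),
    ("where", ["LOCATION"]),
]


def asking_point(question: str) -> list[str]:
    """Return candidate NE types for the first cue found in the question.

    Cues are tried in table order, so the more specific how-type cues
    win over who/when/where. An empty list means no cue matched.
    """
    # Lowercase and collapse runs of whitespace before matching.
    normalized = re.sub(r"\s+", " ", question.lower()).strip()
    for cue, ne_types in ASKING_POINT_CUES:
        # Require word boundaries so a cue does not match inside a longer word.
        if re.search(r"\b" + re.escape(cue) + r"\b", normalized):
            return ne_types
    return []
```

Calling asking_point("How old is the author?") selects the age type, while "how long" returns both duration and length, since the text notes that choosing between them depends on the question context. A real question processor would rely on the shallow parser rather than on surface cues alone.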