An Algorithm that Learns What's in a Name

DANIEL M. BIKEL†    [email protected]
RICHARD SCHWARTZ    [email protected]
RALPH M. WEISCHEDEL*    [email protected]
BBN Systems & Technologies, 70 Fawcett Street, Cambridge, MA 02138
Telephone: (617) 873-3496

† Daniel M. Bikel's current address is Department of Computer & Information Science, University of Pennsylvania, 200 South 33rd Street, Philadelphia, PA 19104.
* Please address correspondence to this author.

Running head: What's in a Name

Keywords: named entity extraction, hidden Markov models

Abstract. In this paper, we present IdentiFinder™, a hidden Markov model that learns to recognize and classify names, dates, times, and numerical quantities. We have evaluated the model in English (based on data from the Sixth and Seventh Message Understanding Conferences [MUC-6, MUC-7] and broadcast news), in Spanish (based on data distributed through the First Multilingual Entity Task [MET-1]), and on speech input (based on broadcast news). We report results here on standard materials only, namely MUC-6 and MET-1, to quantify performance on data available to the community. Results have been consistently better than those reported by any other learning algorithm. IdentiFinder's performance is competitive with approaches based on handcrafted rules on mixed-case text and superior on text where case information is not available. We also present a controlled experiment showing the effect of training-set size on performance, demonstrating that as little as 100,000 words of training data is adequate to achieve performance around 90% on newswire. Although we present our understanding of why this algorithm performs so well on this class of problems, we believe that significant improvement in performance may still be possible.

1. The Named Entity Problem and Evaluation

1.1. The Named Entity Task

The named entity task is to identify all named locations, named persons, named organizations, dates, times, monetary amounts, and percentages in text (see Figure 1.1). Though this sounds clear, enough special cases arise to require lengthy guidelines, e.g., when is The Wall Street Journal an artifact, and when is it an organization? When is White House an organization, and when a location? Are branch offices of a bank an organization? Is a street name a location? Should yesterday and last Tuesday be labeled dates? Is mid-morning a time? In order to achieve human annotator consistency, guidelines with numerous special cases have been defined for the Seventh Message Understanding Conference, MUC-7 (Chinchor, 1998).

The delegation, which included the commander of the [U.N.]_3 troops in [Bosnia]_1, Lt. Gen. Sir [Michael Rose]_2, went to the Serb stronghold of [Pale]_1, near [Sarajevo]_1, for talks with Bosnian Serb leader [Radovan Karadzic]_2.

Este ha sido el primer comentario público del presidente [Clinton]_2 respecto a la crisis de [Oriente Medio]_1 desde que el secretario de Estado, [Warren Christopher]_2, decidiera regresar precipitadamente a [Washington]_1 para impedir la ruptura del proceso de paz tras la violencia desatada en el sur de [Líbano]_1.
(English: This has been President Clinton's first public comment on the Middle East crisis since Secretary of State Warren Christopher decided to return hastily to Washington to prevent the breakdown of the peace process after the violence unleashed in southern Lebanon.)

Labels: 1. Locations   2. Persons   3. Organizations

Figure 1.1 Examples. Examples of correct labels for English text and for Spanish text.

Both the boundaries of an expression and its label must be marked. The Standard Generalized Markup Language, or SGML, is an abstract syntax for marking information and structure in text, and is therefore appropriate for named entity mark-up. Various GUIs to support manual preparation of answer keys are available.
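To illustrate, here is the first sentence of Figure 1.1 bracketed in the SGML conventions used in the MUC evaluations, where an ENAMEX tag delimits each name and its TYPE attribute carries the label (the label inventory is defined in Section 1.2). The bracketing below is our sketch, not reproduced from an official answer key:

The delegation, which included the commander of the <ENAMEX TYPE="ORGANIZATION">U.N.</ENAMEX> troops in <ENAMEX TYPE="LOCATION">Bosnia</ENAMEX>, Lt. Gen. Sir <ENAMEX TYPE="PERSON">Michael Rose</ENAMEX>, went to the Serb stronghold of <ENAMEX TYPE="LOCATION">Pale</ENAMEX>, near <ENAMEX TYPE="LOCATION">Sarajevo</ENAMEX>, for talks with Bosnian Serb leader <ENAMEX TYPE="PERSON">Radovan Karadzic</ENAMEX>.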
1.2. Evaluation Metric

A computer program called a "scoring program" is used to evaluate the performance of a name-finder. The scoring program developed for the MUC and Multilingual Entity Task (MET) evaluations measures both precision (P) and recall (R), terms borrowed from the information-retrieval community, where

    P = \frac{\text{number of correct responses}}{\text{number of responses}}  \quad\text{and}\quad  R = \frac{\text{number of correct responses}}{\text{number correct in key}}.    (1.1)

(The term response is used to denote "answer delivered by a name-finder"; the term key or key file is used to denote "an annotated file containing correct answers".) Put informally, recall measures the number of "hits" vs. the number of possible correct answers as specified in the key, whereas precision measures how many answers were correct ones compared to the number of answers delivered. These two measures of performance combine to form one measure of performance, the F-measure, which is computed as the uniformly weighted harmonic mean of precision and recall:

    F = \frac{RP}{\frac{1}{2}(R + P)}.    (1.2)

In MUC and MET, a correct answer from a name-finder is one where the label and both boundaries are correct. There are three types of labels, each of which uses an attribute to specify a particular entity. Label types and the entities they denote are defined as follows:

1. entity (ENAMEX): person, organization, location
2. time expression (TIMEX): date, time
3. numeric expression (NUMEX): money, percent.

A response is half correct if the label (both type and attribute) is correct but only one boundary is correct. Alternatively, a response is half correct if only the type of the label (and not the attribute) and both boundaries are correct. Automatic scoring software is available, as detailed in Chinchor (1998).

2. Why

2.1. Why the Named Entity (NE) Problem

First and foremost, we chose to work on the named entity (NE) problem because it seemed both to be solvable and to have applications. The NE problem has generated much interest, as evidenced by its inclusion as an understanding task to be evaluated in both the Sixth and Seventh Message Understanding Conferences (MUC-6 and MUC-7) and in the First and Second Multilingual Entity Task evaluations (MET-1 and MET-2). Furthermore, at least one commercial product has emerged: NameTag™ from IsoQuest. The NE task had been defined by a set of annotator guidelines, an evaluation metric, and example data (Sundheim & Chinchor, 1995).

1. MATSUSHITA ELECTRIC INDUSTRIAL CO. HAS REACHED AGREEMENT …
2. IF ALL GOES WELL, MATSUSHITA AND ROBERT BOSCH WILL …
3. VICTOR CO. OF JAPAN (JVC) AND SONY CORP. …
4. IN A FACTORY OF BLAUPUNKT WERKE, A ROBERT BOSCH SUBSIDIARY, …
5. TOUCH PANEL SYSTEMS, CAPITALIZED AT 50 MILLION YEN, IS OWNED …
6. MATSUSHITA EILL DECIDE ON THE PRODUCTION SCALE. …

Figure 2.1 English Examples. Finding …
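To make equations (1.1) and (1.2) and the half-credit rules of Section 1.2 concrete, the following minimal Python sketch scores an aligned response/key pair and combines the accumulated credit into P, R, and F. This is our illustration, not the official scorer of Chinchor (1998); in particular, the Mention record layout and the assumption that a half-correct response contributes 0.5 of a correct response to both numerators of (1.1) are ours.

from dataclasses import dataclass

@dataclass(frozen=True)
class Mention:
    """One labeled expression: a label type (ENAMEX/TIMEX/NUMEX), its
    attribute (e.g., PERSON, DATE, MONEY), and token boundaries."""
    label_type: str
    attr: str
    start: int  # index of first token
    end: int    # index of last token

def credit(resp: Mention, key: Mention) -> float:
    """Credit for one response scored against one key entry, per the
    rules of Section 1.2: 1.0 if the label (type and attribute) and
    both boundaries match; 0.5 if the label matches but only one
    boundary does, or if both boundaries and the type (but not the
    attribute) match; 0.0 otherwise."""
    label_ok = resp.label_type == key.label_type and resp.attr == key.attr
    starts = resp.start == key.start
    ends = resp.end == key.end
    if label_ok and starts and ends:
        return 1.0
    if label_ok and (starts or ends):
        return 0.5
    if resp.label_type == key.label_type and starts and ends:
        return 0.5
    return 0.0

def precision_recall_f(total_credit: float, n_responses: int, n_key: int):
    """Equations (1.1) and (1.2): precision, recall, and their
    uniformly weighted harmonic mean."""
    p = total_credit / n_responses
    r = total_credit / n_key
    f = r * p / (0.5 * (r + p))
    return p, r, f

# Hypothetical counts: 80 fully correct and 10 half-correct responses
# out of 100 delivered, against 95 entries in the key.
p, r, f = precision_recall_f(80 + 0.5 * 10, 100, 95)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")  # P=0.850 R=0.895 F=0.872

Note that a real scorer must first align responses with key entries (handling spurious and missing responses); the sketch above assumes that alignment has already been done.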

