Stanford CS 224 - Lecture Notes - D1959590


School name Stanford University

Course CS 224N - Natural Language Processing with Deep Learning

Pages 10

This preview shows page 1-2-3 out of 10 pages.


Coreference Resolution with Decision Tree

CS224N: Final Project
Stanford University, Spring 2008

Sanghoon Kwak, Takahiro Aoyama
[email protected] [email protected]

Abstract

Coreference resolution is the task of determining whether two expressions in a text refer to the same entity. In this paper, we present a machine learning approach, based on a decision tree, to coreference resolution of noun phrases in newswire text. We designed 12 features, such as plurality and gender, and modified the C4.5 decision tree builder to generate a decision tree from our training data. We trained and evaluated our approach on the Automatic Content Extraction (ACE) 2004 dataset and achieved encouraging results.

1. Introduction

Coreference resolution, also known as anaphora resolution, is a much-studied topic in natural language processing. It is an essential technique in many areas, such as information extraction (IE) and question answering systems. The problem has therefore been tackled by many NLP researchers, and various approaches have been proposed. For example, in the early 1990s, Aone and Bonnet (1993) built a decision tree based on annotated Japanese news articles, focusing on zero-anaphora. In 1995, McCarthy and Lehnert's RESOLVE system took another decision tree approach, concentrating on business-related data. More recently, Soon et al. (2001) developed a machine learning approach for coreference resolution of noun phrases. Moreover, Gildea and Jurafsky designed a technique for semantic role labeling for coreference resolution (2005).

In this paper, we first describe how we parsed a large body of training data from news articles in the ACE (Automatic Content Extraction) dataset. We use various features of noun phrases to build the decision tree training set. We then describe the training and testing results on our parsed dataset.
For training, we used the C4.5 decision tree builder, modified to suit our objectives.

2. Parser

In this section, we describe the ACE dataset used for this project and the noun phrase features that we extracted from it.

2.1. ACE 2004 Data Set

We used the ACE dataset from the 2004 corpus. The ACE dataset is designed to help people develop automatic content extraction algorithms. It consists of a separate XML file for each news article, and each XML file provides information about each noun phrase, such as its position in the article, its semantic class, and its coreferent noun phrases. The portion we used consists of 100 news articles from the Broadcast News Program and the AP and NYTimes newswire. The following is a snippet from an XML data file.

    <entity ID="APW20001018.1350.0453-E1" TYPE="PER" CLASS="SPC">
      <entity_mention ID="1-3" TYPE="NAM" LDCTYPE="NAM" LDCATR="TRUE">
        <extent>
          <charseq START="832" END="856">Texas Gov. George W. Bush</charseq>
        </extent>
        <head>
          <charseq START="843" END="856">George W. Bush</charseq>
        </head>
      </entity_mention>
      <entity_mention ID="1-4" TYPE="NOM" LDCTYPE="NOM" LDCATR="TRUE">
        <extent>
          <charseq START="859" END="893">the Republican presidential nominee</charseq>
        </extent>
        <head>
          <charseq START="887" END="893">nominee</charseq>
        </head>
      </entity_mention>

<Figure 1. Sample data>

- ENTITY ID specifies the ID shared by all noun phrases that corefer with each other.
- TYPE specifies the semantic class of the noun phrases under the current entity ID.
- ENTITY MENTION specifies a noun phrase under the current entity ID. More than one entity mention means that there is at least one pair of coreferent noun phrases. For example, in the data above, “Texas Gov. George W. Bush” and “the Republican presidential nominee” are coreferent.

2.2. Selected Features

Based on this corpus, we designed and extracted 12 features to check whether the antecedent noun phrase REi is coreferent with the noun phrase REj. The features include distance, gender match, and plurality match, among others. The features are discussed in detail below.

• Distance Feature
This feature denotes the distance between REi and REj as the number of sentences that separate the two noun phrases. We extracted it by locating REi and REj and counting the stop marks between them. Because a stop mark does not always end a sentence, as in “Oct. 10” and “Mr. Bush”, we handpicked several patterns in which the stop mark should not be treated as the end of a sentence.

• IsPronoun Feature
This feature is set to true if a noun phrase is a pronoun. We compared the noun phrase against lists of pronouns: personal pronouns (he, him, you), possessive pronouns (your, her), and reflexive pronouns (yourself, herself). This feature was extracted for both REi and REj.

• String Match Feature
This feature is set to true if REi and REj have the same character sequence.

• Definite NP Feature
This feature checks whether both noun phrases are definite. In practice, we check whether both noun phrases begin with the word “the”.

• Demonstrative NP Feature
This feature checks whether the noun phrases are demonstrative. It is set to true if REj starts with the article “a” or “an”, or with a demonstrative (this, that, these, those).

• Number Agreement Feature
This feature checks whether both noun phrases agree in number, that is, whether both are plural or both are singular. To determine the number of a noun phrase, we first check whether it is a pronoun. If so, we compare it with lists of known plural and singular pronouns. If the noun phrase is a proper noun, we simply check whether it ends with “s”. Otherwise, we determine its plurality from the morphological root of the noun.
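The surface-level features described so far (distance, IsPronoun, string match, definite NP, demonstrative NP) can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the abbreviation patterns, the pronoun list, and all function names are assumptions, since the paper does not list its handpicked patterns or word lists.

```python
import re

# Hypothetical exception patterns: the paper handpicked its own list, which
# is not given, so these abbreviations are illustrative only.
NOT_SENTENCE_END = re.compile(
    r"(?:\b(?:Mr|Mrs|Ms|Dr|Gov|Jan|Feb|Aug|Sept|Oct|Nov|Dec)|\b[A-Z])\.$"
)

# Assumed pronoun list (personal, possessive, reflexive); not from the paper.
PRONOUNS = {
    "i", "me", "you", "he", "him", "she", "her", "it", "we", "us", "they",
    "them", "my", "your", "his", "its", "our", "their", "myself", "yourself",
    "himself", "herself", "itself", "ourselves", "themselves",
}
DEMONSTRATIVE_STARTS = {"a", "an", "this", "that", "these", "those"}


def sentence_distance(text, end_i, start_j):
    """Count stop marks that end a sentence between REi's end and REj's start."""
    count = 0
    for pos in range(end_i, start_j):
        if text[pos] != ".":
            continue
        # Inspect the few characters up to and including this stop mark.
        if NOT_SENTENCE_END.search(text[max(0, pos - 6):pos + 1]):
            continue  # abbreviation or initial, not a sentence end
        count += 1
    return count


def first_word(np):
    words = np.split()
    return words[0].lower() if words else ""


def is_pronoun(np):
    return np.strip().lower() in PRONOUNS


def string_match(re_i, re_j):
    return re_i.strip() == re_j.strip()


def both_definite(re_i, re_j):
    return first_word(re_i) == "the" and first_word(re_j) == "the"


def is_demonstrative(re_j):
    return first_word(re_j) in DEMONSTRATIVE_STARTS


def surface_features(text, re_i, re_j, end_i, start_j):
    """Collect the surface features for one (REi, REj) pair."""
    return {
        "distance": sentence_distance(text, end_i, start_j),
        "i_pronoun": is_pronoun(re_i),
        "j_pronoun": is_pronoun(re_j),
        "string_match": string_match(re_i, re_j),
        "definite": both_definite(re_i, re_j),
        "demonstrative": is_demonstrative(re_j),
    }
```

Here `end_i` and `start_j` are character offsets into the article text, in the spirit of the START and END attributes in the sample data. A fixed abbreviation list like this one will miss cases, which is presumably why the patterns were chosen by hand from the corpus.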
For morphological analysis, we used PC-KIMMO, open source software that, given a lexicon, grammar, and rules, determines the morphological root of a word.

• Semantic Class Agreement Feature
This feature checks whether both REi and REj belong to the same semantic class. The class information is directly
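As a rough sketch of how the semantic class could be read from the data, the example below parses the sample entity from Figure 1 with Python's standard `xml.etree.ElementTree` and compares the entity-level TYPE attribute, which Section 2.1 describes as the semantic class. It makes several assumptions: a closing `</entity>` tag is added so the snippet parses on its own, the `<head>` elements are omitted for brevity, and the function names and dictionary keys are hypothetical. Real ACE files contain more structure than this single element.

```python
import xml.etree.ElementTree as ET

# Sample entity from Figure 1, with </entity> added (assumption) so that it
# is well-formed on its own; <head> elements are left out for brevity.
SAMPLE = """
<entity ID="APW20001018.1350.0453-E1" TYPE="PER" CLASS="SPC">
  <entity_mention ID="1-3" TYPE="NAM" LDCTYPE="NAM" LDCATR="TRUE">
    <extent>
      <charseq START="832" END="856">Texas Gov. George W. Bush</charseq>
    </extent>
  </entity_mention>
  <entity_mention ID="1-4" TYPE="NOM" LDCTYPE="NOM" LDCATR="TRUE">
    <extent>
      <charseq START="859" END="893">the Republican presidential nominee</charseq>
    </extent>
  </entity_mention>
</entity>
"""


def mentions_with_class(entity_xml):
    """Return one record per entity mention, tagged with the entity's class."""
    entity = ET.fromstring(entity_xml)
    records = []
    for mention in entity.findall("entity_mention"):
        seq = mention.find("extent/charseq")
        records.append({
            "entity_id": entity.get("ID"),
            "semantic_class": entity.get("TYPE"),  # e.g. "PER"
            "text": seq.text,
            "start": int(seq.get("START")),
            "end": int(seq.get("END")),
        })
    return records


def same_semantic_class(re_i, re_j):
    """Semantic class agreement: both mentions carry the same TYPE."""
    return re_i["semantic_class"] == re_j["semantic_class"]


def is_coreferent(re_i, re_j):
    """Gold label: mentions under the same entity ID are coreferent."""
    return re_i["entity_id"] == re_j["entity_id"]


if __name__ == "__main__":
    re_i, re_j = mentions_with_class(SAMPLE)
    print(re_i["text"], "|", re_j["text"])
    print(same_semantic_class(re_i, re_j), is_coreferent(re_i, re_j))
```

Mentions drawn from different entities would be paired the same way, with `is_coreferent` supplying the training label and `same_semantic_class` supplying one feature value.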
