AUTOMATIC ACQUISITION OF SUBCATEGORIZATION FRAMES FROM UNTAGGED TEXT

Michael R. Brent
MIT AI Lab
545 Technology Square
Cambridge, Massachusetts 02139
[email protected]

ABSTRACT

This paper describes an implemented program that takes a raw, untagged text corpus as its only input (no open-class dictionary) and generates a partial list of verbs occurring in the text and the subcategorization frames (SFs) in which they occur. Verbs are detected by a novel technique based on the Case Filter of Rouvret and Vergnaud (1980). The completeness of the output list increases monotonically with the total number of occurrences of each verb in the corpus. False positive rates are one to three percent of observations. Five SFs are currently detected and more are planned. Ultimately, I expect to provide a large SF dictionary to the NLP community and to train dictionaries for specific corpora.

1 INTRODUCTION

This paper describes an implemented program that takes an untagged text corpus and generates a partial list of verbs occurring in it and the subcategorization frames (SFs) in which they occur. So far, it detects the five SFs shown in Table 1.

    SF Description                 Good Example            Bad Example
    direct object                  greet them              *arrive them
    direct object & clause         tell him he's a fool    *hope him he's a fool
    direct object & infinitive     want him to attend      *hope him to attend
    clause                         know I'll attend        *want I'll attend
    infinitive                     hope to attend          *greet to attend

    Table 1: The five subcategorization frames (SFs) detected so far

The SF acquisition program has been tested on a corpus of 2.6 million words of the Wall Street Journal (kindly provided by the Penn Treebank project). On this corpus, it makes 5101 observations about 2258 orthographically distinct verbs. False positive rates vary from one to three percent of observations, depending on the SF.

1.1 WHY IT MATTERS

Accurate parsing requires knowing the subcategorization frames of verbs, as shown by (1).

(1) a. I expected [NP the man who smoked NP] to eat ice-cream
    b. I doubted [NP the man who liked to eat ice-cream NP]

Current high-coverage parsers tend to use either custom, hand-generated lists of subcategorization frames (e.g., Hindle, 1983), or published, hand-generated lists like the Oxford Advanced Learner's Dictionary of Contemporary English (Hornby and Covey, 1973) (e.g., DeMarcken, 1990). In either case, such lists are expensive to build and to maintain in the face of evolving usage. In addition, they tend not to include rare usages or specialized vocabularies like financial or military jargon. Further, they are often incomplete in arbitrary ways. For example, Webster's Ninth New Collegiate Dictionary lists the sense of strike meaning 'to occur to', as in "it struck him that...", but it does not list that same sense of hit. (My program discovered both.)

1.2 WHY IT'S HARD

The initial priorities in this research were:

• Generality (e.g., minimal assumptions about the text)
• Accuracy in identifying SF occurrences
• Simplicity of design and speed

Efficient use of the available text was not a high priority, since it was felt that plenty of text was available even for an inefficient learner, assuming sufficient speed to make use of it. These priorities had a substantial influence on the approach taken. They are evaluated in retrospect in Section 4.

The first step in finding a subcategorization frame is finding a verb.
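The detection technique itself, built on the Case Filter, is presented later in the paper. Purely as an illustration of the kind of local cue such an approach can exploit, the sketch below (my own construction, not the paper's algorithm) flags a word as a candidate verb when it immediately precedes an unambiguously accusative pronoun, since such a pronoun must receive case from an adjacent case assigner, and the non-verb case assigners (prepositions) form a small closed class. The word lists here are illustrative assumptions.

    # Hypothetical sketch only, not the paper's algorithm. Under the Case
    # Filter, an accusative pronoun must receive case from an adjacent case
    # assigner (a verb or a preposition), so a non-preposition immediately
    # preceding one is a plausible verb occurrence.

    # Pronouns that are unambiguously accusative ("her", "you", and "it"
    # are ambiguous and so omitted).
    ACCUSATIVE_PRONOUNS = {"me", "him", "us", "them"}

    # Prepositions are the other case assigners; a few common ones suffice
    # for a sketch (the full class is closed and could be listed in full).
    PREPOSITIONS = {"of", "to", "in", "for", "with", "on", "at", "by", "from"}

    def candidate_verbs(tokens):
        """Yield words immediately preceding an accusative pronoun,
        excluding prepositions."""
        for word, nxt in zip(tokens, tokens[1:]):
            if nxt.lower() in ACCUSATIVE_PRONOUNS and word.lower() not in PREPOSITIONS:
                yield word.lower()

    print(list(candidate_verbs("I saw him hand them the keys".split())))
    # -> ['saw', 'hand']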
Because of widespread and productive noun/verb ambiguity, dictionaries are not much use: they do not reliably exclude the possibility of lexical ambiguity. Even if they did, a program that could only learn SFs for unambiguous verbs would be of limited value. Statistical disambiguators make dictionaries more useful, but they have a fairly high error rate, and degrade in the presence of many unfamiliar words. Further, it is often difficult to understand where the error is coming from or how to correct it. So finding verbs poses a serious challenge for the design of an accurate, general-purpose algorithm for detecting SFs.

In fact, finding main verbs is more difficult than it might seem. One problem is distinguishing participles from adjectives and nouns, as shown below.

(2) a. John has [NP rented furniture] (comp.: John has often rented apartments)
    b. John was smashed (drunk) last night (comp.: John was kissed last night)
    c. John's favorite activity is watching TV (comp.: John's favorite child is watching TV)

In each case the main verb is have or be in a context where most parsers (and statistical disambiguators) would mistake it for an auxiliary and mistake the following word for a participial main verb.

A second challenge to accuracy is determining which verb to associate a given complement with. Paradoxically, example (1) shows that in general it isn't possible to do this without already knowing the SF. One obvious strategy would be to wait for sentences where there is only one candidate verb; unfortunately, it is very difficult to know for certain how many verbs occur in a sentence. Finding some of the verbs in a text reliably is hard enough; finding all of them reliably is well beyond the scope of this work.

Finally, any system applied to real input, no matter how carefully designed, will occasionally make errors in finding the verb and determining its subcategorization frame. The more times a given verb appears in the corpus, the more likely it is that one of those occurrences will cause an erroneous judgment. For that reason, any learning system that gets only positive examples and makes a permanent judgment on a single example will always degrade as the number of occurrences increases. In fact, making a judgment based on any fixed number of examples runs the same risk: with a nonzero error rate, a misleading run of examples becomes ever more likely as occurrences accumulate.
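To make the degradation argument concrete: if each occurrence of a verb is misanalyzed with some small probability eps, and the learner commits permanently on any single positive example, then the chance that at least one corrupting error has occurred after n occurrences is 1 - (1 - eps)^n, which tends to one as n grows. A minimal sketch, taking eps = 0.02 from the middle of the paper's reported one-to-three-percent false positive range:

    # Probability that at least one of n observations of a verb is
    # erroneous, given a per-observation error rate eps. A learner that
    # commits permanently on a single example is corrupted whenever this
    # event occurs, so its judgments degrade as n grows.

    def p_corrupted(n, eps=0.02):
        return 1 - (1 - eps) ** n

    for n in (1, 10, 50, 200):
        print(f"{n:4d} occurrences: {p_corrupted(n):.2f}")
    # ->    1 occurrences: 0.02
    #      10 occurrences: 0.18
    #      50 occurrences: 0.64
    #     200 occurrences: 0.98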

