A New Statistical Parser Based on Bigram Lexical Dependencies

Michael John Collins*
Dept. of Computer and Information Science
University of Pennsylvania
Philadelphia, PA, 19104, U.S.A.
mcollins@gradient.cis.upenn.edu

Abstract

This paper describes a new statistical parser which is based on probabilities of dependencies between head-words in the parse tree. Standard bigram probability estimation techniques are extended to calculate probabilities of dependencies between pairs of words. Tests using Wall Street Journal data show that the method performs at least as well as SPATTER (Magerman 95; Jelinek et al. 94), which has the best published results for a statistical parser on this task. The simplicity of the approach means the model trains on 40,000 sentences in under 15 minutes. With a beam search strategy parsing speed can be improved to over 200 sentences a minute with negligible loss in accuracy.

1 Introduction

Lexical information has been shown to be crucial for many parsing decisions, such as prepositional-phrase attachment (for example (Hindle and Rooth 93)). However, early approaches to probabilistic parsing (Pereira and Schabes 92; Magerman and Marcus 91; Briscoe and Carroll 93) conditioned probabilities on non-terminal labels and part of speech tags alone. The SPATTER parser (Magerman 95; Jelinek et al. 94) does use lexical information, and recovers labeled constituents in Wall Street Journal text with above 84% accuracy - as far as we know the best published results on this task.

This paper describes a new parser which is much simpler than SPATTER, yet performs at least as well when trained and tested on the same Wall Street Journal data. The method uses lexical information directly by modeling head-modifier¹ relations between pairs of words. In this way it is similar to Link grammars (Lafferty et al. 92), and dependency grammars in general.

* This research was supported by ARPA Grant N6600194-C6043.
¹ By 'modifier' we mean the linguistic notion of either an argument or adjunct.

2 The Statistical Model

The aim of a parser is to take a tagged sentence as input (for example Figure 1(a)) and produce a phrase-structure tree as output (Figure 1(b)). A statistical approach to this problem consists of two components. First, the statistical model assigns a probability to every candidate parse tree for a sentence. Formally, given a sentence S and a tree T, the model estimates the conditional probability P(T | S). The most likely parse under the model is then:

    T_best = argmax_T P(T | S)    (1)

Second, the parser is a method for finding T_best. This section describes the statistical model, while section 3 describes the parser.

The key to the statistical model is that any tree such as Figure 1(b) can be represented as a set of baseNPs² and a set of dependencies as in Figure 1(c). We call the set of baseNPs B, and the set of dependencies D; Figure 1(d) shows B and D for this example. For the purposes of our model, T = (B, D), and:

    P(T | S) = P(B, D | S) = P(B | S) × P(D | S, B)    (2)

S is the sentence with words tagged for part of speech. That is, S = ⟨(w1, t1), (w2, t2) ... (wn, tn)⟩. For POS tagging we use a maximum-entropy tagger described in (Ratnaparkhi 96). The tagger performs at around 97% accuracy on Wall Street Journal text, and is trained on the first 40,000 sentences of the Penn Treebank (Marcus et al. 93).
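To make equations (1) and (2) concrete, here is a minimal sketch of the top-level decomposition, assuming the two component distributions are supplied as functions. The names best_parse, p_b, and p_d, the explicit enumeration of candidate trees, and the toy scores are illustrative assumptions rather than the paper's implementation; the parser of section 3 finds T_best by search rather than by enumerating every candidate.

    from typing import Callable, FrozenSet, List, Tuple

    # Illustrative sketch only: a candidate tree is T = (B, D) as in the text,
    # and P(B | S) and P(D | S, B) are stand-in probability functions.
    Tree = Tuple[FrozenSet[str], FrozenSet[Tuple[str, str]]]

    def best_parse(candidates: List[Tree],
                   p_b: Callable[[FrozenSet[str]], float],
                   p_d: Callable[[FrozenSet[Tuple[str, str]], FrozenSet[str]], float]) -> Tree:
        """T_best = argmax_T P(T | S), where P(T | S) = P(B | S) * P(D | S, B)."""
        return max(candidates, key=lambda t: p_b(t[0]) * p_d(t[1], t[0]))

    # Toy usage with two hypothetical candidates and made-up scores:
    t1 = (frozenset({"[John Smith]"}), frozenset({("Smith", "announced")}))
    t2 = (frozenset({"[John]", "[Smith]"}), frozenset({("John", "announced")}))
    best = best_parse([t1, t2],
                      p_b=lambda b: 0.8 if "[John Smith]" in b else 0.2,
                      p_d=lambda d, b: 0.5)
    print(best)  # -> t1 under these toy scores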
Given S and B, the reduced sentence S̄ is defined as the subsequence of S which is formed by removing punctuation and reducing all baseNPs² to their head-word alone.

² A baseNP or 'minimal' NP is a non-recursive NP, i.e. none of its child constituents are NPs. The term was first used in (Ramshaw and Marcus 95).

(a) John/NNP Smith/NNP , the/DT president/NN of/IN IBM/NNP , announced/VBD his/PRP$ resignation/NN yesterday/NN .

(b) The candidate parse tree, rendered here in bracket notation:

    (S (NP (NP John/NNP Smith/NNP)
           (NP (NP the/DT president/NN)
               (PP of/IN (NP IBM/NNP))))
       (VP announced/VBD
           (NP his/PRP$ resignation/NN)
           (NP yesterday/NN)))

(c) [John Smith] [the president] of [IBM] announced [his resignation] [yesterday]

(d) B = { [John Smith], [the president], [IBM], [his resignation], [yesterday] }
    D = { Smith announced ⟨NP S VP⟩, Smith president ⟨NP NP NP⟩,
          president of ⟨NP NP PP⟩, of IBM ⟨IN PP NP⟩,
          announced resignation ⟨VBD VP NP⟩, announced yesterday ⟨VBD VP NP⟩ }

Figure 1: An overview of the representation used by the model. (a) The tagged sentence; (b) A candidate parse-tree (the correct one); (c) A dependency representation of (b). Square brackets enclose baseNPs (heads of baseNPs are marked in bold). Arrows show modifier → head dependencies. Section 2.1 describes how arrows are labeled with non-terminal triples from the parse-tree. Non-head words within baseNPs are excluded from the dependency structure; (d) B, the set of baseNPs, and D, the set of dependencies, are extracted from (c).

Thus the reduced sentence is an array of word/tag pairs, S̄ = ⟨(w̄1, t̄1), (w̄2, t̄2) ... (w̄m, t̄m)⟩, where m ≤ n. For example, for Figure 1(a):

Example 1  S̄ = ⟨(Smith, NNP), (president, NN), (of, IN), (IBM, NNP), (announced, VBD), (resignation, NN), (yesterday, NN)⟩

Sections 2.1 to 2.4 describe the dependency model. Section 2.5 then describes the baseNP model, which uses bigram tagging techniques similar to (Ramshaw and Marcus 95; Church 88).
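To illustrate the reduced-sentence definition above, the following minimal sketch removes punctuation and collapses each baseNP to its head word, reproducing Example 1. The helper names and the (start, end, head) span encoding are assumptions made for this illustration; in the full model, B would come from the baseNP model of section 2.5 rather than being given by hand.

    # Minimal sketch (illustrative, not the paper's code): build the reduced
    # sentence S-bar by removing punctuation and reducing each baseNP to its
    # head word. baseNP spans are encoded as (start, end, head_index), with
    # end exclusive; this encoding is an assumption made for this example.

    PUNCTUATION = {",", ".", ";", ":", "``", "''"}

    def reduce_sentence(tagged, base_nps):
        """tagged: list of (word, tag) pairs; returns the reduced sentence."""
        heads = {h for _, _, h in base_nps}
        inside = {i for s, e, _ in base_nps for i in range(s, e)}
        reduced = []
        for i, (word, tag) in enumerate(tagged):
            if word in PUNCTUATION:
                continue                      # punctuation is removed
            if i in inside and i not in heads:
                continue                      # drop non-head words of baseNPs
            reduced.append((word, tag))
        return reduced

    # Figure 1(a), with baseNP spans and head indices as in Figure 1(c)/(d):
    tagged = [("John", "NNP"), ("Smith", "NNP"), (",", ","), ("the", "DT"),
              ("president", "NN"), ("of", "IN"), ("IBM", "NNP"), (",", ","),
              ("announced", "VBD"), ("his", "PRP$"), ("resignation", "NN"),
              ("yesterday", "NN"), (".", ".")]
    base_nps = [(0, 2, 1), (3, 5, 4), (6, 7, 6), (9, 11, 10), (11, 12, 11)]

    print(reduce_sentence(tagged, base_nps))
    # -> [('Smith', 'NNP'), ('president', 'NN'), ('of', 'IN'), ('IBM', 'NNP'),
    #     ('announced', 'VBD'), ('resignation', 'NN'), ('yesterday', 'NN')]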
2.1 The Mapping from Tree …