A New Statistical Parser Based on Bigram Lexical Dependencies

Michael John Collins*
Dept. of Computer and Information Science
University of Pennsylvania
Philadelphia, PA, 19104, U.S.A.
mcollins@gradient.cis.upenn.edu

Abstract

This paper describes a new statistical parser which is based on probabilities of dependencies between head-words in the parse tree. Standard bigram probability estimation techniques are extended to calculate probabilities of dependencies between pairs of words. Tests using Wall Street Journal data show that the method performs at least as well as SPATTER (Magerman 95; Jelinek et al. 94), which has the best published results for a statistical parser on this task. The simplicity of the approach means the model trains on 40,000 sentences in under 15 minutes. With a beam search strategy parsing speed can be improved to over 200 sentences a minute with negligible loss in accuracy.

1 Introduction

Lexical information has been shown to be crucial for many parsing decisions, such as prepositional-phrase attachment (for example (Hindle and Rooth 93)). However, early approaches to probabilistic parsing (Pereira and Schabes 92; Magerman and Marcus 91; Briscoe and Carroll 93) conditioned probabilities on non-terminal labels and part of speech tags alone. The SPATTER parser (Magerman 95; Jelinek et al. 94) does use lexical information, and recovers labeled constituents in Wall Street Journal text with above 84% accuracy - as far as we know the best published results on this task.

This paper describes a new parser which is much simpler than SPATTER, yet performs at least as well when trained and tested on the same Wall Street Journal data. The method uses lexical information directly by modeling head-modifier¹ relations between pairs of words. In this way it is similar to Link grammars (Lafferty et al. 92), and dependency grammars in general.

* This research was supported by ARPA Grant N6600194-C6043.
¹ By 'modifier' we mean the linguistic notion of either an argument or adjunct.

2 The Statistical Model

The aim of a parser is to take a tagged sentence as input (for example Figure 1(a)) and produce a phrase-structure tree as output (Figure 1(b)). A statistical approach to this problem consists of two components. First, the statistical model assigns a probability to every candidate parse tree for a sentence. Formally, given a sentence S and a tree T, the model estimates the conditional probability P(T | S). The most likely parse under the model is then:

    T_best = argmax_T P(T | S)    (1)

Second, the parser is a method for finding T_best. This section describes the statistical model, while section 3 describes the parser.

The key to the statistical model is that any tree such as Figure 1(b) can be represented as a set of baseNPs² and a set of dependencies as in Figure 1(c). We call the set of baseNPs B, and the set of dependencies D; Figure 1(d) shows B and D for this example. For the purposes of our model, T = (B, D), and:

    P(T | S) = P(B, D | S) = P(B | S) × P(D | S, B)    (2)

S is the sentence with words tagged for part of speech. That is, S = ⟨(w1, t1), (w2, t2) ... (wn, tn)⟩. For POS tagging we use a maximum-entropy tagger described in (Ratnaparkhi 96). The tagger performs at around 97% accuracy on Wall Street Journal text, and is trained on the first 40,000 sentences of the Penn Treebank (Marcus et al. 93).
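To make equations (1) and (2) concrete, here is a minimal sketch of the top-level decomposition, assuming the two component distributions are supplied as functions. The names best_parse, p_b, and p_d, the explicit enumeration of candidate trees, and the toy scores are illustrative assumptions rather than the paper's implementation; the parser of section 3 finds T_best by search rather than by enumerating every candidate.

    from typing import Callable, FrozenSet, List, Tuple

    # Illustrative sketch only: a candidate tree is T = (B, D) as in the text,
    # and P(B | S) and P(D | S, B) are stand-in probability functions.
    Tree = Tuple[FrozenSet[str], FrozenSet[Tuple[str, str]]]

    def best_parse(candidates: List[Tree],
                   p_b: Callable[[FrozenSet[str]], float],
                   p_d: Callable[[FrozenSet[Tuple[str, str]], FrozenSet[str]], float]) -> Tree:
        """T_best = argmax_T P(T | S), where P(T | S) = P(B | S) * P(D | S, B)."""
        return max(candidates, key=lambda t: p_b(t[0]) * p_d(t[1], t[0]))

    # Toy usage with two hypothetical candidates and made-up scores:
    t1 = (frozenset({"[John Smith]"}), frozenset({("Smith", "announced")}))
    t2 = (frozenset({"[John]", "[Smith]"}), frozenset({("John", "announced")}))
    best = best_parse([t1, t2],
                      p_b=lambda b: 0.8 if "[John Smith]" in b else 0.2,
                      p_d=lambda d, b: 0.5)
    print(best)  # -> t1 under these toy scores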
Given S and B, the reduced sentence S̄ is defined as the subsequence of S which is formed by removing punctuation and reducing all baseNPs² to their head-word alone.

² A baseNP or 'minimal' NP is a non-recursive NP, i.e. none of its child constituents are NPs. The term was first used in (Ramshaw and Marcus 95).

(a) John/NNP Smith/NNP , the/DT president/NN of/IN IBM/NNP , announced/VBD his/PRP$ resignation/NN yesterday/NN .

(b) The candidate parse tree, rendered here in bracket notation:

    (S (NP (NP John/NNP Smith/NNP)
           (NP (NP the/DT president/NN)
               (PP of/IN (NP IBM/NNP))))
       (VP announced/VBD
           (NP his/PRP$ resignation/NN)
           (NP yesterday/NN)))

(c) [John Smith] [the president] of [IBM] announced [his resignation] [yesterday]

(d) B = { [John Smith], [the president], [IBM], [his resignation], [yesterday] }
    D = { Smith announced ⟨NP S VP⟩, Smith president ⟨NP NP NP⟩,
          president of ⟨NP NP PP⟩, of IBM ⟨IN PP NP⟩,
          announced resignation ⟨VBD VP NP⟩, announced yesterday ⟨VBD VP NP⟩ }

Figure 1: An overview of the representation used by the model. (a) The tagged sentence; (b) A candidate parse-tree (the correct one); (c) A dependency representation of (b). Square brackets enclose baseNPs (heads of baseNPs are marked in bold). Arrows show modifier → head dependencies. Section 2.1 describes how arrows are labeled with non-terminal triples from the parse-tree. Non-head words within baseNPs are excluded from the dependency structure; (d) B, the set of baseNPs, and D, the set of dependencies, are extracted from (c).

Thus the reduced sentence is an array of word/tag pairs, S̄ = ⟨(w̄1, t̄1), (w̄2, t̄2) ... (w̄m, t̄m)⟩, where m ≤ n. For example, for Figure 1(a):

Example 1  S̄ = ⟨(Smith, NNP), (president, NN), (of, IN), (IBM, NNP), (announced, VBD), (resignation, NN), (yesterday, NN)⟩

Sections 2.1 to 2.4 describe the dependency model. Section 2.5 then describes the baseNP model, which uses bigram tagging techniques similar to (Ramshaw and Marcus 95; Church 88).
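To illustrate the reduced-sentence definition above, the following minimal sketch removes punctuation and collapses each baseNP to its head word, reproducing Example 1. The helper names and the (start, end, head) span encoding are assumptions made for this illustration; in the full model, B would come from the baseNP model of section 2.5 rather than being given by hand.

    # Minimal sketch (illustrative, not the paper's code): build the reduced
    # sentence S-bar by removing punctuation and reducing each baseNP to its
    # head word. baseNP spans are encoded as (start, end, head_index), with
    # end exclusive; this encoding is an assumption made for this example.

    PUNCTUATION = {",", ".", ";", ":", "``", "''"}

    def reduce_sentence(tagged, base_nps):
        """tagged: list of (word, tag) pairs; returns the reduced sentence."""
        heads = {h for _, _, h in base_nps}
        inside = {i for s, e, _ in base_nps for i in range(s, e)}
        reduced = []
        for i, (word, tag) in enumerate(tagged):
            if word in PUNCTUATION:
                continue                      # punctuation is removed
            if i in inside and i not in heads:
                continue                      # drop non-head words of baseNPs
            reduced.append((word, tag))
        return reduced

    # Figure 1(a), with baseNP spans and head indices as in Figure 1(c)/(d):
    tagged = [("John", "NNP"), ("Smith", "NNP"), (",", ","), ("the", "DT"),
              ("president", "NN"), ("of", "IN"), ("IBM", "NNP"), (",", ","),
              ("announced", "VBD"), ("his", "PRP$"), ("resignation", "NN"),
              ("yesterday", "NN"), (".", ".")]
    base_nps = [(0, 2, 1), (3, 5, 4), (6, 7, 6), (9, 11, 10), (11, 12, 11)]

    print(reduce_sentence(tagged, base_nps))
    # -> [('Smith', 'NNP'), ('president', 'NN'), ('of', 'IN'), ('IBM', 'NNP'),
    #     ('announced', 'VBD'), ('resignation', 'NN'), ('yesterday', 'NN')]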
2.1 The Mapping from Tree …