Three Generative, Lexicalised Models for Statistical Parsing

Michael Collins*
Dept. of Computer and Information Science
University of Pennsylvania
Philadelphia, PA, 19104, U.S.A.
mcollins@gradient.cis.upenn.edu

* This research was supported by ARPA Grant N6600194-C6043.

Abstract

In this paper we first propose a new statistical parsing model, which is a generative model of lexicalised context-free grammar. We then extend the model to include a probabilistic treatment of both subcategorisation and wh-movement. Results on Wall Street Journal text show that the parser performs at 88.1/87.5% constituent precision/recall, an average improvement of 2.3% over (Collins 96).

1 Introduction

Generative models of syntax have been central in linguistics since they were introduced in (Chomsky 57). Each sentence-tree pair (S, T) in a language has an associated top-down derivation consisting of a sequence of rule applications of a grammar. These models can be extended to be statistical by defining probability distributions at points of non-determinism in the derivations, thereby assigning a probability P(S, T) to each (S, T) pair. Probabilistic context-free grammar (Booth and Thompson 73) was an early example of a statistical grammar. A PCFG can be lexicalised by associating a head-word with each non-terminal in a parse tree; thus far, (Magerman 95; Jelinek et al. 94) and (Collins 96), which both make heavy use of lexical information, have reported the best statistical parsing performance on Wall Street Journal text. Neither of these models is generative; instead, they both estimate P(T | S) directly.

This paper proposes three new parsing models. Model 1 is essentially a generative version of the model described in (Collins 96). In Model 2, we extend the parser to make the complement/adjunct distinction by adding probabilities over subcategorisation frames for head-words. In Model 3 we give a probabilistic treatment of wh-movement, which is derived from the analysis given in Generalized Phrase Structure Grammar (Gazdar et al. 85). The work makes two advances over previous models. First, Model 1 performs significantly better than (Collins 96), and Models 2 and 3 give further improvements -- our final results are 88.1/87.5% constituent precision/recall, an average improvement of 2.3% over (Collins 96). Second, the parsers in (Collins 96) and (Magerman 95; Jelinek et al. 94) produce trees without information about wh-movement or subcategorisation. Most NLP applications will need this information to extract predicate-argument structure from parse trees.

In the remainder of this paper we describe the 3 models in section 2, discuss practical issues in section 3, give results in section 4, and give conclusions in section 5.

2 The Three Parsing Models

2.1 Model 1

In general, a statistical parsing model defines the conditional probability, P(T | S), for each candidate parse tree T for a sentence S. The parser itself is an algorithm which searches for the tree, T_best, that maximises P(T | S). A generative model uses the observation that maximising P(T, S) is equivalent to maximising P(T | S): [1]

  T_best = argmax_T P(T | S) = argmax_T P(T, S) / P(S) = argmax_T P(T, S)    (1)

P(T, S) is then estimated by attaching probabilities to a top-down derivation of the tree.
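To make the search in equation (1) concrete, the following minimal Python sketch (ours, not part of the paper) represents each candidate tree by its top-down derivation and ranks candidates by the joint probability P(T, S); since P(S) is fixed for a given sentence, this selects the same tree as ranking by P(T | S). The Derivation type and function names are illustrative assumptions only, and a real parser would search the space of trees with dynamic programming rather than enumerate candidates.

```python
# Minimal sketch (not from the paper) of the generative view in equation (1).
# A candidate tree is represented by its top-down derivation: a list of rule
# applications, each carrying the probability the model assigns to it.
import math
from typing import List, Tuple

Derivation = List[Tuple[str, float]]   # (description of a rule application, its probability)

def joint_log_prob(derivation: Derivation) -> float:
    """log P(T, S): the sum of log-probabilities of the derivation steps."""
    return sum(math.log(p) for _, p in derivation)

def best_tree(candidates: List[Derivation]) -> Derivation:
    """argmax_T P(T, S) over an explicit (purely illustrative) candidate list."""
    return max(candidates, key=joint_log_prob)
```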
In a PCFG, for a tree derived by n applications of context-free re-write rules LHS_i -> RHS_i, 1 <= i <= n,

  P(T, S) = \prod_{i=1..n} P(RHS_i | LHS_i)    (2)

The re-write rules are either internal to the tree, where LHS is a non-terminal and RHS is a string of one or more non-terminals; or lexical, where LHS is a part-of-speech tag and RHS is a word.

[1] P(S) is constant, hence maximising P(T, S) / P(S) is equivalent to maximising P(T, S).

Figure 1: A lexicalised parse tree (for the sentence "Last week Marks bought Brooks"; tree diagram not reproduced here), and a list of the rules it contains. For brevity we omit the POS tag associated with each word.

  TOP -> S(bought)
  S(bought) -> NP(week) NP(Marks) VP(bought)
  NP(week) -> JJ(Last) NN(week)
  NP(Marks) -> NNP(Marks)
  VP(bought) -> VB(bought) NP(Brooks)
  NP(Brooks) -> NNP(Brooks)

A PCFG can be lexicalised [2] by associating a word w and a part-of-speech (POS) tag t with each non-terminal X in the tree. Thus we write a non-terminal as X(x), where x = (w, t), and X is a constituent label. Each rule now has the form [3]:

  P(h) -> L_n(l_n) ... L_1(l_1) H(h) R_1(r_1) ... R_m(r_m)    (3)

H is the head-child of the phrase, which inherits the head-word h from its parent P. L_1 ... L_n and R_1 ... R_m are left and right modifiers of H. Either n or m may be zero, and n = m = 0 for unary rules. Figure 1 shows a tree which will be used as an example throughout this paper.

The addition of lexical heads leads to an enormous number of potential rules, making direct estimation of P(RHS | LHS) infeasible because of sparse data problems. We decompose the generation of the RHS of a rule such as (3), given the LHS, into three steps -- first generating the head, then making the independence assumptions that the left and right modifiers are generated by separate 0th-order Markov processes [4]:

1. Generate the head constituent label of the phrase, with probability P_H(H | P, h).

2. Generate modifiers to the right of the head with probability \prod_{i=1..m+1} P_R(R_i(r_i) | P, h, H). R_{m+1}(r_{m+1}) is defined as STOP -- the STOP symbol is added to the vocabulary of non-terminals, and the model stops generating right modifiers when STOP is generated.

3. Generate modifiers to the left of the head with probability \prod_{i=1..n+1} P_L(L_i(l_i) | P, h, H), where L_{n+1}(l_{n+1}) = STOP.

[2] We find lexical heads in Penn treebank data using rules which are similar to those used by (Magerman 95; Jelinek et al. 94).
[3] With the exception of the top rule in the tree, which has the form TOP -> H(h).
[4] An exception is the first rule in the tree, TOP -> H(h), which has probability P_TOP(H, h | TOP).

For example, the probability of the rule S(bought) -> NP(week) NP(Marks) VP(bought) is estimated as

  P_H(VP | S, bought) × P_L(NP(Marks) | S, bought, VP) × P_L(NP(week) | S, bought, VP) × P_L(STOP | S, bought, VP) × P_R(STOP | S, bought, VP)
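The three-step decomposition can be sketched in Python as follows. This is our own illustration rather than the paper's implementation: rule_prob, p_head, p_left and p_right are hypothetical names standing in for the estimated distributions P_H, P_L and P_R.

```python
# Sketch (ours, not the paper's code) of the Model 1 decomposition of
# P(RHS | LHS): generate the head child first, then the left and right
# modifier sequences as separate 0th-order Markov processes, each
# terminated by the distinguished STOP symbol.
from typing import Callable, List, Tuple

HeadWord = Tuple[str, str]              # h = (word, POS tag)
Modifier = Tuple[str, HeadWord]         # lexicalised non-terminal, e.g. ("NP", ("Marks", "NNP"))
STOP: Modifier = ("STOP", ("", ""))     # stopping symbol added to the non-terminal vocabulary

def rule_prob(parent: str, head: HeadWord, head_label: str,
              left: List[Modifier], right: List[Modifier],
              p_head: Callable[[str, str, HeadWord], float],
              p_left: Callable[[Modifier, str, str, HeadWord], float],
              p_right: Callable[[Modifier, str, str, HeadWord], float]) -> float:
    """P(RHS | LHS) = P_H(H | P, h) * prod_i P_L(L_i(l_i) | P, h, H) * prod_i P_R(R_i(r_i) | P, h, H)."""
    prob = p_head(head_label, parent, head)
    for mod in left + [STOP]:           # L_1(l_1) ... L_n(l_n), then STOP
        prob *= p_left(mod, parent, head_label, head)
    for mod in right + [STOP]:          # R_1(r_1) ... R_m(r_m), then STOP
        prob *= p_right(mod, parent, head_label, head)
    return prob
```

Applied to the Figure 1 rule S(bought) -> NP(week) NP(Marks) VP(bought), the left loop contributes the NP(Marks), NP(week) and STOP terms and the right loop contributes only its STOP term, reproducing the five-factor product given above (left modifiers are generated outwards from the head, so NP(Marks) precedes NP(week)).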