Three Generative, Lexicalised Models for Statistical Parsing

Michael Collins*
Dept. of Computer and Information Science
University of Pennsylvania
Philadelphia, PA, 19104, U.S.A.
mcollins@gradient.cis.upenn.edu

* This research was supported by ARPA Grant N6600194-C6043.

Abstract

In this paper we first propose a new statistical parsing model, which is a generative model of lexicalised context-free grammar. We then extend the model to include a probabilistic treatment of both subcategorisation and wh-movement. Results on Wall Street Journal text show that the parser performs at 88.1/87.5% constituent precision/recall, an average improvement of 2.3% over (Collins 96).

1 Introduction

Generative models of syntax have been central in linguistics since they were introduced in (Chomsky 57). Each sentence-tree pair (S, T) in a language has an associated top-down derivation consisting of a sequence of rule applications of a grammar. These models can be extended to be statistical by defining probability distributions at points of non-determinism in the derivations, thereby assigning a probability P(S, T) to each (S, T) pair. Probabilistic context-free grammar (Booth and Thompson 73) was an early example of a statistical grammar. A PCFG can be lexicalised by associating a head-word with each non-terminal in a parse tree; thus far, (Magerman 95; Jelinek et al. 94) and (Collins 96), which both make heavy use of lexical information, have reported the best statistical parsing performance on Wall Street Journal text. Neither of these models is generative; instead, they both estimate P(T | S) directly.

This paper proposes three new parsing models. Model 1 is essentially a generative version of the model described in (Collins 96). In Model 2, we extend the parser to make the complement/adjunct distinction by adding probabilities over subcategorisation frames for head-words. In Model 3 we give a probabilistic treatment of wh-movement, which is derived from the analysis given in Generalized Phrase Structure Grammar (Gazdar et al. 85). The work makes two advances over previous models. First, Model 1 performs significantly better than (Collins 96), and Models 2 and 3 give further improvements -- our final results are 88.1/87.5% constituent precision/recall, an average improvement of 2.3% over (Collins 96). Second, the parsers in (Collins 96) and (Magerman 95; Jelinek et al. 94) produce trees without information about wh-movement or subcategorisation. Most NLP applications will need this information to extract predicate-argument structure from parse trees.

In the remainder of this paper we describe the 3 models in section 2, discuss practical issues in section 3, give results in section 4, and give conclusions in section 5.

2 The Three Parsing Models

2.1 Model 1

In general, a statistical parsing model defines the conditional probability, P(T | S), for each candidate parse tree T for a sentence S. The parser itself is an algorithm which searches for the tree, T_best, that maximises P(T | S). A generative model uses the observation that maximising P(T, S) is equivalent to maximising P(T | S): [1]

  T_best = argmax_T P(T | S) = argmax_T P(T, S) / P(S) = argmax_T P(T, S)    (1)

P(T, S) is then estimated by attaching probabilities to a top-down derivation of the tree.
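To make the search in equation (1) concrete, the following minimal Python sketch (ours, not part of the paper) represents each candidate tree by its top-down derivation and ranks candidates by the joint probability P(T, S); since P(S) is fixed for a given sentence, this selects the same tree as ranking by P(T | S). The Derivation type and function names are illustrative assumptions only, and a real parser would search the space of trees with dynamic programming rather than enumerate candidates.

```python
# Minimal sketch (not from the paper) of the generative view in equation (1).
# A candidate tree is represented by its top-down derivation: a list of rule
# applications, each carrying the probability the model assigns to it.
import math
from typing import List, Tuple

Derivation = List[Tuple[str, float]]   # (description of a rule application, its probability)

def joint_log_prob(derivation: Derivation) -> float:
    """log P(T, S): the sum of log-probabilities of the derivation steps."""
    return sum(math.log(p) for _, p in derivation)

def best_tree(candidates: List[Derivation]) -> Derivation:
    """argmax_T P(T, S) over an explicit (purely illustrative) candidate list."""
    return max(candidates, key=joint_log_prob)
```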
In a PCFG, for a tree derived by n applications of context-free re-write rules LHS_i -> RHS_i, 1 <= i <= n,

  P(T, S) = \prod_{i=1..n} P(RHS_i | LHS_i)    (2)

The re-write rules are either internal to the tree, where LHS is a non-terminal and RHS is a string of one or more non-terminals; or lexical, where LHS is a part-of-speech tag and RHS is a word.

[1] P(S) is constant, hence maximising P(T, S) / P(S) is equivalent to maximising P(T, S).

Figure 1: A lexicalised parse tree (for the sentence "Last week Marks bought Brooks"; tree diagram not reproduced here), and a list of the rules it contains. For brevity we omit the POS tag associated with each word.

  TOP -> S(bought)
  S(bought) -> NP(week) NP(Marks) VP(bought)
  NP(week) -> JJ(Last) NN(week)
  NP(Marks) -> NNP(Marks)
  VP(bought) -> VB(bought) NP(Brooks)
  NP(Brooks) -> NNP(Brooks)

A PCFG can be lexicalised [2] by associating a word w and a part-of-speech (POS) tag t with each non-terminal X in the tree. Thus we write a non-terminal as X(x), where x = (w, t), and X is a constituent label. Each rule now has the form [3]:

  P(h) -> L_n(l_n) ... L_1(l_1) H(h) R_1(r_1) ... R_m(r_m)    (3)

H is the head-child of the phrase, which inherits the head-word h from its parent P. L_1 ... L_n and R_1 ... R_m are left and right modifiers of H. Either n or m may be zero, and n = m = 0 for unary rules. Figure 1 shows a tree which will be used as an example throughout this paper.

The addition of lexical heads leads to an enormous number of potential rules, making direct estimation of P(RHS | LHS) infeasible because of sparse data problems. We decompose the generation of the RHS of a rule such as (3), given the LHS, into three steps -- first generating the head, then making the independence assumptions that the left and right modifiers are generated by separate 0th-order Markov processes [4]:

1. Generate the head constituent label of the phrase, with probability P_H(H | P, h).

2. Generate modifiers to the right of the head with probability \prod_{i=1..m+1} P_R(R_i(r_i) | P, h, H). R_{m+1}(r_{m+1}) is defined as STOP -- the STOP symbol is added to the vocabulary of non-terminals, and the model stops generating right modifiers when STOP is generated.

3. Generate modifiers to the left of the head with probability \prod_{i=1..n+1} P_L(L_i(l_i) | P, h, H), where L_{n+1}(l_{n+1}) = STOP.

[2] We find lexical heads in Penn treebank data using rules which are similar to those used by (Magerman 95; Jelinek et al. 94).
[3] With the exception of the top rule in the tree, which has the form TOP -> H(h).
[4] An exception is the first rule in the tree, TOP -> H(h), which has probability P_TOP(H, h | TOP).

For example, the probability of the rule S(bought) -> NP(week) NP(Marks) VP(bought) is estimated as

  P_H(VP | S, bought) × P_L(NP(Marks) | S, bought, VP) × P_L(NP(week) | S, bought, VP) × P_L(STOP | S, bought, VP) × P_R(STOP | S, bought, VP)
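The three-step decomposition can be sketched in Python as follows. This is our own illustration rather than the paper's implementation: rule_prob, p_head, p_left and p_right are hypothetical names standing in for the estimated distributions P_H, P_L and P_R.

```python
# Sketch (ours, not the paper's code) of the Model 1 decomposition of
# P(RHS | LHS): generate the head child first, then the left and right
# modifier sequences as separate 0th-order Markov processes, each
# terminated by the distinguished STOP symbol.
from typing import Callable, List, Tuple

HeadWord = Tuple[str, str]              # h = (word, POS tag)
Modifier = Tuple[str, HeadWord]         # lexicalised non-terminal, e.g. ("NP", ("Marks", "NNP"))
STOP: Modifier = ("STOP", ("", ""))     # stopping symbol added to the non-terminal vocabulary

def rule_prob(parent: str, head: HeadWord, head_label: str,
              left: List[Modifier], right: List[Modifier],
              p_head: Callable[[str, str, HeadWord], float],
              p_left: Callable[[Modifier, str, str, HeadWord], float],
              p_right: Callable[[Modifier, str, str, HeadWord], float]) -> float:
    """P(RHS | LHS) = P_H(H | P, h) * prod_i P_L(L_i(l_i) | P, h, H) * prod_i P_R(R_i(r_i) | P, h, H)."""
    prob = p_head(head_label, parent, head)
    for mod in left + [STOP]:           # L_1(l_1) ... L_n(l_n), then STOP
        prob *= p_left(mod, parent, head_label, head)
    for mod in right + [STOP]:          # R_1(r_1) ... R_m(r_m), then STOP
        prob *= p_right(mod, parent, head_label, head)
    return prob
```

Applied to the Figure 1 rule S(bought) -> NP(week) NP(Marks) VP(bought), the left loop contributes the NP(Marks), NP(week) and STOP terms and the right loop contributes only its STOP term, reproducing the five-factor product given above (left modifiers are generated outwards from the head, so NP(Marks) precedes NP(week)).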