Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP 2006), pages 317–326, Sydney, July 2006. © 2006 Association for Computational Linguistics

Better Informed Training of Latent Syntactic Features

Markus Dreyer and Jason Eisner
Department of Computer Science / Center for Language and Speech Processing
Johns Hopkins University
3400 North Charles Street, Baltimore, MD 21218 USA
{markus,jason}@clsp.jhu.edu

Abstract

We study unsupervised methods for learning refinements of the nonterminals in a treebank. Following Matsuzaki et al. (2005) and Prescher (2005), we may for example split NP without supervision into NP[0] and NP[1], which behave differently. We first propose to learn a PCFG that adds such features to nonterminals in such a way that they respect patterns of linguistic feature passing: each node’s nonterminal features are either identical to, or independent of, those of its parent. This linguistic constraint reduces runtime and the number of parameters to be learned. However, it did not yield improvements when training on the Penn Treebank. An orthogonal strategy was more successful: to improve the performance of the EM learner by treebank preprocessing and by annealing methods that split nonterminals selectively. Using these methods, we can maintain high parsing accuracy while dramatically reducing the model size.

1 Introduction

Treebanks never contain enough information; thus PCFGs estimated straightforwardly from the Penn Treebank (Bies et al., 1995) work only moderately well (Charniak, 1996). To address this problem, researchers have used heuristics to add more information. Eisner (1996), Charniak (1997), Collins (1997), and many subsequent researchers[1] annotated every node with lexical features passed up from its “head child,” in order to more precisely reflect the node’s “inside” contents. Charniak (1997) and Johnson (1998) annotated each node with its parent and grandparent nonterminals, to more precisely reflect its “outside” context. Collins (1996) split the sentence label S into two versions, representing sentences with and without subjects. He also modified the treebank to contain different labels for standard and for base noun phrases. Klein and Manning (2003) identified nonterminals that could valuably be split into fine-grained ones using hand-written linguistic rules. Their unlexicalized parser combined several such heuristics with rule markovization and reached a performance similar to early lexicalized parsers.

[1] Not to mention earlier non-PCFG lexicalized statistical parsers, notably Magerman (1995) for the Penn Treebank.

In all these cases, choosing which nonterminals to split, and how, was a matter of art. Ideally such splits would be learned automatically from the given treebank itself. This would be less costly and more portable to treebanks for new domains and languages. One might also hope that the automatically learned splits would be more effective.

Matsuzaki et al. (2005) introduced a model for such learning: PCFG-LA.[2] They used EM to induce fine-grained versions of a given treebank’s nonterminals and rules. We present models that similarly learn to propagate fine-grained features through the tree, but only in certain linguistically motivated ways. Our models therefore allocate a supply of free parameters differently, allowing more fine-grained nonterminals but less fine-grained control over the probabilities of rewriting them.
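To make the latent-annotation idea concrete before it is developed in Section 3, here is a minimal sketch, not taken from the paper's implementation, of how the rules of a toy PCFG could be expanded into annotated variants prior to EM training. The toy grammar, the choice of L = 2, and the small random jitter used to break symmetry among the new variants are assumptions made purely for illustration.

```python
# Minimal sketch of PCFG-LA-style nonterminal splitting (illustrative only).
# Assumptions: a toy grammar, L = 2 latent annotations, and random jitter to
# break symmetry so that EM could later differentiate the annotated variants.
import itertools
import random

L = 2  # predefined number of latent annotations per nonterminal

# Toy treebank grammar: (lhs, rhs) -> probability
pcfg = {
    ("S",  ("NP", "VP")): 1.0,
    ("NP", ("DT", "NN")): 0.7,
    ("NP", ("NN",)):      0.3,
}

def split_rule(lhs, rhs, prob, rng):
    """Expand one rule into all annotated variants, e.g. NP -> DT NN becomes
    NP[0] -> DT[0] NN[0], NP[0] -> DT[0] NN[1], and so on.  Each variant starts
    near prob / L**len(rhs), slightly perturbed, so that for every annotated
    left-hand side the variants of this rule still sum to roughly prob."""
    variants = {}
    for lhs_ann in range(L):
        for rhs_anns in itertools.product(range(L), repeat=len(rhs)):
            ann_lhs = f"{lhs}[{lhs_ann}]"
            ann_rhs = tuple(f"{sym}[{a}]" for sym, a in zip(rhs, rhs_anns))
            base = prob / (L ** len(rhs))
            variants[(ann_lhs, ann_rhs)] = base * rng.uniform(0.95, 1.05)
    return variants

rng = random.Random(0)
pcfg_la = {}
for (lhs, rhs), prob in pcfg.items():
    pcfg_la.update(split_rule(lhs, rhs, prob, rng))

# Every original rule now has L ** (1 + len(rhs)) annotated versions; EM would
# re-estimate their probabilities from the treebank, in which the integer
# annotations are never observed.  (Renormalization per annotated left-hand
# side is omitted here.)
for rule, p in sorted(pcfg_la.items()):
    print(rule, round(p, 4))
```

Under the feature-passing constraint stated in the abstract, a child's annotation would instead be either identical to its parent's or independent of it, which is how the models introduced below can afford more fine-grained nonterminals with fewer free parameters per rule.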
We also present simple methods for deciding selectively (during training) which nonterminals to split and how.

Section 2 describes previous work in finding hidden information in treebanks. Section 3 describes automatically induced feature grammars. We start by describing the PCFG-LA model, then introduce new models that use specific agreement patterns to propagate features through the tree. Section 4 describes annealing-like procedures for training latent-annotation models. Section 5 describes the motivation and results of our experiments. We finish by discussing future work and conclusions in sections 6–7.

[2] Probabilistic context-free grammar with latent annotations.

Citation                   | Observed data                                                | Hidden data
Collins (1997)             | Treebank tree with head child annotated on each nonterminal | No hidden data. Degenerate EM case.
Lari and Young (1990)      | Words                                                        | Parse tree
Pereira and Schabes (1992) | Words and partial brackets                                   | Parse tree
Klein and Manning (2001)   | Part-of-speech tags                                          | Parse tree
Chiang and Bikel (2002)    | Treebank tree                                                | Head child on each nonterminal
Matsuzaki et al. (2005)    | Treebank tree                                                | Integer feature on each nonterminal
INHERIT model (this paper) | Treebank tree and head child heuristics                      | Integer feature on each nonterminal

Table 1: Observed and hidden data in PCFG grammar learning.

2 Partially supervised EM learning

The parameters of a PCFG can be learned with or without supervision. In the supervised case, the complete tree is observed, and the rewrite rule probabilities can be estimated directly from the observed rule counts. In the unsupervised case, only the words are observed, and the learning method must induce the whole structure above them. (See Table 1.)

In the partially supervised case we will consider, some part of the tree is observed, and the remaining information has to be induced. Pereira and Schabes (1992) estimate PCFG parameters from partially bracketed sentences, using the inside-outside algorithm to induce the missing brackets and the missing node labels. Some authors define a complete tree as one that specifies not only a label but also a “head child” for each node. Chiang and Bikel (2002) induce the missing head-child information; Prescher (2005) induces both the head-child information and the latent annotations we will now discuss.

3 Feature Grammars

3.1 The PCFG-LA Model

Staying in the partially supervised paradigm, the PCFG-LA model described in Matsuzaki et al. (2005) observes whole treebank trees, but learns an “annotation” on each nonterminal token—an unspecified and uninterpreted integer that distinguishes otherwise identical nonterminals. Just as Collins manually split the S nonterminal label into S and SG for sentences with and without subjects, Matsuzaki et al. (2005) split S into S[1], S[2], ..., S[L], where L is a predefined number—but they do it automatically and systematically, and not only for S but for every nonterminal. Their partially supervised learning procedure observes trees that are fully bracketed and fully labeled, except for the integer subscript used to annotate each node. After

