Proceedings of SSST-5, Fifth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 98–106, ACL HLT 2011, Portland, Oregon, USA, June 2011. © 2011 Association for Computational Linguistics

Automatic Category Label Coarsening for Syntax-Based Machine Translation

Greg Hanneman and Alon Lavie
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA 15213 USA
{ghannema, alavie}@cs.cmu.edu

Abstract

We consider SCFG-based MT systems that get syntactic category labels from parsing both the source and target sides of parallel training data. The resulting joint nonterminals often lead to needlessly large label sets that are not optimized for an MT scenario. This paper presents a method of iteratively coarsening a label set for a particular language pair and training corpus. We apply this label collapsing on Chinese–English and French–English grammars, obtaining test-set improvements of up to 2.8 BLEU, 5.2 TER, and 0.9 METEOR on Chinese–English translation. An analysis of label collapsing's effect on the grammar and the decoding process is also given.

1 Introduction

A common modeling choice among syntax-based statistical machine translation systems is the use of synchronous context-free grammar (SCFG), where a source-language string and a target-language string are produced simultaneously by applying a series of rewrite rules. Given a parallel corpus that has been statistically word-aligned and annotated with constituency structure on one or both sides, SCFG models for MT can be learned via a variety of methods. Parsing may be applied on the source side (Liu et al., 2006), on the target side (Galley et al., 2004), or on both sides of the parallel corpus (Lavie et al., 2008; Zhechev and Way, 2008).

In any of these cases, using the raw label set from source- and/or target-side parsers can be undesirable.
Label sets used in statistical parsers are usually inherited directly from monolingual treebank projects, where the inventory of category labels was designed by independent teams of human linguists. These label sets are not necessarily ideal for statistical parsing, let alone for bilingual syntax-based translation models. Further, the side(s) on which syntax is represented defines the nonterminal label space used by the resulting SCFG. A pair of aligned adjectives, for example, may be labeled ADJ if only source-side syntax is used, JJ if only target-side syntax is used, or ADJ::JJ if syntax from both sides is used in the grammar. Beyond such differences, however, most existing SCFG-based MT systems do not further modify the nonterminal label set in use. Those that do require either specialized decoders or complicated parameter tuning, or the label set may be unsatisfactory from a computational point of view (Section 2).

We believe that representing both source-side and target-side syntax is important. Even assuming two monolingually perfect label sets for the source and target languages, using label information from only one side ignores any meaningful constraints expressed in the labels of the other. On the other hand, using the default node labels from both sides generates a joint nonterminal set of thousands of unique labels, not all of which may be useful. Our real preference is to use a joint nonterminal set adapted to our particular language pair or translation task.

In this paper, we present the first step towards a tailored label set: collapsing syntactic categories to remove the most redundant labels and shrink the overall source–target nonterminal set.[1] There are two problems with an overly large label set.

[1] The complementary operation, splitting existing labels, is beyond the scope of this paper and is left for future work.

First, it encourages labeling ambiguity among rules, a well-known practical problem in SCFG-based MT.
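As a toy illustration of this first problem (the rules and joint labels below are hypothetical examples, not taken from the paper), one right-hand side extracted from training data can surface under several different left-hand-side labels, each producing a separate rule in the grammar:

```python
from collections import defaultdict

# Hypothetical extracted SCFG rules as (lhs, rhs) pairs. The same
# right-hand side appears under three different joint lhs labels,
# yielding three distinct rules for what is arguably one pattern.
rules = [
    ("NP::NP",  ("ADJ::JJ", "N::NN")),
    ("NP::NX",  ("ADJ::JJ", "N::NN")),
    ("NML::NP", ("ADJ::JJ", "N::NN")),
    ("VP::VP",  ("V::VB", "NP::NP")),
]

# Group rules by right-hand side and count distinct left-hand-side labels.
lhs_by_rhs = defaultdict(set)
for lhs, rhs in rules:
    lhs_by_rhs[rhs].add(lhs)

ambiguity = {rhs: len(labels) for rhs, labels in lhs_by_rhs.items()}
# ("ADJ::JJ", "N::NN") carries three alternative labelings here;
# a coarser label set would collapse these into fewer rules.
```

Collapsing labels so that, say, NP and NML (or NP and NX) share one category would merge the first three rules into a single rule with a single count.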
Most simply, the same right-hand side may be observed in rule extraction with a variety of left-hand-side labels, each leading to a unique rule in the grammar. The grammar may further contain many rules with the same structure and reordering pattern that differ only with respect to the actual labels in use. Together, these properties can cause an SCFG-based MT system to process a large number of alternative syntactic derivations that use different rules but produce identical output strings. Limiting the possible number of variant labelings cuts down on ambiguous derivations.

Second, a large label set leads to rule sparsity. A rule whose right-hand side can only apply on a very tightly specified set of labels is unlikely to be estimated reliably from a parallel corpus or to apply in all needed cases at test time. However, a coarser version of its application constraints may be more frequently observed in training data and more likely to apply on test data.

We therefore introduce a method for automatically clustering and collapsing category labels, on either one or both sides of SCFG rules, for any language pair and choice of statistical parsers (Section 3). Turning to alignments between source and target parse nodes as an additional source of information, we calculate a distance metric between any two labels in one language based on the difference in alignment probabilities to labels in the other language. We then apply a greedy label collapsing algorithm that repeatedly merges the two labels with the closest distance until some stopping criterion is reached. The resulting coarsened labels are used in the SCFG rules of a syntactic machine translation system in place of the original labels.

In experiments on Chinese–English translation (Section 4), we find significantly improved performance of up to 2.8 BLEU points, 5.2 TER points, and 0.9 METEOR points by applying varying degrees of label collapsing to a baseline syntax-based MT system (Section 5).
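The greedy collapsing loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes an L1 distance between each label's alignment-probability distribution over other-side labels and a fixed distance threshold as the stopping criterion; the paper's actual metric and criterion are detailed in its Section 3, and the toy probabilities below are invented.

```python
def l1_distance(p: dict, q: dict) -> float:
    """L1 distance between two probability distributions over labels."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def collapse_labels(align_probs: dict, threshold: float) -> dict:
    """Greedily merge the two closest labels until no pair is within threshold.

    align_probs: label -> {other-side label: P(other-side label | label)}
    Returns a mapping from each original label to its collapsed cluster name.
    """
    clusters = {lab: {lab} for lab in align_probs}   # cluster name -> members
    probs = {lab: dict(d) for lab, d in align_probs.items()}
    sizes = {lab: 1 for lab in align_probs}

    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        names = sorted(clusters)
        dist, a, b = min((l1_distance(probs[x], probs[y]), x, y)
                         for i, x in enumerate(names) for y in names[i + 1:])
        if dist > threshold:                         # stopping criterion
            break
        # Merge b into a, averaging distributions weighted by cluster size.
        total = sizes[a] + sizes[b]
        probs[a] = {k: (probs[a].get(k, 0.0) * sizes[a] +
                        probs[b].get(k, 0.0) * sizes[b]) / total
                    for k in set(probs[a]) | set(probs[b])}
        clusters[a] |= clusters.pop(b)
        probs.pop(b)
        sizes[a] = total
        sizes.pop(b)

    return {member: name for name, members in clusters.items()
            for member in members}

# Toy example: JJ and JJR align to the other side's labels very similarly,
# so they collapse into one coarse category, while NN stays separate.
probs = {
    "JJ":  {"ADJ": 0.90, "NN": 0.10},
    "JJR": {"ADJ": 0.85, "NN": 0.15},
    "NN":  {"N": 0.95, "ADJ": 0.05},
}
mapping = collapse_labels(probs, threshold=0.5)
```

Each merge replaces two labels with one cluster whose alignment distribution is the size-weighted average of its members, so later merge decisions see the coarsened label space rather than the original one.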
In our analysis of the results (Section 6), we find that the largest immediate effect of coarsening the label set is to reduce the number of fully abstract hierarchical SCFG rules present in the grammar. These rules' increased permissiveness, in turn, directs the decoder's search into a largely disjoint realm from the search space explored by the baseline system. A full summary and ideas for future work are given in Section 7.

2 Related Work

One example of modifying the SCFG nonterminal set is seen in the Syntax-Augmented MT (SAMT) system of Zollmann and Venugopal (2006). In SAMT rule extraction, rules whose left-hand sides correspond exactly to a target-side parse node t retain that label in the grammar. Additional nonterminal

