Automatic Category Label Coarsening for Syntax-Based Machine Translation



Proceedings of SSST-5, Fifth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 98–106, ACL HLT 2011, Portland, Oregon, USA, June 2011. ©2011 Association for Computational Linguistics

Greg Hanneman and Alon Lavie
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA 15213 USA
{ghannema, alavie}

Abstract

We consider SCFG-based MT systems that get syntactic category labels from parsing both the source and target sides of parallel training data. The resulting joint nonterminals often lead to needlessly large label sets that are not optimized for an MT scenario. This paper presents a method of iteratively coarsening a label set for a particular language pair and training corpus. We apply this label collapsing on Chinese–English and French–English grammars, obtaining test-set improvements of up to 2.8 BLEU, 5.2 TER, and 0.9 METEOR on Chinese–English translation. An analysis of label collapsing's effect on the grammar and the decoding process is also given.

1 Introduction

A common modeling choice among syntax-based statistical machine translation systems is the use of synchronous context-free grammar (SCFG), where a source-language string and a target-language string are produced simultaneously by applying a series of re-write rules. Given a parallel corpus that has been statistically word-aligned and annotated with constituency structure on one or both sides, SCFG models for MT can be learned via a variety of methods. Parsing may be applied on the source side (Liu et al., 2006), on the target side (Galley et al., 2004), or on both sides of the parallel corpus (Lavie et al., 2008; Zhechev and Way, 2008).

In any of these cases, using the raw label set from source- and/or target-side parsers can be undesirable. Label sets used in statistical parsers are usually inherited directly from monolingual treebank projects, where the inventory of category labels was designed by independent teams of human linguists. These label sets are not necessarily ideal for statistical parsing, let alone for bilingual syntax-based translation models.

Further, the side(s) on which syntax is represented defines the nonterminal label space used by the resulting SCFG. A pair of aligned adjectives, for example, may be labeled ADJ if only source-side syntax is used, JJ if only target-side syntax is used, or ADJ::JJ if syntax from both sides is used in the grammar. Beyond such differences, however, most existing SCFG-based MT systems do not further modify the nonterminal label set in use. Those that do require either specialized decoders or complicated parameter tuning, or the label set may be unsatisfactory from a computational point of view (Section 2).

We believe that representing both source-side and target-side syntax is important. Even assuming two monolingually perfect label sets for the source and target languages, using label information from only one side ignores any meaningful constraints expressed in the labels of the other. On the other hand, using the default node labels from both sides generates a joint nonterminal set of thousands of unique labels, ...
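
The joint labeling described in the introduction (ADJ::JJ for a pair of aligned adjectives) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `joint_label`, the list `aligned_nodes`, and every label pair other than ADJ and JJ are hypothetical and chosen only for illustration. The sketch shows how combining source-side and target-side labels can yield more distinct nonterminals than either side has alone.

```python
from collections import Counter


def joint_label(src_label: str, tgt_label: str) -> str:
    """Combine a source-side and a target-side label in the 'SRC::TGT' form."""
    return f"{src_label}::{tgt_label}"


# Hypothetical (source label, target label) pairs for aligned tree nodes.
# These values are illustrative assumptions, not data from the paper.
aligned_nodes = [
    ("ADJ", "JJ"),
    ("ADJ", "JJ"),
    ("ADJ", "JJR"),
    ("N", "NN"),
    ("N", "NNS"),
    ("NP", "NP"),
    ("NP", "NN"),
]

# Count how often each joint nonterminal occurs.
joint_counts = Counter(joint_label(s, t) for s, t in aligned_nodes)

# Distinct labels seen on each side alone.
source_labels = {s for s, _ in aligned_nodes}
target_labels = {t for _, t in aligned_nodes}

# The joint set can hold up to len(source_labels) * len(target_labels) labels.
max_joint = len(source_labels) * len(target_labels)

print(len(source_labels), len(target_labels), len(joint_counts), max_joint)
```

In this toy data the joint set is already larger than either monolingual set, and its upper bound is the product of the two set sizes, which is why real joint label sets can become very large.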
