Stanford CS 224 - Case Prediction for Morphologically Poor Languages - D2017603

School name Stanford University

Course CS 224N - Natural Language Processing with Deep Learning

1. Introduction
1.1. Motivation and related work
2. Methodology
2.1. Pipeline overview
2.2. Data
2.3. Tools
2.4. Implementation details
2.5. Data after training
3. MEMM Classifier
3.1. Theory and implementation
3.2. Baseline performance
4. Features
4.1. Orthographic features
4.2. Context features
4.3. POS tags
4.4. POS tags + Context
5. Results
5.1. Precision vs Recall
5.2. Evaluation soundness
5.3. Limitations
6. Conclusions and future work
7. References

Case Prediction for Morphologically Poor Languages

Anton [email protected]

In the context of machine translation, languages can take on implicit morphology. For instance, nouns in a morphologically poor language can be assigned implicit cases when being translated into a morphologically rich language. We investigated a technique for predicting cases for English nouns in the context of English-Czech translation. We used a parser trained on a Czech treebank to annotate the target side of a word-aligned parallel corpus. The cases were then projected from Czech onto English via the alignments. Finally, a maximum entropy classifier was trained to predict the cases of English nouns in unseen sentences. We evaluated our technique on a held-out set of word-aligned data, achieving 62.5 precision and 20.9 recall (F-score of 31.4). Extensive error analysis showed that contextual and POS features were the most useful for the classification task. Overall, we showed that one can produce a generalized case annotator for a morphologically poor language, given a morphologically rich one. The achieved precision suggests this technique might be useful for the task of machine translation.

1. Introduction

Statistical machine translation relies on large amounts of training data in order to produce adequate translation quality. Unfortunately, for morphologically rich target languages, data sparsity can often cause poor agreement within sentences.
Namely, the same word on the source side can be translated as several differently inflected words on the target side. It is therefore desirable to disambiguate the morphological information of the source prior to translation.

In this paper, we attempt to enrich the morphologically poor source text by annotating its nouns with implicit cases, given a morphologically rich target language. Throughout the paper, we work with an English-Czech system.

1.1. Motivation and related work

To get a better idea of the impact of errors caused by poor morphological choices, consider the error classification for the English-Greek system of Avramidis and Koehn (2008), shown in Figure 1.

Figure 1. Classification of errors on an English-Greek system as suggested by Vilar et al. (2006)

Note that over 40% of errors are caused by an incorrect word form. We chose to focus on a particular subset of this problem: noun cases. Although this is an English-Greek system, it is very likely that these problems persist with other morphologically rich languages as well.

Our task is to predict cases of nouns in unseen English sentences. Although a lot of work has been done on source annotation, most of the efforts, such as Avramidis and Koehn (2008) and Kruijff-Korbayová et al. (2006), have focused on rule-based or manual methods. Although such methods achieve good accuracy and high annotation coverage, they require substantial knowledge of the target language, and therefore do not scale well with the increasing number of languages demanded from statistical machine translation. Kruijff-Korbayová et al. (2006) substitute projection for manual rule generation, but still use manually aligned data, which is not immediately scalable.

Our goal is to produce a method of annotating English nouns with cases, given Czech as a target language. We would like this method to be target-language-agnostic, to ensure scalability, and to involve no manual or rule-based annotations.
Such an annotator can then be useful for machine translation via the factored model described in Avramidis and Koehn (2008) and Hoang and Koehn (2007).

2. Methodology

In this section we review the end-to-end pipeline for training a noun case annotator for English sentences. We also discuss the data and tools involved in building this pipeline.

2.1. Pipeline overview

Since English does not have noun cases, we must project this information from Czech. Thus, we begin with a large set of English-Czech parallel data. First, we perform word alignment on the sentence level. Then, we annotate the Czech side with morphological information by employing a parser trained on treebank data. Finally, we follow the alignment links from Czech to English, projecting the cases from Czech nouns onto English nouns. To ensure that projection only happens when nouns align to nouns, we also annotate the English side with a POS tagger and check that this condition holds prior to projection.

Once the source nouns have been annotated, we can train a Maximum Entropy Markov Model (MEMM) classifier. The success of the classification task will largely depend on the set of features we choose to annotate our training data with. Tuning of the classifier and subsequent testing should be done on two separate held-out sets of case-annotated source sentences.

2.2. Data

We obtained 53K sentences of parallel English-Czech text from the Prague Czech-English Dependency Treebank, Čmejrek (2004). The text was already sentence-aligned: each source sentence was aligned to one, and only one, target sentence.

The Czech morphological annotator was trained on Prague Dependency Treebank 2.0 (PDT 2.0) data. This suite came with a whole ensemble of useful annotators, parsers, and preprocessors.

2.3. Tools

As mentioned in the previous section, the morphological annotator was supplied with the PDT 2.0 data. Word alignment was performed with GIZA++, Och and Ney (2000). Part-of-speech tagging of the source (English) side was performed with Stanford's POS tagger, Toutanova et al. (2000, 2003).

2.4. Implementation
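
As an illustration of the projection step from Section 2.1, the following is a minimal Python sketch, not the implementation used in this work. The function name `project_cases`, the input layout (parallel lists plus a list of `(english_index, czech_index)` alignment links), the case labels, and the first-link tie-breaking rule are all assumptions made for the example; only the noun-to-noun check comes from the text.

```python
from typing import Dict, List, Optional, Tuple

# Penn Treebank noun tags, as an English POS tagger would typically emit them.
ENGLISH_NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def project_cases(
    english_pos: List[str],
    czech_is_noun: List[bool],
    czech_case: List[Optional[str]],
    alignment: List[Tuple[int, int]],
) -> Dict[int, str]:
    """Project Czech noun cases onto aligned English nouns (hypothetical helper).

    A case is projected only when both ends of an alignment link are nouns
    and the Czech noun carries a case label.
    """
    projected: Dict[int, str] = {}
    for en_idx, cs_idx in alignment:
        # Noun-to-noun condition, English side.
        if english_pos[en_idx] not in ENGLISH_NOUN_TAGS:
            continue
        # Noun-to-noun condition, Czech side, plus a usable case label.
        if not czech_is_noun[cs_idx] or czech_case[cs_idx] is None:
            continue
        # Assumption: if several links reach the same English noun, keep the first.
        projected.setdefault(en_idx, czech_case[cs_idx])
    return projected


# Toy sentence pair: "The dog sees the cat" / "Pes vidí kočku".
english_pos = ["DT", "NN", "VBZ", "DT", "NN"]
czech_is_noun = [True, False, True]
czech_case = ["nom", None, "acc"]
alignment = [(1, 0), (2, 1), (4, 2)]

cases = project_cases(english_pos, czech_is_noun, czech_case, alignment)
```

In this toy example only "dog" and "cat" receive a case, because the verb link fails the noun-to-noun check and the articles have no alignment links at all. English nouns left without a projected case simply stay unannotated.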
