Stanford CS 224 - Case Prediction for Morphologically Poor Languages

This preview shows page 1-2 out of 7 pages.


1. Introduction
1.1. Motivation and related work
2. Methodology
2.1. Pipeline overview
2.2. Data
2.3. Tools
2.4. Implementation details
2.5. Data after training
3. MEMM Classifier
3.1. Theory and implementation
3.2. Baseline performance
4. Features
4.1. Orthographic features
4.2. Context features
4.3. POS tags
4.4. POS tags + Context
5. Results
5.1. Precision vs Recall
5.2. Evaluation soundness
5.3. Limitations
6. Conclusions and future work
7. References

Case Prediction for Morphologically Poor Languages

Anton [email protected]

In the context of machine translation, languages can take on implicit morphology. For instance, nouns in a morphologically poor language can be assigned implicit cases when being translated into a morphologically rich language. We investigated a technique for predicting cases of English nouns in the context of English-Czech translation. We used a parser trained on a Czech treebank to annotate the target side of a word-aligned parallel corpus. The cases were then projected from Czech onto English via the alignments. Finally, a maximum entropy classifier was trained to predict the cases of English nouns in unseen sentences. We evaluated our technique on a held-out set of word-aligned data, achieving a precision of 62.5 and a recall of 20.9 (F-score of 31.4). Extensive error analysis showed that contextual and POS features were the most useful for the classification task. Overall, we showed that one can produce a generalized case annotator for a morphologically poor language, given a morphologically rich one. The achieved precision suggests that this technique might be useful for machine translation.

1. Introduction

Statistical machine translation relies on large amounts of training data in order to produce adequate translation quality. Unfortunately, for morphologically rich target languages, data sparsity can often cause poor agreement within sentences.
Namely, the same word on the source side can be translated as several differently inflected words on the target side. It is therefore desirable to disambiguate the morphological information of the source prior to translation.

In this paper, we attempt to enrich the morphologically poor source text by annotating its nouns with implicit cases, given a morphologically rich target language. Throughout the paper, we work with an English-Czech system.

1.1. Motivation and related work

To get a better idea of the impact of errors caused by poor morphological choices, consider the error classification for the English-Greek system of Avramidis and Koehn (2008), shown in Figure 1.

Figure 1. Classification of errors on an English-Greek system as suggested by Vilar et al. (2006)

Note that over 40% of errors are caused by an incorrect word form. We chose to focus on a particular subset of this problem: noun cases. Although this is an English-Greek system, it is very likely that these problems persist with other morphologically rich languages as well.

Our task is to predict the cases of nouns in unseen English sentences. Although much work has been done on source annotation, most efforts, such as Avramidis and Koehn (2008) and Krujiffova et al. (2006), have focused on rule-based or manual methods. Such methods achieve good accuracy and high annotation coverage, but they require substantial knowledge of the target language and therefore do not scale well with the increasing number of languages demanded of statistical machine translation. Krujiffova et al. (2006) substitute projection for manual rule generation, but still use manually aligned data, which is not immediately scalable.

Our goal is to produce a method of annotating English nouns with cases, given Czech as a target language. We would like this method to be target-language-agnostic, to ensure scalability, and to involve no manual or rule-based annotation.
Such an annotator can then be useful for machine translation via the factored model described in Avramidis and Koehn (2008) and Hoang and Koehn (2007).

2. Methodology

In this section we review the end-to-end pipeline for training a noun case annotator for English sentences. We also discuss the data and tools involved in building this pipeline.

2.1. Pipeline overview

Since English does not have noun cases, we must project this information from Czech. Thus, we begin with a large set of English-Czech parallel data. First, we perform word alignment at the sentence level. Then, we annotate the Czech side with morphological information by employing a parser trained on treebank data. Finally, we follow the alignment links from Czech to English, projecting the cases from Czech nouns onto English nouns. To ensure that projection only happens when nouns align to nouns, we also annotate the English side with a POS tagger and check that this condition holds prior to projection.

Once the source nouns have been annotated, we can train a Maximum Entropy Markov Model (MEMM) classifier. The success of the classification task will largely depend on the set of features we choose to annotate our training data with. Tuning of the classifier and subsequent testing should be done on two separate held-out sets of case-annotated source sentences.

2.2. Data

We obtained 53K sentences of parallel English-Czech text from the Prague English-Czech Dependency Treebank, Čmejrek (2004). The text was already sentence-aligned: each source sentence was aligned to one, and only one, target sentence.

The Czech morphological annotator was trained on Prague Dependency Treebank 2.0 (PDT 2.0) data. This suite came with a whole ensemble of useful annotators, parsers, and preprocessors.

2.3. Tools

As mentioned in the previous section, the morphological annotator was supplied with the PDT 2.0 data.

Word alignment was performed with GIZA++, Och and Ney (2000).

Source (English) side part-of-speech tagging was performed with Stanford's POS tagger, Toutanova et al. (2000, 2003).

2.4. Implementation
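
The projection step outlined in Section 2.1 can be illustrated with a short sketch. It is not the code used in this work: the function name `project_cases`, the case labels ("nom", "acc"), the single-letter Czech POS tag "N", the rule of discarding conflicting projections, and the toy sentence pair are all assumptions made for illustration. The only parts taken from the text are the idea of following alignment links and the check that a noun aligns to a noun. English tags are assumed to follow the Penn Treebank convention, in which noun tags start with "NN".

```python
from typing import Dict, List, Optional, Tuple


def project_cases(
    en_pos: List[str],
    cz_pos: List[str],
    cz_case: List[Optional[str]],
    alignment: List[Tuple[int, int]],
) -> Dict[int, str]:
    """Project cases from Czech nouns onto aligned English nouns.

    alignment holds (english_index, czech_index) links.
    Returns a mapping {english_index: case}.
    """
    projected: Dict[int, str] = {}
    conflicting = set()
    for en_i, cz_j in alignment:
        # Project only along noun-to-noun links (the check from Section 2.1).
        if not en_pos[en_i].startswith("NN"):
            continue
        if cz_pos[cz_j] != "N" or cz_case[cz_j] is None:
            continue
        case = cz_case[cz_j]
        # Assumption: if one English noun receives two different cases,
        # the projection is treated as unreliable and dropped.
        if en_i in projected and projected[en_i] != case:
            conflicting.add(en_i)
        else:
            projected[en_i] = case
    for en_i in conflicting:
        projected.pop(en_i, None)
    return projected


# Hypothetical toy pair: "The boy sees a dog" / "Chlapec vidí psa".
en_pos = ["DT", "NN", "VBZ", "DT", "NN"]
cz_pos = ["N", "V", "N"]
cz_case = ["nom", None, "acc"]
alignment = [(1, 0), (2, 1), (4, 2)]

print(project_cases(en_pos, cz_pos, cz_case, alignment))
```

In this toy example, "boy" (index 1) would receive the nominative from "Chlapec" and "dog" (index 4) the accusative from "psa", while the verb link is skipped. English nouns that end up with a projected case would then serve as labeled training examples for the classifier.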

