Experiments with a Noun-Phrase driven Statistical Machine Translation System

Sanjika Hewavitharana, Alon Lavie and Stephan Vogel
Language Technologies Institute
Carnegie Mellon University
5000 Forbes Avenue, Pittsburgh PA 15213 USA
{sanjika,alavie+,vogel+}@cs.cmu.edu

Abstract

This paper presents a noun phrase driven two-level statistical machine translation system. Noun phrases (NPs) are used as the unit of decomposition to build a two-level hierarchy of phrases. English noun phrases are identified using a parser. The corresponding translations are induced using a statistical word alignment model. Identified noun phrase pairs in the training corpus are replaced with a tag to produce an NP-tagged corpus. This corpus is then used to extract phrase translation pairs. Both NP translations and NP-tagged phrases are used in a two-level translation decoder: NP translations tag NPs in the first level, and NP-tagged phrases match across NPs to produce translations in the second level. The two-level system shows significant improvements over a baseline SMT system. It also produces longer matching phrases due to the generalization introduced by tagging NPs.

1. Introduction

When using statistical machine translation (SMT) systems, we often notice that the phrases used to construct the translations are rather short. On average these phrases are less than two words long. This is in spite of the fact that some phrase extraction methods allow the extraction of arbitrarily long phrases. The main reason for this behavior is data sparseness: long exact matching phrases are relatively rare in the training data. In the decoder, these phrases have to compete with abundant shorter phrases. For this reason, Koehn et al. (2003) find that phrases longer than three words give little performance improvement.
However, with the limited reordering strategies used in most statistical machine translation systems, a combination of short phrases does not always generate the desired translation. Zhang (2005) shows improved translation performance by using phrases of arbitrary length. Hierarchical models, such as the Hiero system (Chiang, 2005), which use phrases containing words as well as subphrases, have shown better performance than standard phrase-based systems.

In this paper, we investigate a simplified two-level machine translation system that uses a linguistically motivated phrase decomposition. We think noun phrases (NPs) are good candidates for a hierarchical system. Semantically, noun phrases describe objects and concepts using one or more nouns and adjectives. The vast majority of words in a language are nouns and hence NPs appear frequently in sentences. Noun phrases can often be translated independently into other languages, irrespective of the context in which they appear. They tend to appear as coherent units in many languages. When using NPs as the unit of decomposition, we force each NP to be translated as an NP in the target language. Although this might not always be the best choice, as Koehn (2003) shows, it is not a harmful restriction.

We integrate the two-level phrases into a phrase-based SMT decoder with minor modifications. It involves the following steps:

• Identify NPs on both sides of the parallel training corpus and generate an NP translation table
• Tag NPs in the training corpus by replacing them with a special tag “@NP”
• Extract phrase translation pairs from the NP-tagged corpus
• Use the extracted NP translation table and NP-tagged phrases in a two-level decoder to translate new sentences.

In the next section we describe each of the above steps in more detail. We then present the experimental results and our conclusions.

2. System

We build a translation system that translates Arabic text into English.
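The tagging step listed in the introduction can be illustrated with a minimal sketch. It assumes that NP boundaries are already available as inclusive token spans; the function name `tag_noun_phrases`, the span representation, and the example sentence are hypothetical and are not taken from the system described here. The same operation would be applied to the other side of the corpus using the corresponding NP spans.

```python
from typing import List, Tuple

NP_TAG = "@NP"  # the special tag named in the step list


def tag_noun_phrases(tokens: List[str], np_spans: List[Tuple[int, int]]) -> List[str]:
    """Replace each inclusive (start, end) token span with a single @NP tag.

    Assumption: spans do not overlap (base NPs contain no embedded NPs).
    """
    tagged: List[str] = []
    position = 0  # index of the next token not yet copied
    for start, end in sorted(np_spans):
        if start < position:
            raise ValueError("NP spans must not overlap")
        tagged.extend(tokens[position:start])  # copy tokens before the NP
        tagged.append(NP_TAG)                  # collapse the NP into one tag
        position = end + 1
    tagged.extend(tokens[position:])           # copy tokens after the last NP
    return tagged


# Illustrative sentence with two base NPs at token spans 0..2 and 4..6.
sentence = "the prime minister visited the flooded region yesterday".split()
tagged = tag_noun_phrases(sentence, [(0, 2), (4, 6)])
print(" ".join(tagged))  # @NP visited @NP yesterday
```

Collapsing every NP into the same tag is what lets a phrase such as "@NP visited @NP" match many sentences that differ only in their noun phrases.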
Our training data consists of parallel corpora, primarily of the newswire genre, available from LDC. Table 1 shows the statistics for the training data after it was pre-processed and the English side lower-cased.

2.1 Generating NP translations

The first step is to identify noun phrases in the training corpus. Essentially, we want to identify corresponding NP pairs on the Arabic and English sides, and build an NP translation table. To achieve this we use a parser to extract NPs from one side of the text and a word-alignment model to induce the corresponding NPs on the other side of the parallel text.

As our system translates Arabic text into English, it would be logical to start with Arabic: parsing the Arabic side and extracting the corresponding English NP translations. However, the available Arabic parsers did not produce the desired accuracy. Therefore we use Charniak’s parser (Charniak, 2000) to parse the English side of the training data. From the resulting parse trees we extract base NPs, i.e. NPs that do not contain other NPs embedded in them. As mentioned in the previous section, these NPs are fairly short and are good candidates for a hierarchical system.

              Arabic    English
Sentences     135K      135K
Tokens        3.5M      4.3M
Vocabulary    145K      63K

Table 1: Training data statistics

We generate IBM model 3 (Brown et al., 1993) alignments by running GIZA++ (Och and Ney, 2003) on the parallel text. GIZA++ training is done in both directions and the word alignments are generated by the intersection of the two. For each English NP, we search the aligned corpus for sentences that contain the NP and read off the alignment as its translation. To compensate for alignment errors we also include partial alignments as follows: we find the maximum (max) and minimum (min) Arabic word indices that are aligned to the words in the English NP. All the Arabic words between min and max are considered to be the translation of the English NP.
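The intersection and min/max projection described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: alignments are represented as sets of index pairs, the names `intersect_alignments` and `project_np` are hypothetical, and the alignment links in the example are invented for demonstration.

```python
from typing import Optional, Set, Tuple

# A link is (english_index, arabic_index); this representation is an assumption.
Alignment = Set[Tuple[int, int]]


def intersect_alignments(en_to_ar: Alignment, ar_to_en: Set[Tuple[int, int]]) -> Alignment:
    """Keep only the links proposed by both directional alignments.

    ar_to_en holds (arabic_index, english_index) pairs, so each pair is
    flipped before the two sets are intersected.
    """
    flipped = {(e, a) for (a, e) in ar_to_en}
    return en_to_ar & flipped


def project_np(np_start: int, np_end: int, links: Alignment) -> Optional[Tuple[int, int]]:
    """Return the inclusive (min, max) Arabic span for English tokens np_start..np_end.

    Returns None when no word of the NP is aligned, which corresponds to an
    English NP for which no translation is found.
    """
    targets = [a for (e, a) in links if np_start <= e <= np_end]
    if not targets:
        return None
    return min(targets), max(targets)


# Invented example: 4 English tokens, 5 Arabic tokens.
en_to_ar = {(0, 1), (1, 3), (2, 1), (3, 0)}
ar_to_en = {(0, 3), (1, 2), (3, 1), (4, 0)}
links = intersect_alignments(en_to_ar, ar_to_en)

# English NP covering tokens 0..2: its aligned Arabic indices are 1 and 3,
# so the span 1..3 is taken, including the unaligned Arabic word at index 2.
span = project_np(0, 2, links)
```

Taking the whole min..max span means that Arabic words dropped by the strict intersection are still included in the NP translation, which is the compensation for alignment errors mentioned above.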
We filter out unbalanced NP translation pairs by removing entries whose length ratio (# Arabic words / # English words) exceeds two. Table 2 shows the details of the extracted NPs. As seen in the table, translations for some English NPs are not found due to alignment errors.

English NPs           325K
Translations found    260K
After filtering       205K

Table 2: Extracted NP statistics

A sample of the extracted NP translation table is shown in Figure 1. Each line contains an Arabic NP and its English equivalent separated by a hash mark. An NP in the table may contain multiple translation alternatives. Table 3 gives the length distribution of extracted NP translations. This was calculated for the Arabic and English sides independently. The average length of an Arabic NP is 2 words while the average

