Look-up Based Greedy Decoding for Machine Translation
Tony Zhang and Steven Bills
CS 224N final project write-up
June 7, 2008

For our final project we implemented a machine translation model, using our own greedy decoder and a dictionary created from online lexical resources. We were motivated by the dreadful translations produced by our IBM Model 2. After seeing its output, we speculated that a simple, word-for-word replacement translation model, supplemented with appropriate enhancements, would outperform it significantly. We began this project with the intention of building a translation model from scratch so that we could implement and experiment with our own ideas for translation. We quickly realized that by implementing a simple greedy decoder, we would be able to take advantage of the language and alignment models we had written for previous assignments. Our translation model converts English to French, but our methodology could easily be extended to other language pairs, provided the appropriate lexical resources are available.

A seminal paper on greedy decoders is (Germann et al., 2001). These decoders managed to produce near-optimal translations much more quickly than existing priority-queue and stack-based decoders. Germann’s original greedy decoder was designed to work with the IBM Model 4 translation model, but it can easily be extended to work with other models. (Turitzin, 2005) adapted the decoder to work with IBM Models 1 and 2, and we adapted it to work with our own translation model. Our decoder uses different (and simpler) mutations than Germann’s and Turitzin’s; in particular, we do not keep track of alignments, only sentences.

The major difference between our translation system and others is the sources we use for our translation probabilities. Rather than constructing these probabilities statistically from a corpus of parallel text, we construct them from freely available online lexical resources, using an approach similar to that found in (Sagot and Fišer, 2008). We also include various forms of pre- and post-processing, such as POS-based stemming and reordering. We compare the performance of our translation system with IBM Models 1 and 2 and with Google’s online translation service.

Architecture

We decided to write a simple greedy decoder, which ended up being the workhorse of our code. It is constructed with a language model, a part-of-speech tagger, a dictionary and an aligner. To translate an English sentence, the getBestTranslation method is called. This method generates an initial translation for the sentence, then generates a set of “mutations” for the sentence. If one of these mutations is “better” than the current sentence, the current sentence is replaced by it. This process is repeated until there are no longer any mutations which improve the sentence.

The initial translation for the sentence is generated as follows. First, the sentence is chunked. The chunking works by looking at each sequence of words in the sentence and checking whether that sequence has an entry in the “dictionary” (discussed below); if so, the words are chunked into one “word”. Basic reordering is then performed on the resulting sentence using rules based on part-of-speech tags, which are generated by the part-of-speech tagger. This tagger is a very simple MLE model which tags each word with whichever tag it carried most often in the Brown corpus. We then stem nouns and verbs.
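Before the remaining steps are described in detail, the search loop itself can be summarized with a rough sketch (illustrative Python with stand-in helper functions, not the actual implementation; the gloss, the mutations and the scoring it relies on are described in the rest of this section):

    def greedy_decode(source, initial_translation, generate_mutations, score):
        """Hill-climbing search over candidate translations.

        The three function arguments are stand-ins for the components described
        in this write-up: the dictionary-based gloss, the mutation generator,
        and the composite (language model x alignment) score.
        """
        current = initial_translation(source)
        current_score = score(source, current)
        while True:
            # Score every one-step mutation of the current sentence.
            candidates = [(score(source, m), m) for m in generate_mutations(current)]
            if not candidates:
                return current
            best_score, best = max(candidates, key=lambda pair: pair[0])
            # Stop once no mutation improves on the current translation.
            if best_score <= current_score:
                return current
            current, current_score = best, best_score

This sketch takes the best improving mutation in each round; the actual decoder may differ in such details.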
The major reason for stemming is that our dictionary rarely has entries for inflected forms of words. For example, “keyboard” is in the dictionary but “keyboards” is not. Stemming “keyboards” to “keyboard” allows us to translate this word. We stem using a very simple algorithm that removes the 's' from plurals and the 'd' from -ed verb forms like “created”.

We then use our dictionary to perform substitutions for each word in the processed sentence. The dictionary is a map from English words to sets of French translations with associated probabilities. For the initial translation we replace each English word with its “best” (most probable) translation. Words that do not appear in the dictionary are translated as themselves, so that “NLP” would be translated as “NLP”.

To judge how good a translation is, we use a composite score based on several factors. We calculate the language model probability of the sentence, using the third-order Witten-Bell language model we wrote for PA1. We multiply this by an alignment probability, which is simply the probability of the best alignment between the two sentences. This product is the score for the sentence.

We generate mutations in the following manner. Each mutation is a slight modification to the current French sentence. We use several sorts of mutation, similar to those discussed in (Turitzin, 2005). They are summarized in the table below.

Mutation        Description                             Example
Insertion       Inserts a function word                 “J’aime fromage” → “J’aime le fromage”
Retranslation   Replaces one translation with another   “J’ai vu le pion” → “J’ai vu le homme”
Swap            Reverses two adjacent words             “le session premier” → “le premier session”
Deletion        Deletes a word                          “pour à faire” → “pour faire”

Mutations are generated until no mutation better than the current translation can be found. At this point, some basic post-processing is applied. Contractions of the form “de le” → “du” are applied. Finally, repeated words are deleted, since they are correct in only a very small number of situations.

Design decisions

Motivation for building the dictionary

The beginning of the translation process is the creation of a French gloss of the English sentence. The way the greedy decoder in PA2 does this (when translating from French to English) is, for each French word f, to find the English word e that maximizes P(e | f), where the P(e | f) values have been computed with expectation maximization. Here, translating from English to French, we instead want the French word f that maximizes P(f | e) for a given English word e. However, this method of obtaining a gloss can give poor results. We trained our IBM Model 1 on 1653 sentences, then asked for the most likely translations of some English words. As you can see from the list of translations and corresponding probabilities below, some of these translations are quite poor. The best translation is bolded, if present.
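For reference, the gloss step can be sketched in the same illustrative style (the dictionary here is assumed to map English words to lists of (French word, probability) pairs, and the stemming shown is the naive strip-'s'/strip-'d' rule described earlier; none of these names are taken from the actual code):

    def naive_stem(word):
        # Naive stemming as described above: drop the 'd' of -ed verb forms
        # ("created" -> "create") and the plural 's' ("keyboards" -> "keyboard").
        if word.endswith("ed") and len(word) > 3:
            return word[:-1]
        if word.endswith("s") and len(word) > 3:
            return word[:-1]
        return word

    def gloss(english_tokens, dictionary):
        """Word-for-word gloss used as the initial translation.

        `dictionary` maps an English word to a list of (french_word, probability)
        pairs; words with no entry (e.g. "NLP") translate as themselves. The
        separate stemming pass is folded into the lookup here for brevity.
        """
        french = []
        for word in english_tokens:
            entry = dictionary.get(word) or dictionary.get(naive_stem(word))
            if entry:
                best_translation, _ = max(entry, key=lambda pair: pair[1])
                french.append(best_translation)
            else:
                french.append(word)
        return french

    # Example: gloss(["I", "like", "keyboards"],
    #                {"like": [("aime", 0.7)], "keyboard": [("clavier", 0.9)]})
    # returns ["I", "aime", "clavier"].

A gloss of this kind plays the role of the initial translation in the search loop sketched earlier; the retranslation mutations and the language model are then responsible for repairing its worst word choices.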

