The (Non-)Utility of Structural Analysis in Statistical MT

Anubha Kothari, Uriel Cohen Priva, Hal Tily

September 21, 2007

1 Task

Modern statistical machine translation systems operationalise the process of finding a translation for a given input sentence by defining a probability distribution over all possible target sentences. A good translation model will assign a large portion of the probability mass to sentences which are good translations of the input. When asked to translate a French sentence f into English, the system will typically choose the English sentence e which maximises P(e|f) under the distribution defined by the probabilistic model.

However, state-of-the-art systems make gross simplifications in order to make this probability distribution computable in practice. As a result, although the distribution over target sentences approximates translation quality, in practice better translations may be assigned lower probabilities than poorer ones. This issue could perhaps be avoided by throwing away some of these simplifying assumptions and using more sophisticated techniques to determine the distribution, but that would make it intractable to work out which sentences maximise the distribution.

One viable way of augmenting a tractable generative translation model with more intelligent, computationally intensive techniques is simply to confine the latter to a separate post-processing stage, applying it only to sentences which have been assigned a relatively high probability by the simpler model. Rather than just taking the most probable sentence under the translation model, we can take the top 1000 most probable translations, say, and use a more sophisticated analysis to select the best translation from that smaller set.

We will use this technique to test whether a language model that incorporates structural information is useful in determining the best translation. All state-of-the-art machine translation systems use a language model to assess the quality of the output sentence as a sentence of the target language. This is then one factor in determining the probability of that sentence as a good translation of the original. Typically, such language models use only ngram statistics, taking into account local cooccurrence relationships between words but nothing more. Such models assign a high proportion of their probability mass to entirely ungrammatical strings; for example, the fragment "I can opener" would be assigned a relatively high probability under an English bigram model due to the relatively high probabilities of its local components, "I can" and "can opener". Although high-quality statistical parsers can assign a probability value to a sentence that evaluates its global syntactic consistency, doing so is relatively computationally intensive.
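To make the "I can opener" point concrete, here is a minimal, self-contained sketch of how a bigram model scores that fragment from local counts alone. The toy corpus, the add-one smoothing, and all resulting numbers are purely illustrative assumptions of ours, not material from the report.

    # Toy illustration of why a bigram language model can favour the
    # ungrammatical fragment "I can opener": each local bigram ("I can",
    # "can opener") is individually frequent, so their product is large.
    # The corpus and smoothing choices below are hypothetical.
    from collections import Counter
    import math

    corpus = [
        "i can see the house",
        "i can help you",
        "the can opener is broken",
        "she bought a can opener",
    ]

    unigrams = Counter()
    bigrams = Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    vocab_size = len(unigrams)

    def bigram_logprob(sentence):
        """Add-one-smoothed bigram log probability of a tokenised sentence."""
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        logp = 0.0
        for prev, word in zip(tokens, tokens[1:]):
            logp += math.log((bigrams[(prev, word)] + 1) /
                             (unigrams[prev] + vocab_size))
        return logp

    # The ungrammatical fragment is built from high-frequency bigrams,
    # so it receives a comparatively good score despite being ill-formed.
    for s in ["i can opener", "i can see", "opener can i"]:
        print(f"{s!r}: {bigram_logprob(s):.2f}")

Under these toy counts the ungrammatical fragment scores at least as well as the grammatical "i can see", while a reordering built from unseen bigrams scores far lower: the model is sensitive only to local cooccurrence, which is exactly the blindness a parse-based score is meant to address.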
Instead of trying to replace the ngram language model with a parser, we propose to find the 1000 best translations for each sentence in our dataset using a widely available translation system which includes an ngram language model, and then post-filter that set using a discriminative model based on the parser's output to determine the best translation.

2 Tools and data

The main tools we used were moses and GIZA++: open source machine translation packages that can be used to create all the different intermediate components of a machine translation solution of the kind presented in class:

• A phrase-by-phrase table: GIZA++ creates, and moses supports, phrase-for-phrase translation rather than just word-for-word translation. This means that a word or a number of words can be translated to a word or a number of words.

• A language model: GIZA++ comes with its own language model tool, or can work with other standard language models such as SRILM.

• A decoder.

• A procedure that finds optimal values for various sub-components of the decoder. These include:
  – The weight assigned to the language model,
  – The phrase-for-phrase score,
  – Distortion and length penalties: translations in which phrases end up away from their original position can be penalized, as can translations that are too short.

• Scripts to produce BLEU and NIST scores for the produced sentences.

Though we wished to run the entire process ourselves (build a phrase table, adjust the weights, and train a language model), we realized that it would take us several days to accomplish this, and that we could instead use the ready-made intermediate products provided for the participants of the ACL 2005 machine translation shared task (http://www.statmt.org/wpt05/mt-shared-task/). These included:

• A naive initialization file that we converted from pharaoh (a non-open-source MT package) to moses.

• A French/English phrase table.

• A corpus of 2000 test sentences that we split into a training set and a validation set.

• A ready-made English language model.

For our parse probability measure, we used the freely downloadable Stanford Parser. After reconfiguring Moses to output the best 1000 translations for each sentence in the test corpus, we saved these translations and fed them into the parser. This gave us a (log) probability value for each.
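As a concrete companion to this step, here is a minimal sketch of collecting the candidate translations from an n-best list. It assumes a Moses-style layout of one candidate per line, "segment-id ||| hypothesis ||| feature scores ||| total score"; the file name and the deduplication policy are our own illustrative choices, not details taken from the report.

    # Minimal sketch: read an n-best list (assumed "id ||| hypothesis |||
    # features ||| score" layout, one candidate per line) and collect the
    # distinct hypotheses for each source sentence. File name is hypothetical.
    from collections import defaultdict

    def read_nbest(path):
        """Return {segment_id: [(hypothesis, total_score), ...]} with duplicates removed."""
        candidates = defaultdict(list)
        seen = defaultdict(set)
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = [p.strip() for p in line.split("|||")]
                if len(parts) < 4:
                    continue  # skip malformed lines
                seg_id, hypothesis, _features, score = parts[0], parts[1], parts[2], parts[3]
                if hypothesis in seen[seg_id]:
                    continue  # keep only distinct surface strings
                seen[seg_id].add(hypothesis)
                candidates[seg_id].append((hypothesis, float(score)))
        return candidates

    if __name__ == "__main__":
        nbest = read_nbest("nbest1000.txt")  # hypothetical output file
        for seg_id, hyps in sorted(nbest.items(), key=lambda kv: int(kv[0])):
            print(f"sentence {seg_id}: {len(hyps)} distinct candidates")

Each distinct hypothesis can then be written out one sentence per line and handed to the parser to obtain the parse log probability used in the next section.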
3 Model fitting

Having collected the 1000 best translations (filtered for duplicates, etc.) for each of our 500 test sentences, we then evaluated the usefulness of various probabilistic scores in picking out the "best" translation for a given sentence (the translation with the highest NIST score). The scores we had at our disposal were as follows, calculated for each translation:

• moses: Moses provides an overall moses score which is itself a weighted linear combination of four individual costs: a phrase-by-phrase translation cost, a language model cost, a distortion cost, and a word cost (to penalize translations that are too short). The weights can be tuned on a per-corpus basis, but we used an untuned version of the moses score. To correct for this, some of our models included the next three component scores, which were also available to us in the output.

• d: The distortion cost used in calculating the translation's overall moses score.

• lm: The language model cost used in calculating moses, a log probability score.

• w: The word penalty used in calculating moses.

• parse: The log probability of the best parse of the translation, as given by the Stanford Parser.

• unigramScore: A component of the above parse probability, it is a log probability score based on the unigram probabilities of words in the translation.

• length: Though not a
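To make the model-fitting setup concrete, here is a minimal sketch of reranking candidates with a weighted linear combination of per-candidate scores. The feature names mirror those listed above (moses, d, lm, w, parse, unigramScore), but the weights, the example candidates, and the selection code are purely illustrative assumptions, not the fitted model from the report.

    # Illustrative reranker: score each candidate translation as a weighted
    # linear combination of its per-candidate features and keep the argmax.
    # The weights and example candidates below are hypothetical; in the report
    # the usefulness of the features is judged against NIST scores instead.

    # Hypothetical weights for each feature.
    WEIGHTS = {
        "moses": 1.0,
        "d": 0.1,
        "lm": 0.3,
        "w": 0.1,
        "parse": 0.5,
        "unigramScore": -0.2,
    }

    def combined_score(features):
        """Weighted linear combination of whatever features a candidate has."""
        return sum(WEIGHTS.get(name, 0.0) * value for name, value in features.items())

    def rerank(candidates):
        """Pick the (text, features) pair with the highest combined score."""
        return max(candidates, key=lambda cand: combined_score(cand[1]))

    # Two made-up candidates for one source sentence.
    candidates = [
        ("the house is green", {"moses": -12.3, "lm": -20.1, "parse": -35.2, "unigramScore": -40.0}),
        ("green the house is", {"moses": -11.9, "lm": -22.7, "parse": -48.6, "unigramScore": -40.0}),
    ]

    best_text, best_features = rerank(candidates)
    print("selected:", best_text, "score:", round(combined_score(best_features), 2))

With these made-up numbers the parse term outweighs the scrambled candidate's slightly better moses score, so the grammatical candidate is selected; whether the parse-based scores actually help in this way is precisely what the model fitting is meant to test.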

