----------------------------------------------------------------------------------
A Statistical MT Tutorial Workbook
Kevin Knight
prepared in connection with the JHU summer workshop
April 30, 1999
----------------------------------------------------------------------------------

1. The Overall Plan

We want to automatically analyze existing human sentence translations, with an eye toward building general translation rules -- we will use these rules to translate new texts automatically.

I know this looks like a thick workbook, but if you take a day to work through it, you will know almost as much about statistical machine translation as anybody!

The basic text that this tutorial relies on is Brown et al., "The Mathematics of Statistical Machine Translation", Computational Linguistics, 1993. On top of this excellent presentation, I can only add some perspective and perhaps some sympathy for the poor reader, who has (after all) done nothing wrong.

Important terms are underlined throughout!

----------------------------------------------------------------------------------

2. Basic Probability

We're going to consider that an English sentence e may translate into any French sentence f. Some translations are just more likely than others. Here are the basic notations we'll use to formalize "more likely":

P(e) -- a priori probability. The chance that e happens. For example, if e is the English string "I like snakes," then P(e) is the chance that a certain person at a certain time will say "I like snakes" as opposed to saying something else.

P(f | e) -- conditional probability. The chance of f given e. For example, if e is the English string "I like snakes," and if f is the French string "maison bleue," then P(f | e) is the chance that upon seeing e, a translator will produce f. Not bloody likely, in this case.

P(e,f) -- joint probability. The chance of e and f both happening. If e and f don't influence each other, then we can write P(e,f) = P(e) * P(f). For example, if e stands for "the first roll of the die comes up 5" and f stands for "the second roll of the die comes up 3," then P(e,f) = P(e) * P(f) = 1/6 * 1/6 = 1/36. If e and f do influence each other, then we had better write P(e,f) = P(e) * P(f | e). That means: the chance that "e happens" times the chance that "if e happens, then f happens." If e and f are strings that are mutual translations, then there's definitely some influence.

Exercise. P(e,f) = P(f) * ?

All these probabilities are between zero and one, inclusive. A probability of 0.5 means "there's a half a chance."

----------------------------------------------------------------------------------

3. Sums and Products

To represent the addition of the integers from 1 to n, we write:

     n
     Σ i  =  1 + 2 + 3 + ... + n
    i=1

For the product of the integers from 1 to n, we write:

     n
     Π i  =  1 * 2 * 3 * ... * n
    i=1

If there's a factor inside a summation that does not depend on what's being summed over, it can be taken outside:

     n                                          n
     Σ i * k  =  k + 2k + 3k + ... + nk  =  k * Σ i
    i=1                                        i=1

Exercise.

     n
     Π i * k  =  ?
    i=1

Sometimes we'll sum over all strings e. Here are some useful things about probabilities:

     Σ P(e) = 1
     e

     Σ P(e | f) = 1
     e

     P(f) = Σ P(e) * P(f | e)
            e

You can read the last one like this: "Suppose f is influenced by some event. Then for each possible influencing event e, we calculate the chance that (1) e happened, and (2) if e happened, then f happened. To cover all possible influencing events, we add up all those chances."
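That last identity (marginalizing over the influencing event e) is easy to check numerically. Here is a small Python sketch, not part of the original workbook; the two English sentences, the single French sentence, and every probability in it are invented purely for illustration.

    # Toy check of   P(f) = sum over e of  P(e) * P(f | e)
    # All sentences and numbers below are made up for illustration.

    p_e = {
        "I like snakes": 0.2,       # P(e): chance someone says this English sentence
        "the house is blue": 0.8,
    }

    # P(f | e) for one fixed French sentence f = "maison bleue"
    p_f_given_e = {
        "I like snakes": 0.01,
        "the house is blue": 0.5,
    }

    # Marginalize: add up, over every possible influencing event e,
    # the chance that e happened AND that e then produced f.
    p_f = sum(p_e[e] * p_f_given_e[e] for e in p_e)
    print(p_f)   # 0.2*0.01 + 0.8*0.5 ≈ 0.402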
----------------------------------------------------------------------------------

4. Statistical Machine Translation

Given a French sentence f, we seek the English sentence e that maximizes P(e | f). (The "most likely" translation). Sometimes we write:

     argmax P(e | f)
        e

Read this argmax as follows: "the English sentence e, out of all such sentences, which yields the highest value for P(e | f)." If you want to think of this in terms of computer programs, you could imagine one program that takes a pair of sentences e and f, and returns a probability P(e | f). We will look at such a program later on.

     e ----> +--+
             |  | ----> P(e | f)
     f ----> +--+

Or, you could imagine another program that takes a sentence f as input, and outputs every conceivable string ei along with its P(ei | f). This program would take a long time to run, even if you limit English translations to some arbitrary length.

             +--+ ----> e1, P(e1 | f)
     f ----> |  |   ...
             +--+ ----> en, P(en | f)

----------------------------------------------------------------------------------

5. The Noisy Channel

Memorize Bayes Rule, it's very important!

     P(e | f) = P(e) * P(f | e) / P(f)

Exercise: Now prove it, using the exercise in section 2.

Using Bayes Rule, we can rewrite the expression for the most likely translation:

     argmax P(e | f)  =  argmax P(e) * P(f | e)
        e                   e

Exercise: What happened to P(f)?

That means the most likely translation e maximizes the product of two terms, (1) the chance that someone would say e in the first place, and (2) if he did say e, the chance that someone else would translate it into f.

The noisy channel works like this. We imagine that someone has e in his head, but by the time it gets onto the printed page it is corrupted by "noise" and becomes f. To recover the most likely e, we reason about (1) what kinds of things people say in English, and (2) how English gets turned into French. These are sometimes called "source modeling" and "channel modeling." People use the noisy channel metaphor for a lot of engineering problems, like actual noise on telephone transmissions.

If you want to think of P(e) in terms of computer programs, you can think of one program that takes any English string e and outputs a probability P(e). We'll see such a program pretty soon. Or, likewise, you can think of a program that produces a long list of all sentences ei with their associated probabilities P(ei).

             +--+
     e ----> |  | ----> P(e)
             +--+

     +--+ ----> e1, P(e1)
     |  |   ...
     +--+ ----> en, P(en)

To think about the P(f | e) factor, imagine another program that takes a pair of sentences e and f, and outputs P(f | e). Or, likewise, a program that takes a sentence e and produces various sentences fi along with corresponding probabilities P(fi | e).
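To make the rewritten argmax concrete, here is another small Python sketch, again not from the workbook: it scores a hand-built list of candidate English sentences for one French sentence using the noisy-channel product P(e) * P(f | e). The candidate sentences and every probability are invented for illustration; a real system would get P(e) from a language model and P(f | e) from a translation model rather than from a lookup table.

    # Noisy-channel decoding over a tiny, hand-built candidate list:
    #     e* = argmax over e of  P(e) * P(f | e)
    # All strings and numbers below are made up for illustration.

    f = "maison bleue"                       # the French sentence to translate

    candidates = {
        # e                : (P(e),  P(f | e))
        "blue house"       : (0.10,  0.40),
        "house blue"       : (0.01,  0.45),  # good word match, but unlikely English
        "I like snakes"    : (0.20,  0.0001),
    }

    def score(e):
        p_e, p_f_given_e = candidates[e]
        # Proportional to P(e | f): the constant factor 1/P(f) is the same
        # for every e, so it drops out of the argmax.
        return p_e * p_f_given_e

    best_e = max(candidates, key=score)
    print(best_e, score(best_e))             # "blue house", ~0.04

Note how the comment inside score answers the "What happened to P(f)?" exercise: P(f) does not depend on e, so dividing by it cannot change which e wins the argmax.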