Lecture 4: Walk-ing the walk; talk-ing the talk
Professor Robert C. Berwick
6.863J/9.611J SP11

The Menu Bar
• Administrivia: Lab 2 is out
  – Note that you are expected to figure out the 'guts' of the discounting methods and how to implement them
• Today: finish smoothing and word parsing (chapter 3 of the J&M textbook)
• Two-level "morpho-phonology" (why 'two level'? Gold star idea: match the representation to the task)
• How does it work? Salivation nation; plumbing the Wells
• Two sets of finite-state transducers
• Hunting through an example so that we're not out-foxed
• Spy vs. spies; multiple languages

Smoothing: the Robin Hood Answer (and the Dark Arts)
Steal (probability mass) from the 'rich' and give to the 'poor' (unseen words). The sum of probability mass must still add up to 1!

Smoothing has 2 parts: (1) discounting and (2) backoff
• Deals with events that have been observed zero times
• Generally improves the accuracy of the model
• Implements Robin Hood by answering 2 questions:
  1. Who do we take the probability mass away from, and how much? (= discounting)
  2. How do we redistribute the probability mass? (= backoff)
Let's cover two of the basic methods for discounting and backoff: Add-1 and Witten-Bell.

Smoothing: the Dark Side of the Force…
This dark art is why NLP is taught in the engineering school. There are actually less hacky smoothing methods, but they require more math and more compute cycles and don't usually work any better. So we'll just motivate and present a few basic methods that are often used in practice.

Simplest fix
• Our estimate for w is now an interpolation (a weighted backoff sum) between the unigram and the uniform distributions:

  \frac{N}{N+V} \cdot \frac{C(w)}{N} + \frac{V}{N+V} \cdot \frac{1}{V} = \frac{C(w)+1}{N+V}

• This method is called 'Add-1' because we add 1 to all the counts and divide by N+V to normalize
• It's really an interpolation of the unigram (or bigram, …) MLE with the uniform distribution

Add-1 smoothing: the Robin Hood Answer
Steal (probability mass) from the 'rich': add a new 1 to everybody (as if there were a new list V long).
Problem: Add-1 adds way too much probability mass to unseen events! (It therefore robs too much from the actually observed events.) V is too big! So how to fix this?

Example from restaurants: add 1

  p(w_i \mid w_{i-1}) = \frac{1 + c(w_{i-1} w_i)}{V + c(w_{i-1})}

p(food | Chinese) = 0.52 dropped to 0.052 !!!

Add-One Smoothing (Laplace, 1720)
Letter trigrams with context "xy" over 26 letter types (columns: observed count, unsmoothed estimate, add-1 count, add-1 estimate):

  xya        1   1/3    2    2/29
  xyb        0   0/3    1    1/29
  xyc        0   0/3    1    1/29
  xyd        2   2/3    3    3/29
  xye        0   0/3    1    1/29
  …
  xyz        0   0/3    1    1/29
  Total xy   3   3/3   29   29/29

Add-One Smoothing
Suppose we're considering 20,000 word types, not 26 letters (same columns as above):

  see the abacus    1   1/3       2       2/20003
  see the abbot     0   0/3       1       1/20003
  see the abduct    0   0/3       1       1/20003
  see the above     2   2/3       3       3/20003
  see the Abram     0   0/3       1       1/20003
  …
  see the zygote    0   0/3       1       1/20003
  Total             3   3/3   20003   20003/20003

"Novel event" = 0-count event (never happened in the training data). Here: 19,998 novel events, with total estimated probability 19,998/20,003. So add-one smoothing thinks we are extremely likely to see novel events rather than words we've actually seen in training. It thinks this only because we have a big dictionary: 20,000 possible events. Is this a good reason?
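To make the add-1 arithmetic concrete, here is a minimal Python sketch (illustrative only; the function and variable names are mine, not the lab's) that applies p(w | context) = (1 + c(context, w)) / (V + c(context)) to the letter-trigram table above and reproduces the 2/29, 1/29, and 3/29 values.

```python
from collections import Counter

def add_one_prob(count_next, context_total, vocab_size):
    """Add-1 (Laplace) estimate: (c + 1) / (context_total + V)."""
    return (count_next + 1) / (context_total + vocab_size)

# Toy data matching the letter-trigram table: after the context "xy" we saw
# "a" once and "d" twice, so the context total is 3; the alphabet has V = 26 letters.
next_letter_counts = Counter({"a": 1, "d": 2})
context_total = sum(next_letter_counts.values())   # 3
V = 26

for letter in ["a", "b", "d"]:
    c = next_letter_counts[letter]                 # Counter returns 0 for unseen letters
    p_mle = c / context_total if context_total else 0.0
    p_add1 = add_one_prob(c, context_total, V)
    print(f"xy{letter}: MLE = {p_mle:.3f}, add-1 = {p_add1:.3f}")  # 2/29, 1/29, 3/29

# Sanity check: the 26 smoothed probabilities still sum to 1.
total = sum(add_one_prob(next_letter_counts[ch], context_total, V)
            for ch in "abcdefghijklmnopqrstuvwxyz")
print(f"sum over alphabet = {total:.6f}")          # 1.000000
```

The same add_one_prob with V = 20,000 word types and the same context total of 3 yields the x/20003 entries in the second table, which is exactly where the 19,998/20,003 of probability mass handed to novel events comes from.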
Serious problems with the Add-1 method
1. It assigns far too much probability to rare n-grams as V (the size of the vocabulary, i.e., the number of distinct words) grows: the amount of probability mass we steal is too large!
2. All unseen n-grams are smoothed in the same way: how we redistribute this probability mass is also wrong
• For bigrams, about half the probability mass goes to unseen events (e.g., for V = 50,000 and corpus size N = 5 x 10^9, V^2 = 2.5 x 10^9)
• So, we must reduce the count for unseen events to something way less than 1…

One fix: just how likely are novel events?
[Figure: bar chart of counts from the Brown Corpus (N ≈ 1 million tokens). For each r = 0, 1, 2, … it shows n_r, the number of word types seen exactly r times: n_0 = novel words (abaringe, Babatinde, cabaret, …), n_1 = singletons (aback, Babbitt, cabanas, …), n_2 = doubletons (Abbas, babel, Cabot, …), and so on up to the most frequent types ("the" and EOS). Total # types = T (the purple bars); total # tokens = N (all bars).]

"Diversity Smoothing"
[Figure: the same Brown Corpus count-of-counts chart (N ≈ 1 million tokens), labeling novel words (in the dictionary V but never occurring), singletons (occur once), and doubletons (occur twice).]
• n_2 = # doubleton types; n_2 · 2 = # doubleton tokens
• \sum_r n_r = total # types = T (purple bars)
• \sum_r (n_r \cdot r) = total # tokens = N (all bars)

Witten-Bell Smoothing/Discounting Idea
If T/N is large, we've seen lots of novel types in the past, so we expect lots more.
• Imagine scanning the corpus in order.
• Each type's first token was novel.
• So we saw T novel types (the purple bars).
Unsmoothed → smoothed:
  2/N → 2/(N+T)
  1/N → 1/(N+T)
  0/N → (T/(N+T)) / n_0
Intuition: when we see a new type w in training, ++count(w); ++count(novel). So p(novel) is estimated as T/(N+T), divided among the n_0 = Z specific novel types.

Witten-Bell discounting: in other words
• A zero-frequency n-gram is simply an event that hasn't happened yet
• When we do see it, it will be for the first time
• We get an MLE estimate of this 'first time' frequency by counting the number of n-gram types in the training data (the number of types is just the number of first-time observed n-grams)
• So the total probability mass of all the zero-count n-grams is estimated as this count of types T (= first-time n-grams), normalized by the number of tokens N plus the number of types T:

  \sum_{i : c_i = 0} p^*_i = \frac{T}{N + T}

• Intuition: a training corpus consists of a series of events, one event for each token and one event for each new type; this formula gives a kind of MLE for the probability of a new-type event occurring

Witten-Bell discounting
• So, the answer to Robin Hood question #1 is:
  – We take away T rather than V from the 'rich' (and note that the # of types
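As a rough sketch of the unigram version of this idea (my own illustrative code, not the lecture's or the lab's implementation), the recipe is: shrink each seen count's probability from c/N to c/(N+T), reserve T/(N+T) for unseen types, and split that reserve evenly among the n_0 = Z types with zero count.

```python
from collections import Counter

def witten_bell_unigram(counts, vocab):
    """Witten-Bell style unigram estimates.

    counts: Counter of observed tokens; vocab: the full set of word types,
    including types that never occurred (the 'novel' ones).
    """
    N = sum(counts.values())        # number of tokens
    T = len(counts)                 # number of observed types = # first-time events
    Z = len(vocab) - T              # n_0: types with zero count
    novel_mass = T / (N + T)        # total mass reserved for unseen types
    probs = {}
    for w in vocab:
        c = counts.get(w, 0)
        probs[w] = c / (N + T) if c > 0 else novel_mass / Z
    return probs

# Tiny example: 6 tokens, 3 observed types, 2 types never seen.
counts = Counter({"the": 3, "walk": 2, "talk": 1})
vocab = {"the", "walk", "talk", "spy", "spies"}
p = witten_bell_unigram(counts, vocab)
print(p["the"], p["spy"])           # 3/9 and (3/9)/2, i.e. about 0.333 and 0.167
print(sum(p.values()))              # 1.0 up to float rounding: N/(N+T) seen + T/(N+T) novel
```

For bigram models the same bookkeeping is usually done per history w_{i-1}, with its own token count N(w_{i-1}) and type count T(w_{i-1}); that conditional form is what Witten-Bell discounting normally refers to for n-gram models.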

