CSE 842: Natural Language Processing
Lecture 7: Language Modeling
2/7/2011, Spring 2011, MSU

Language Modeling
• We can model the word prediction task as the ability to assess the conditional probability of a word given the previous words in the sequence: P(w_n | w_1, w_2, ..., w_{n-1}).
• We'll call a statistical model that can assess this a Language Model.

Language Modeling
• How might we go about calculating such a conditional probability?

  P(the | its water is so transparent that) = Count(its water is so transparent that the) / Count(its water is so transparent that)

Language Modeling
• Unfortunately, for most sequences and for most text collections we won't get good estimates from this method.
  - What we're likely to get is 0. Or worse, 0/0.
• Clearly, we'll have to be a little more clever.
  - Let's use the chain rule of probability,
  - and a particularly useful independence assumption.

Chain Rule
The probability of a word sequence is the probability of a conjunctive event:

  P(w_1^n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1^2) ... P(w_n | w_1^{n-1}) = ∏_{k=1}^{n} P(w_k | w_1^{k-1})

Unfortunately, that's really not helpful in general. Why?

Markov Assumption

  P(w_n | w_1^{n-1}) ≈ P(w_n | w_{n-N+1}^{n-1})

• P(w_n) can be approximated using only the N-1 previous words of context.
• This lets us collect statistics in practice.
• Markov models are the class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past.
• Order of a Markov model: the length of the prior context.

Simple N-Grams
• An N-gram model uses the previous N-1 words to predict the next one: P(w_n | w_{n-N+1}, ..., w_{n-1}).
• We'll pretty much always be dealing with P(<word> | <some prefix>):
  - unigrams: P(dog)
  - bigrams: P(dog | big)
  - trigrams: P(dog | the big)
  - quadrigrams: P(dog | the big dopey)

How do we get the N-gram probabilities?
Use Maximum Likelihood Estimation. What is Maximum Likelihood Estimation?

MLE for N-grams
N-gram models can be trained by counting and normalization:

  Bigram: P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
  N-gram: P(w_n | w_{n-N+1}^{n-1}) = C(w_{n-N+1}^{n-1} w_n) / C(w_{n-N+1}^{n-1})

Log Probability
• Individual probabilities can be small, and the product of many probabilities can be VERY small.
• It is usually best to compute N-gram probabilities in log space, e.g.,

  log_2 P(w_1 ... w_n) = log_2 ∏_{i=2}^{n} P(w_i | w_{i-1}) = ∑_{i=2}^{n} log_2 P(w_i | w_{i-1})

• In the end, the actual probability can be restored by taking the anti-log: 2^{log_2 X} = X.

Some Useful Empirical Observations
• A small number of events occur with high frequency.
• A large number of events occur with low frequency.
• You can quickly collect statistics on the high-frequency events.
• You might have to wait an arbitrarily long time to get valid statistics on low-frequency events.
• Some of the zeroes in the table are really zeroes. But others are simply low-frequency events you haven't seen yet.
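Before the slides turn to the problems with MLE, here is a minimal Python sketch of the counting-and-normalization and log-space ideas above. It is not from the original lecture: the toy corpus and the helper names p_mle and log2_prob are assumptions made purely for illustration.

```python
# Illustrative sketch: bigram MLE by counting and normalization,
# scored in log space to avoid numerical underflow.
import math
from collections import Counter

corpus = [
    "its water is so transparent that the water is clear",
    "the dog saw the big dog",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()                     # naive whitespace tokenization
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p_mle(word, prev):
    """MLE bigram estimate: P(word | prev) = C(prev word) / C(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def log2_prob(tokens):
    """Sum of log2 bigram probabilities; -inf if any bigram was never seen."""
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        p = p_mle(word, prev)
        if p == 0.0:
            return float("-inf")                  # the zero-probability problem the next slides discuss
        total += math.log2(p)
    return total

print(p_mle("dog", "the"))                        # 1/3 on this toy corpus
print(log2_prob("the big dog".split()))           # log2(1/3) + log2(1) ≈ -1.585
```

A real model would also add sentence-boundary markers and, as the following slides motivate, some form of smoothing for unseen events.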
Problem with the MLE Estimate - Example
Suppose that in a corpus there are 10 instances of "come across ...":
  come across as:   8
  come across more: 1
  come across a:    1
Using MLE estimation:
  P(as | come across)   = 0.8
  P(more | come across) = 0.1
  P(a | come across)    = 0.1
Any problem?

Problem with the MLE Estimate - Example (cont.)
In addition to the estimates above, MLE gives
  P(x | come across) = 0 for any x not among the above 3 words.
The MLE does not capture the fact that there are other words that can follow "come across", e.g., "the", "some", etc.; they just do not appear in the training set.

Problem with the MLE Estimate
While there are a limited number of frequent events in language, there is a seemingly never-ending tail to the probability distribution of rarer events, and we can never collect enough data to get to the end of the tail: Zipf's law.
We need to devise better estimators that allow for the possibility that we will see events we didn't see in the training text.

Discounting Method
Suppose we've seen the following counts, with ML estimates P_ML(w_i | w_{i-1}) = Count(w_{i-1}, w_i) / Count(w_{i-1}):

  x               Count(x)   P_ML(w_i | w_{i-1})
  the             50
  the flower      16         16/50
  the dog         14         14/50
  the park         8          8/50
  the tree         6          6/50
  the woman        3          3/50
  the path         1          1/50
  the pond         1          1/50
  the afternoon    1          1/50

The ML estimates are systematically high, particularly for low-count items.

Discounting Method (cont.)
Now define "discounted" counts, Count*(x) = Count(x) - 0.5, giving new estimates Count*(w_{i-1}, w_i) / Count(w_{i-1}):

  x               Count(x)   Count*(x)   Count*(x) / Count(the)
  the             50
  the flower      16         15.5        15.5/50
  the dog         14         13.5        13.5/50
  the park         8          7.5         7.5/50
  the tree         6          5.5         5.5/50
  the woman        3          2.5         2.5/50
  the path         1          0.5         0.5/50
  the pond         1          0.5         0.5/50
  the afternoon    1          0.5         0.5/50

Discounting Method (cont.)
• We now have some "missing probability mass":

  α(w_{i-1}) = 1 - ∑_w Count*(w_{i-1}, w) / Count(w_{i-1})

  e.g., in our example, α(w_{i-1}) = 4/50.
• Divide the remaining probability mass among the words w for which Count(w_{i-1}, w) = 0.

Smoothing Techniques
Smoothing: re-evaluate some of the zero-probability and low-probability N-grams and assign them non-zero values.
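As a rough illustration of the discounting computation above, here is a short Python sketch that reproduces the table's numbers. It is not from the original slides; the variable names and the dictionary of counts are assumptions copied from the example for illustration only.

```python
# Illustrative sketch: absolute discounting (subtract 0.5 from each observed count)
# over the bigram counts for words following "the", as in the table above.
counts_after_the = {
    "flower": 16, "dog": 14, "park": 8, "tree": 6,
    "woman": 3, "path": 1, "pond": 1, "afternoon": 1,
}
count_the = 50          # Count("the") from the slide
DISCOUNT = 0.5          # amount subtracted from every observed count

# Discounted estimates: Count*(the, w) / Count(the)
discounted = {w: (c - DISCOUNT) / count_the for w, c in counts_after_the.items()}

# Missing probability mass: alpha(the) = 1 - sum_w Count*(the, w) / Count(the)
alpha = 1.0 - sum(discounted.values())

print(discounted["flower"])   # 15.5 / 50 = 0.31
print(alpha)                  # 8 bigram types * 0.5 / 50 = 4/50 = 0.08 (up to float rounding)
```

The mass alpha is what would then be shared among words never observed after "the" (e.g., in proportion to a lower-order distribution, as back-off style smoothing schemes typically do).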

