Unformatted text preview:

CMSC 723 Computational Linguistics I Session 9 N Gram Language Models Jimmy Lin The iSchool University of Maryland Wednesday October 28 2009 N Gram Language Models What Why LMs assign probabilities to sequences of tokens Statistical machine translation Speech recognition Handwriting recognition Predictive text input How Based on previous word histories n gram consecutive sequences of tokens Huh Noam Chomsky But it must be recognized that the notion probability of a sentence is an entirely useless one under any known interpretation of this term 1969 p 57 Fred Jelinek Anytime a linguist leaves the group the recognition rate goes up 1988 Every time I fire a linguist N Gram Language Models N 1 unigrams This is a sentence Unigrams This is a sentence Sentence of length s how many unigrams N Gram Language Models N 2 bigrams This is a sentence Bigrams This is is a a sentence Sentence of length s how many bigrams N Gram Language Models N 3 trigrams This is a sentence Trigrams This is a is a sentence Sentence of length s how many trigrams Computing Probabilities chain rule Is this practical No Can t keep track of all possible histories of all words Approximating Probabilities Basic idea limit history to fixed number of words N Markov Assumption N 1 Unigram Language Model Relation to HMMs Approximating Probabilities Basic idea limit history to fixed number of words N Markov Assumption N 2 Bigram Language Model Relation to HMMs Approximating Probabilities Basic idea limit history to fixed number of words N Markov Assumption N 3 Trigram Language Model Relation to HMMs Building N Gram Language Models Use existing sentences to compute n gram probability estimates training Terminology N total number of words in training data tokens V vocabulary size or number of unique words types C w1 wk frequency of n gram w1 wk in training data P w1 wk probability estimate for n gram w1 wk P wk w1 wk 1 conditional probability of producing wk given the history w1 wk 1 What s the vocabulary size Vocabulary Size Heaps Law M kT b M is vocabulary size T is collection size number of documents k and b are constants Typically k is between 30 and 100 b is between 0 4 and 0 6 Heaps Law linear in log log space Vocabulary size grows unbounded Heaps Law for RCV1 k 44 b 0 49 First 1 000 020 terms Predicted 38 323 Actual 38 365 Reuters RCV1 collection 806 791 newswire documents Aug 20 1996 August 19 1997 Manning Raghavan Sch tze Introduction to Information Retrieval 2008 Building N Gram Models Start with what s easiest Compute maximum likelihood estimates for individual n gram probabilities Unigram Why not just substitute P wi Bigram Uses relative frequencies as estimates Maximizes the likelihood of the data given the model P D M Example Bigram Language Model s I am Sam s s Sam I am s s I do not like green eggs and ham s Training Corpus P I s 2 3 0 67 P Sam s 1 3 0 33 P am I 2 3 0 67 P do I 1 3 0 33 P s Sam 1 2 0 50 P Sam am 1 2 0 50 Bigram Probability Estimates Note We don t ever cross sentence boundaries Building N Gram Models Start with what s easiest Compute maximum likelihood estimates for individual n gram probabilities Unigram Bigram Let s revisit this issue Why not just substitute P wi Uses relative frequencies as estimates Maximizes the likelihood of the data given the model P D M More Context More Work Larger N more context Lexical co occurrences Local syntactic relations More context is better Larger N more complex model For example assume a vocabulary of 100 000 How many parameters for unigram LM Bigram Trigram Larger N has another more serious and familiar problem Data Sparsity P I s 2 3 0 67 P Sam s 1 3 0 33 P am I 2 3 0 67 P do I 1 3 0 33 P s Sam 1 2 0 50 P Sam am 1 2 0 50 Bigram Probability Estimates P I like ham P I s P like I P ham like P s ham 0 Why Why is this bad Data Sparsity Serious problem in language modeling Becomes more severe as N increases Solution 1 Use larger training corpora What s the tradeoff Can t always work Blame Zipf s Law Looong tail Solution 2 Assign non zero probability to unseen n grams Known as smoothing Smoothing Zeros are bad for any statistical estimator The Robin Hood Philosophy Take from the rich seen ngrams and give to the poor unseen n grams Need better estimators because MLEs give us a lot of zeros A distribution without zeros is smoother And thus also called discounting Critical make sure you still have a valid probability distribution Language modeling theory vs practice Laplace s Law Simplest and oldest smoothing technique Just add 1 to all n gram counts including the unseen ones So what do the revised estimates look like Laplace s Law Probabilities Unigrams Bigrams Careful don t confuse the N s What if we don t know V Laplace s Law Frequencies Expected Frequency Estimates Relative Discount Laplace s Law Bayesian estimator with uniform priors Moves too much mass over to unseen n grams What if we added a fraction of 1 instead Lidstone s Law of Succession Add 0 1 to each count instead The smaller is the lower the mass moved to the unseen n grams 0 no smoothing The case of 0 5 is known as Jeffery Perks Law or Expected Likelihood Estimation How to find the right value of Good Turing Estimator Intuition Use n grams seen once to estimate n grams never seen and so on Compute Nr frequency of frequency r N0 is the number of items with count 0 N1 is the number of items with count 1 Good Turing Estimator For each r compute an expected frequency estimate smoothed count Replace MLE counts of seen bigrams with the expected frequency estimates and use those for probabilities Good Turing Estimator What about an unseen bigram Do we know N0 Can we compute it for bigrams Good Turing Estimator Example r Nr 1 2 3 4 5 6 138741 25413 10531 5997 3565 N0 14585 2 199252 Cunseen N1 N0 0 00065 Punseen N1 N0 N 1 06 x 10 9 Note Assumes mass is uniformly distributed V 14585 Seen bigrams 199252 C person she 2 C person 223 CGT person she 2 1 10531 25413 1 243 P she person CGT person she 223 0 0056 Good Turing Estimator For each r compute an expected frequency estimate smoothed count Replace MLE counts of seen bigrams with the expected frequency estimates and use those for probabilities What if wi isn t observed Good Turing Estimator Can t replace all MLE counts What about rmax Nr 1 0 for r rmax Solution 1 Only replace counts for r k 10 Solution 2 Fit a curve S through the observed r Nr values and use S r …


View Full Document

UMD CMSC 723 - N-Gram Language Models

Documents in this Course
Lecture 9

Lecture 9

12 pages

Smoothing

Smoothing

15 pages

Load more
Loading Unlocking...
Login

Join to view N-Gram Language Models and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view N-Gram Language Models and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?