UMD CMSC 723 - N-Gram Language Models - D2514883

Home> Schools> University of Maryland, College Park> Computer Science (CMSC) > CMSC 723> N-Gram Language Models

UMD CMSC 723 - N-Gram Language Models

School name University of Maryland, College Park

Course Cmsc 723- Computational Linguistics I

Pages 48

Download Save

Unformatted text preview:

Slide 1N-Gram Language ModelsHuh?N-Gram Language ModelsN-Gram Language ModelsN-Gram Language ModelsComputing ProbabilitiesApproximating ProbabilitiesApproximating ProbabilitiesApproximating ProbabilitiesBuilding N-Gram Language ModelsVocabulary Size: Heaps’ LawHeaps’ Law for RCV1Building N-Gram ModelsExample: Bigram Language ModelBuilding N-Gram ModelsMore Context, More WorkData SparsityData SparsitySmoothingLaplace’s LawLaplace’s Law: ProbabilitiesLaplace’s Law: FrequenciesLaplace’s LawLidstone’s Law of SuccessionGood-Turing EstimatorGood-Turing EstimatorGood-Turing EstimatorGood-Turing Estimator: ExampleGood-Turing EstimatorGood-Turing EstimatorCombining EstimatorsLinear MLE InterpolationLinear MLE InterpolationBackoff ModelsBackoff ModelsKatz BackoffKatz Backoff (from textbook)Katz BackoffKneser-Ney SmoothingKneser-Ney SmoothingKneser-Ney SmoothingExplicitly Modeling OOVEvaluating Language ModelsComputing PerplexityPractical EvaluationTypical “State of the Art” LMsTake-Away MessagesN-Gram Language ModelsCMSC 723: Computational Linguistics I ― Session #9Jimmy LinThe iSchoolUniversity of MarylandWednesday, October 28, 2009N-Gram Language ModelsWhat? LMs assign probabilities to sequences of tokensWhy?Statistical machine translationSpeech recognitionHandwriting recognitionPredictive text inputHow?Based on previous word historiesn-gram = consecutive sequences of tokensHuh?Noam Chomsky Fred JelinekBut it must be recognized that the notion “probability of a sentence” is an entirely useless one, under any known interpretation of this term. (1969, p. 57)Anytime a linguist leaves the group the recognition rate goes up. (1988)Every time I fire a linguist…This is a sentenceN-Gram Language ModelsN=1 (unigrams)Unigrams:This,is, a, sentenceSentence of length s, how many unigrams?This is a sentenceN-Gram Language ModelsBigrams:This is,is a, a sentenceN=2 (bigrams)Sentence of length s, how many bigrams?This is a sentenceN-Gram Language ModelsTrigrams:This is a,is a sentenceN=3 (trigrams)Sentence of length s, how many trigrams?Computing ProbabilitiesIs this practical?No! Can’t keep track of all possible histories of all words![chain rule]Approximating ProbabilitiesBasic idea: limit history to fixed number of words N(Markov Assumption)N=1: Unigram Language ModelRelation to HMMs?Approximating ProbabilitiesBasic idea: limit history to fixed number of words N(Markov Assumption)N=2: Bigram Language ModelRelation to HMMs?Approximating ProbabilitiesBasic idea: limit history to fixed number of words N(Markov Assumption)N=3: Trigram Language ModelRelation to HMMs?Building N-Gram Language ModelsUse existing sentences to compute n-gram probability estimates (training)Terminology:N = total number of words in training data (tokens)V = vocabulary size or number of unique words (types)C(w1,...,wk) = frequency of n-gram w1, ..., wk in training dataP(w1, ..., wk) = probability estimate for n-gram w1 ... wkP(wk|w1, ..., wk-1) = conditional probability of producing wk given the history w1, ... wk-1What’s the vocabulary size?Vocabulary Size: Heaps’ LawHeaps’ Law: linear in log-log spaceVocabulary size grows unbounded!bkTM M is vocabulary sizeT is collection size (number of documents)k and b are constantsTypically, k is between 30 and 100, b is between 0.4 and 0.6Heaps’ Law for RCV1Reuters-RCV1 collection: 806,791 newswire documents (Aug 20, 1996-August 19, 1997)k = 44b = 0.49First 1,000,020 terms: Predicted = 38,323 Actual = 38,365Manning, Raghavan, Schütze, Introduction to Information Retrieval (2008)Building N-Gram ModelsStart with what’s easiest!Compute maximum likelihood estimates for individual n-gram probabilitiesUnigram:Bigram: Uses relative frequencies as estimatesMaximizes the likelihood of the data given the model P(D|M)Why not just substitute P(wi) ?Example: Bigram Language ModelNote: We don’t ever cross sentence boundariesI am SamSam I amI do not like green eggs and ham<s><s><s></s></s></s>Training CorpusP( I | <s> ) = 2/3 = 0.67 P( Sam | <s> ) = 1/3 = 0.33P( am | I ) = 2/3 = 0.67 P( do | I ) = 1/3 = 0.33P( </s> | Sam )= 1/2 = 0.50 P( Sam | am) = 1/2 = 0.50...Bigram Probability EstimatesBuilding N-Gram ModelsStart with what’s easiest!Compute maximum likelihood estimates for individual n-gram probabilitiesUnigram:Bigram: Uses relative frequencies as estimatesMaximizes the likelihood of the data given the model P(D|M)Why not just substitute P(wi) ?Let’s revisit this issue…More Context, More WorkLarger N = more contextLexical co-occurrencesLocal syntactic relationsMore context is better?Larger N = more complex modelFor example, assume a vocabulary of 100,000How many parameters for unigram LM? Bigram? Trigram?Larger N has another more serious and familiar problem!Data SparsityP(I like ham)= P( I | <s> ) P( like | I ) P( ham | like ) P( </s> | ham )= 0P( I | <s> ) = 2/3 = 0.67 P( Sam | <s> ) = 1/3 = 0.33P( am | I ) = 2/3 = 0.67 P( do | I ) = 1/3 = 0.33P( </s> | Sam )= 1/2 = 0.50 P( Sam | am) = 1/2 = 0.50...Bigram Probability EstimatesWhy?Why is this bad?Data SparsitySerious problem in language modeling!Becomes more severe as N increasesWhat’s the tradeoff?Solution 1: Use larger training corporaCan’t always work... Blame Zipf’s Law (Looong tail)Solution 2: Assign non-zero probability to unseen n-gramsKnown as smoothingSmoothingZeros are bad for any statistical estimatorNeed better estimators because MLEs give us a lot of zerosA distribution without zeros is “smoother”The Robin Hood Philosophy: Take from the rich (seen n-grams) and give to the poor (unseen n-grams)And thus also called discountingCritical: make sure you still have a valid probability distribution!Language modeling: theory vs. practiceLaplace’s LawSimplest and oldest smoothing techniqueJust add 1 to all n-gram counts including the unseen onesSo, what do the revised estimates look like?Laplace’s Law: ProbabilitiesUnigramsBigramsWhat if we don’t know V?Careful, don’t confuse the N’s!Laplace’s Law: FrequenciesExpected Frequency EstimatesRelative DiscountLaplace’s LawBayesian estimator with uniform priorsMoves too much mass over to unseen n-gramsWhat if we added a fraction of 1 instead?Lidstone’s Law of SuccessionAdd 0 < γ < 1 to each count insteadThe smaller γ is, the lower the mass

View Full Document


School:
Email:
New Password:
Confirm Password:

UMD CMSC 723 - N-Gram Language Models

Sign up for free to view:

Please select your school