Smoothing
Bonnie Dorr, Christof Monz
CMSC 723: Introduction to Computational Linguistics
Lecture 5, October 6, 2004

The Sparse Data Problem
- Maximum likelihood estimation works fine for data that occur frequently in the training corpus.
- Problem 1: Low-frequency n-grams. If n-gram x occurs twice and n-gram y occurs once, is x really twice as likely as y?
- Problem 2: Zero counts. If n-gram y does not occur in the training data, does that mean it should have probability zero?

The Sparse Data Problem (cont.)
- Data sparseness is a serious and frequently occurring problem.
- The probability of a sequence is zero if it contains any unseen n-grams.

Smoothing = Redistributing Probability Mass
- Smoothing takes some probability mass away from seen n-grams and redistributes it to unseen n-grams.

Add-One Smoothing
- The simplest smoothing technique.
- For all n-grams, including unseen n-grams, add one to their counts.
- Un-smoothed (maximum likelihood) probability: P(w) = c(w) / N
- Add-one probability: P+1(w) = (c(w) + 1) / (N + V),
  where N is the number of tokens and V is the vocabulary size.

Add-One Smoothing (bigrams)
- Un-smoothed: P(wn | wn-1) = C(wn-1 wn) / C(wn-1)
- Add-one: P+1(wn | wn-1) = [C(wn-1 wn) + 1] / [C(wn-1) + V]
- (A small Python sketch of add-one smoothing appears at the end of these notes.)

Add-One Smoothing (adjusted counts)
- The smoothed probabilities correspond to adjusted counts
  ci' = (ci + 1) * N / (N + V), where ci = c(wi-1, wi).

Add-One Smoothing (pros and cons)
- Pro: very simple technique.
- Cons:
  - Too much probability mass is shifted towards unseen n-grams.
  - The probability of frequent n-grams is underestimated.
  - The probability of rare (or unseen) n-grams is overestimated.
  - All unseen n-grams are smoothed in the same way.
  - Using a smaller added count does not solve this problem in principle.

Witten-Bell Discounting
- Probability mass is shifted around, depending on the context of words.
- If P(wi | wi-1, ..., wi-m) = 0, then the smoothed probability PWB(wi | wi-1, ..., wi-m) is higher if the sequence wi-1, ..., wi-m occurs with many different words wi.

Witten-Bell Smoothing
- Let's consider bigrams:
  - T(wi-1) is the number of different words (types) that occur to the right of wi-1.
  - N(wi-1) is the number of all word occurrences (tokens) to the right of wi-1.
  - Z(wi-1) is the number of bigrams in the current data set starting with wi-1 that do not occur in the training data.

Witten-Bell Smoothing (formulas)
- If c(wi-1, wi) = 0:
  PWB(wi | wi-1) = T(wi-1) / [Z(wi-1) * (N(wi-1) + T(wi-1))]
- If c(wi-1, wi) > 0:
  PWB(wi | wi-1) = c(wi-1, wi) / (N(wi-1) + T(wi-1))
- (A Python sketch of these formulas appears at the end of these notes.)

Witten-Bell Smoothing (adjusted counts)
- Add-one adjusted counts: ci' = (ci + 1) * N / (N + V)
- Witten-Bell adjusted counts:
  ci' = (T / Z) * N / (N + T)   if ci = 0
  ci' = ci * N / (N + T)        otherwise

Witten-Bell Smoothing (discussion)
- Witten-Bell smoothing is more conservative when subtracting probability mass.
- It gives rather good estimates.
- Problem: if wi-1 and wi did not occur in the training data, the smoothed probability is still zero.

Backoff Smoothing
- Deleted interpolation:
  - If the n-gram wi-n, ..., wi is not in the training data, use wi-(n-1), ..., wi.
  - More generally, combine evidence from different n-grams:
    PDI(wi | wi-n, ..., wi-1) = λ * P(wi | wi-n, ..., wi-1) + (1 - λ) * PDI(wi | wi-(n-1), ..., wi-1)
  - Lambda is the 'confidence' weight for the longer n-gram.
  - The lambda parameters are computed from held-out data.
  - Lambdas can be n-gram specific.
  - (A Python sketch of deleted interpolation appears at the end of these notes.)

Other Smoothing Approaches
- Good-Turing Discounting: re-estimates the amount of probability mass for zero-count (or low-count) n-grams by looking at n-grams with higher counts.
- Kneser-Ney Smoothing: similar to Witten-Bell smoothing, but considers the number of word types preceding a word.
- Katz Backoff Smoothing: reverts to shorter n-gram contexts if the count for the current n-gram is below some threshold.
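
Code sketch: add-one smoothing (referenced above)

A minimal Python sketch of the add-one bigram formula P+1(wn | wn-1) = [C(wn-1 wn) + 1] / [C(wn-1) + V]. The toy corpus, function name, and variable names are illustrative assumptions, not part of the original lecture.

```python
from collections import Counter

def add_one_bigram_prob(w_prev, w, bigram_counts, unigram_counts, vocab_size):
    """Add-one (Laplace) smoothed bigram probability:
    P+1(w | w_prev) = (C(w_prev w) + 1) / (C(w_prev) + V)."""
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + vocab_size)

# Toy corpus (illustrative only)
tokens = "the cat sat on the mat the cat ate".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
V = len(unigram_counts)  # vocabulary size

# Both seen and unseen bigrams receive non-zero probability
print(add_one_bigram_prob("the", "cat", bigram_counts, unigram_counts, V))  # seen bigram
print(add_one_bigram_prob("the", "ate", bigram_counts, unigram_counts, V))  # unseen bigram
```

Note that the unseen bigram ("the", "ate") gets the same probability as any other unseen bigram with the same history, which is exactly the "all unseen n-grams are smoothed in the same way" drawback listed above.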
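
Code sketch: Witten-Bell smoothing (referenced above)

A sketch of the Witten-Bell bigram formulas, computing T(wi-1) and N(wi-1) from the counts. It assumes the "current data set" is the training vocabulary itself, so Z(wi-1) = V - T(wi-1); the corpus and names are illustrative.

```python
from collections import Counter, defaultdict

def witten_bell_bigram(tokens):
    """Build P_WB(w | w_prev) from a token list, following the bigram
    formulas in the slides."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    vocab = set(tokens)

    followers = defaultdict(set)  # word types seen after each history
    n_tokens = Counter()          # N(w_prev): tokens seen after each history
    for (w_prev, w), c in bigram_counts.items():
        followers[w_prev].add(w)
        n_tokens[w_prev] += c

    def prob(w_prev, w):
        T = len(followers[w_prev])   # T(w_prev)
        N = n_tokens[w_prev]         # N(w_prev)
        Z = len(vocab) - T           # Z(w_prev), assuming the vocabulary as the data set
        c = bigram_counts[(w_prev, w)]
        if c > 0:
            return c / (N + T)
        return T / (Z * (N + T))

    return prob

tokens = "the cat sat on the mat the cat ate".split()
p_wb = witten_bell_bigram(tokens)
print(p_wb("the", "cat"))  # seen bigram: c / (N + T)
print(p_wb("the", "ate"))  # unseen bigram: a share of the reserved mass T / (N + T)
```

For a fixed history, the seen-bigram probabilities and the Z(wi-1) unseen-bigram probabilities sum to one, which is the sense in which probability mass is shifted around depending on how many different words follow the history.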
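
Code sketch: deleted interpolation (referenced above)

A two-level instance of the deleted-interpolation formula, PDI(wi | wi-1) = λ * P(wi | wi-1) + (1 - λ) * P(wi), i.e. a bigram interpolated with a unigram. The fixed λ = 0.7 is a placeholder; as the slides note, λ would be estimated from held-out data and can be n-gram specific.

```python
from collections import Counter

def deleted_interpolation_bigram(tokens, lam=0.7):
    """P_DI(w | w_prev) = lam * P_ML(w | w_prev) + (1 - lam) * P_ML(w).
    lam is a placeholder; in practice it is tuned on held-out data."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)

    def prob(w_prev, w):
        p_uni = unigram_counts[w] / total
        if unigram_counts[w_prev] > 0:
            p_bi = bigram_counts[(w_prev, w)] / unigram_counts[w_prev]
        else:
            p_bi = 0.0  # unseen history: fall back entirely on the unigram term
        return lam * p_bi + (1 - lam) * p_uni

    return prob

tokens = "the cat sat on the mat the cat ate".split()
p_di = deleted_interpolation_bigram(tokens)
print(p_di("the", "cat"))  # seen bigram
print(p_di("the", "ate"))  # unseen bigram, still non-zero via the unigram term
```

Unlike Witten-Bell smoothing, this estimate stays non-zero even when the history wi-1 itself was never observed (the problem noted in the discussion slide), because the shorter n-gram estimate always contributes.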