Smoothing

Bonnie Dorr, Christof Monz
CMSC 723: Introduction to Computational Linguistics
Lecture 5, October 6, 2004

The Sparse Data Problem
- Maximum likelihood estimation works fine for data that occur frequently in the training corpus.
- Problem 1: Low-frequency n-grams. If n-gram x occurs twice and n-gram y occurs once, is x really twice as likely as y?
- Problem 2: Zero counts. If n-gram y does not occur in the training data, does that mean it should have probability zero?

The Sparse Data Problem
- Data sparseness is a serious and frequently occurring problem.
- The probability of a sequence is zero if it contains unseen n-grams.

Smoothing = Redistributing Probability Mass

Add-One Smoothing
- The simplest smoothing technique: for all n-grams, including unseen n-grams, add one to their counts.
- Un-smoothed probability: P(w) = \frac{c(w)}{N}
- Add-one probability: P_{+1}(w) = \frac{c(w) + 1}{N + V}

Add-One Smoothing
- For bigrams:
  P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}
  P_{+1}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}
- Equivalently, as adjusted counts, with c_i = c(w_{i-1}, w_i):
  c_i' = (c_i + 1) \frac{N}{N + V}

Add-One Smoothing
- Pro: very simple technique.
- Cons:
  - Too much probability mass is shifted towards unseen n-grams.
  - The probability of frequent n-grams is underestimated.
  - The probability of rare (or unseen) n-grams is overestimated.
  - All unseen n-grams are smoothed in the same way.
- Using a smaller added count does not solve this problem in principle.

Witten-Bell Discounting
- Probability mass is shifted around depending on the context of words.
- If P(w_i \mid w_{i-1}, \ldots, w_{i-m}) = 0, then the smoothed probability P_{WB}(w_i \mid w_{i-1}, \ldots, w_{i-m}) is higher if the sequence w_{i-1}, \ldots, w_{i-m} occurs with many different words w_i.

Witten-Bell Smoothing
- Consider bigrams:
  - T(w_{i-1}) is the number of different words (types) that occur to the right of w_{i-1}.
  - N(w_{i-1}) is the number of all word occurrences (tokens) to the right of w_{i-1}.
  - Z(w_{i-1}) is the number of bigrams in the current data set starting with w_{i-1} that do not occur in the training data.

Witten-Bell Smoothing
- If c(w_{i-1}, w_i) = 0:
  P_{WB}(w_i \mid w_{i-1}) = \frac{T(w_{i-1})}{Z(w_{i-1}) \, (N(w_{i-1}) + T(w_{i-1}))}
- If c(w_{i-1}, w_i) > 0:
  P_{WB}(w_i \mid w_{i-1}) = \frac{c(w_{i-1} w_i)}{N(w_{i-1}) + T(w_{i-1})}

Witten-Bell Smoothing
- As adjusted counts (writing T, Z, N for T(w_{i-1}), Z(w_{i-1}), N(w_{i-1})), compared with add-one's c_i' = (c_i + 1) \frac{N}{N + V}:
  c_i' = \frac{T}{Z} \cdot \frac{N}{N + T} if c_i = 0
  c_i' = c_i \cdot \frac{N}{N + T} otherwise

Witten-Bell Smoothing
- Witten-Bell smoothing is more conservative when subtracting probability mass.
- It gives rather good estimates.
- Problem: if w_{i-1} and w_i did not occur in the training data, the smoothed probability is still zero.

Backoff Smoothing
- Deleted interpolation: if the n-gram w_{i-n}, \ldots, w_i is not in the training data, use w_{i-(n-1)}, \ldots, w_i.
- More generally, combine evidence from different n-grams:
  P_{DI}(w_i \mid w_{i-n}, \ldots, w_{i-1}) = \lambda P(w_i \mid w_{i-n}, \ldots, w_{i-1}) + (1 - \lambda) P_{DI}(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1})
- Lambda is the 'confidence' weight for the longer n-gram.
- Lambda parameters are computed from held-out data and can be n-gram specific.

Other Smoothing Approaches
- Good-Turing Discounting: re-estimates the amount of probability mass for zero-count (or low-count) n-grams by looking at n-grams with higher counts.
- Kneser-Ney Smoothing: similar to Witten-Bell smoothing, but considers the number of word types preceding a word.
- Katz Backoff Smoothing: reverts to shorter n-gram contexts if the count for the current n-gram is below some threshold.
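To make the add-one formula above concrete, here is a minimal Python sketch (not from the original slides; the function name and toy corpus are illustrative, and unigram counts stand in for the bigram-history count C(w_{n-1})):

```python
from collections import Counter

def add_one_bigram_prob(tokens, vocab_size, w_prev, w):
    """Add-one smoothed bigram probability:
    P+1(w | w_prev) = (C(w_prev w) + 1) / (C(w_prev) + V)."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    unigram_counts = Counter(tokens)  # unigram count used as C(w_prev)
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + vocab_size)

# Toy usage: an unseen bigram still receives non-zero probability.
tokens = "the cat sat on the mat".split()
V = len(set(tokens))
print(add_one_bigram_prob(tokens, V, "the", "cat"))  # seen bigram: 2/7
print(add_one_bigram_prob(tokens, V, "cat", "mat"))  # unseen bigram: 1/6 > 0
```

Note how the unseen bigram gets a sizable share of the mass relative to the seen one, which is exactly the overestimation the slides criticize.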
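The two Witten-Bell cases can be sketched the same way, under the bigram definitions of T, N, and Z above (a sketch only; function and variable names are mine, not from the slides):

```python
from collections import Counter, defaultdict

def witten_bell_bigram_prob(tokens, vocab, w_prev, w):
    """Witten-Bell smoothed bigram probability P_WB(w | w_prev)."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    followers = defaultdict(set)
    for a, b in zip(tokens, tokens[1:]):
        followers[a].add(b)

    T = len(followers[w_prev])  # T(w_prev): distinct follower types
    N = sum(bigram_counts[(w_prev, v)] for v in followers[w_prev])  # N(w_prev): follower tokens
    Z = len(vocab) - T          # Z(w_prev): unseen bigrams starting with w_prev
    if T == 0:
        return 0.0  # w_prev itself unseen: the residual zero-probability problem the slides note
    c = bigram_counts[(w_prev, w)]
    if c > 0:
        return c / (N + T)
    return T / (Z * (N + T))

tokens = "the cat sat on the mat".split()
vocab = set(tokens)
print(witten_bell_bigram_prob(tokens, vocab, "the", "cat"))  # seen: c / (N + T)
print(witten_bell_bigram_prob(tokens, vocab, "the", "sat"))  # unseen: T / (Z (N + T))
```

Unlike add-one, the mass given to unseen continuations here scales with how many different words were actually observed after w_prev.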
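Finally, a recursive sketch of the deleted-interpolation formula. Here `ngram_probs` (a hypothetical dict from n-gram tuples to MLE probabilities) and the per-order `lambdas` are assumed inputs; in practice the lambdas would be estimated on held-out data, as the slides say:

```python
def interpolated_prob(w, context, ngram_probs, lambdas):
    """Deleted interpolation:
    P_DI(w | context) = lambda * P(w | context)
                        + (1 - lambda) * P_DI(w | shorter context)."""
    if not context:
        return ngram_probs.get((w,), 0.0)  # base case: unigram estimate
    lam = lambdas[len(context)]            # one lambda per context length
    p_full = ngram_probs.get(context + (w,), 0.0)
    return lam * p_full + (1 - lam) * interpolated_prob(w, context[1:], ngram_probs, lambdas)

# Toy usage with hand-picked MLE probabilities and a single weight.
ngram_probs = {("the", "cat"): 0.5, ("cat",): 0.1}
lambdas = {1: 0.7}
print(interpolated_prob("cat", ("the",), ngram_probs, lambdas))  # 0.7*0.5 + 0.3*0.1 = 0.38
```

Even when the longest n-gram is unseen (p_full = 0), the shorter contexts keep the estimate non-zero, which is the point of backoff-style smoothing.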