The Sparse Data Problem Smoothing Bonnie Dorr Christof Monz Maximum likelihood estimation works fine for data that occur frequently in the training corpus Problem 1 Low frequency n grams CMSC 723 Introduction to Computational Linguistics Lecture 5 October 6 2004 The Sparse Data Problem If n gram x occurs twice and n gram y occurs once Is x really twice as likely as y Problem 2 Zero counts If n gram y does not occur in the training data does that mean that it should have probability zero Smoothing Redistributing Probability Mass Data sparseness is a serious and frequently occurring problem Probability of a sequence is zero if it contains unseen n grams Add One Smoothing Most simple smoothing technique For all n grams including unseen ngrams add one to their counts Un smoothed probability c w P w N Add One Smoothing P wn wn 1 C wn 1w n C wn 1 P 1 w n wn 1 C wn 1wn 1 C wn 1 V Add one probability c w 1 P 1 w N V 1 Add One Smoothing Add One Smoothing Pro Very simple technique Cons ci c w i wi 1 N ci c i 1 N V Too much probability mass is shifted towards unseen n grams Probability of frequent n grams is underestimated Probability of rare or unseen n grams is overestimated All unseen n grams are smoothed in the same way Using a smaller added counted does not solve this problem in principle Witten Bell Discounting Witten Bell Smoothing Probability mass is shifted around depending on the context of words If P wi wi 1 wi m 0 then the smoothed probability PWB wi wi 1 wi m is higher if the sequence wi 1 wi m occurs with many different words wi Witten Bell Smoothing Witten Bell Smoothing If c wi 1 wi 0 P WB Let s consider bi grams T wi 1 is the number of different words types that occur to the right of wi 1 N wi 1 is the number of all word occurrences tokens to the right of wi 1 Z wi 1 is the number of bigrams in the current data set starting with wi 1 that do not occur in the training data ci T w i 1 w i w i 1 Z w i 1 N T w i 1 If c wi 1 wi 0 P WB w i w i 1 N ci ci 1 N V c w i 1w i N w i 1 T w i 1 N ci T Z N T if c i 0 ci otherwise N N T 2 Witten Bell Smoothing Backoff Smoothing Witten Bell Smoothing is more conservative when subtracting probability mass Gives rather good estimates Problem If wi 1 and wi did not occur in the training data the smoothed probability is still zero Deleted interpolation If the n gram wi n w i is not in the training data use wi n 1 wi More general combine evidence from different ngrams PDI w i w i n w i 1 P w i w i n w i 1 1 PDI w i w i n 1 w i 1 Where lambda is the confidence weight for the longer n gram Compute lambda parameters from held out data Lambdas can be n gram specific Other Smoothing Approaches Good Turing Discounting Re estimate amount of probability mass for zero or low count n grams by looking at n grams with higher counts Kneser Ney Smoothing Similar to Witten Bell smoothing but considers number of word types preceding a word Katz Backoff Smoothing Reverts to shorter ngram contexts if the count for the current ngram is lower than some threshold 3
View Full Document
Unlocking...