Smoothing
Bonnie Dorr, Christof Monz
CMSC 723: Introduction to Computational Linguistics
Lecture 5, October 6, 2004

The Sparse Data Problem
- Maximum likelihood estimation works fine for data that occur frequently in the training corpus.
- Problem 1: Low-frequency n-grams. If n-gram x occurs twice and n-gram y occurs once, is x really twice as likely as y?
- Problem 2: Zero counts. If n-gram y does not occur in the training data, does that mean it should have probability zero?

The Sparse Data Problem (cont.)
- Data sparseness is a serious and frequently occurring problem.
- The probability of a sequence is zero if it contains any unseen n-grams.

Smoothing = Redistributing Probability Mass
- Smoothing takes some probability mass away from seen n-grams and redistributes it to unseen n-grams.

Add-One Smoothing
- The simplest smoothing technique.
- For all n-grams, including unseen n-grams, add one to their counts.
- Un-smoothed (maximum likelihood) probability: P(w) = c(w) / N
- Add-one probability: P+1(w) = (c(w) + 1) / (N + V),
  where N is the number of tokens and V is the vocabulary size.

Add-One Smoothing (bigrams)
- Un-smoothed: P(wn | wn-1) = C(wn-1 wn) / C(wn-1)
- Add-one: P+1(wn | wn-1) = [C(wn-1 wn) + 1] / [C(wn-1) + V]
- (A small Python sketch of add-one smoothing appears at the end of these notes.)

Add-One Smoothing (adjusted counts)
- The smoothed probabilities correspond to adjusted counts
  ci' = (ci + 1) * N / (N + V), where ci = c(wi-1, wi).

Add-One Smoothing (pros and cons)
- Pro: very simple technique.
- Cons:
  - Too much probability mass is shifted towards unseen n-grams.
  - The probability of frequent n-grams is underestimated.
  - The probability of rare (or unseen) n-grams is overestimated.
  - All unseen n-grams are smoothed in the same way.
  - Using a smaller added count does not solve this problem in principle.

Witten-Bell Discounting
- Probability mass is shifted around, depending on the context of words.
- If P(wi | wi-1, ..., wi-m) = 0, then the smoothed probability PWB(wi | wi-1, ..., wi-m) is higher if the sequence wi-1, ..., wi-m occurs with many different words wi.

Witten-Bell Smoothing
- Let's consider bigrams:
  - T(wi-1) is the number of different words (types) that occur to the right of wi-1.
  - N(wi-1) is the number of all word occurrences (tokens) to the right of wi-1.
  - Z(wi-1) is the number of bigrams in the current data set starting with wi-1 that do not occur in the training data.

Witten-Bell Smoothing (formulas)
- If c(wi-1, wi) = 0:
  PWB(wi | wi-1) = T(wi-1) / [Z(wi-1) * (N(wi-1) + T(wi-1))]
- If c(wi-1, wi) > 0:
  PWB(wi | wi-1) = c(wi-1, wi) / (N(wi-1) + T(wi-1))
- (A Python sketch of these formulas appears at the end of these notes.)

Witten-Bell Smoothing (adjusted counts)
- Add-one adjusted counts: ci' = (ci + 1) * N / (N + V)
- Witten-Bell adjusted counts:
  ci' = (T / Z) * N / (N + T)   if ci = 0
  ci' = ci * N / (N + T)        otherwise

Witten-Bell Smoothing (discussion)
- Witten-Bell smoothing is more conservative when subtracting probability mass.
- It gives rather good estimates.
- Problem: if wi-1 and wi did not occur in the training data, the smoothed probability is still zero.

Backoff Smoothing
- Deleted interpolation:
  - If the n-gram wi-n, ..., wi is not in the training data, use wi-(n-1), ..., wi.
  - More generally, combine evidence from different n-grams:
    PDI(wi | wi-n, ..., wi-1) = λ * P(wi | wi-n, ..., wi-1) + (1 - λ) * PDI(wi | wi-(n-1), ..., wi-1)
  - Lambda is the 'confidence' weight for the longer n-gram.
  - The lambda parameters are computed from held-out data.
  - Lambdas can be n-gram specific.
  - (A Python sketch of deleted interpolation appears at the end of these notes.)

Other Smoothing Approaches
- Good-Turing Discounting: re-estimates the amount of probability mass for zero-count (or low-count) n-grams by looking at n-grams with higher counts.
- Kneser-Ney Smoothing: similar to Witten-Bell smoothing, but considers the number of word types preceding a word.
- Katz Backoff Smoothing: reverts to shorter n-gram contexts if the count for the current n-gram is below some threshold.
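
Code sketch: add-one smoothing (referenced above)

A minimal Python sketch of the add-one bigram formula P+1(wn | wn-1) = [C(wn-1 wn) + 1] / [C(wn-1) + V]. The toy corpus, function name, and variable names are illustrative assumptions, not part of the original lecture.

```python
from collections import Counter

def add_one_bigram_prob(w_prev, w, bigram_counts, unigram_counts, vocab_size):
    """Add-one (Laplace) smoothed bigram probability:
    P+1(w | w_prev) = (C(w_prev w) + 1) / (C(w_prev) + V)."""
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + vocab_size)

# Toy corpus (illustrative only)
tokens = "the cat sat on the mat the cat ate".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
V = len(unigram_counts)  # vocabulary size

# Both seen and unseen bigrams receive non-zero probability
print(add_one_bigram_prob("the", "cat", bigram_counts, unigram_counts, V))  # seen bigram
print(add_one_bigram_prob("the", "ate", bigram_counts, unigram_counts, V))  # unseen bigram
```

Note that the unseen bigram ("the", "ate") gets the same probability as any other unseen bigram with the same history, which is exactly the "all unseen n-grams are smoothed in the same way" drawback listed above.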
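
Code sketch: Witten-Bell smoothing (referenced above)

A sketch of the Witten-Bell bigram formulas, computing T(wi-1) and N(wi-1) from the counts. It assumes the "current data set" is the training vocabulary itself, so Z(wi-1) = V - T(wi-1); the corpus and names are illustrative.

```python
from collections import Counter, defaultdict

def witten_bell_bigram(tokens):
    """Build P_WB(w | w_prev) from a token list, following the bigram
    formulas in the slides."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    vocab = set(tokens)

    followers = defaultdict(set)  # word types seen after each history
    n_tokens = Counter()          # N(w_prev): tokens seen after each history
    for (w_prev, w), c in bigram_counts.items():
        followers[w_prev].add(w)
        n_tokens[w_prev] += c

    def prob(w_prev, w):
        T = len(followers[w_prev])   # T(w_prev)
        N = n_tokens[w_prev]         # N(w_prev)
        Z = len(vocab) - T           # Z(w_prev), assuming the vocabulary as the data set
        c = bigram_counts[(w_prev, w)]
        if c > 0:
            return c / (N + T)
        return T / (Z * (N + T))

    return prob

tokens = "the cat sat on the mat the cat ate".split()
p_wb = witten_bell_bigram(tokens)
print(p_wb("the", "cat"))  # seen bigram: c / (N + T)
print(p_wb("the", "ate"))  # unseen bigram: a share of the reserved mass T / (N + T)
```

For a fixed history, the seen-bigram probabilities and the Z(wi-1) unseen-bigram probabilities sum to one, which is the sense in which probability mass is shifted around depending on how many different words follow the history.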
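
Code sketch: deleted interpolation (referenced above)

A two-level instance of the deleted-interpolation formula, PDI(wi | wi-1) = λ * P(wi | wi-1) + (1 - λ) * P(wi), i.e. a bigram interpolated with a unigram. The fixed λ = 0.7 is a placeholder; as the slides note, λ would be estimated from held-out data and can be n-gram specific.

```python
from collections import Counter

def deleted_interpolation_bigram(tokens, lam=0.7):
    """P_DI(w | w_prev) = lam * P_ML(w | w_prev) + (1 - lam) * P_ML(w).
    lam is a placeholder; in practice it is tuned on held-out data."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)

    def prob(w_prev, w):
        p_uni = unigram_counts[w] / total
        if unigram_counts[w_prev] > 0:
            p_bi = bigram_counts[(w_prev, w)] / unigram_counts[w_prev]
        else:
            p_bi = 0.0  # unseen history: fall back entirely on the unigram term
        return lam * p_bi + (1 - lam) * p_uni

    return prob

tokens = "the cat sat on the mat the cat ate".split()
p_di = deleted_interpolation_bigram(tokens)
print(p_di("the", "cat"))  # seen bigram
print(p_di("the", "ate"))  # unseen bigram, still non-zero via the unigram term
```

Unlike Witten-Bell smoothing, this estimate stays non-zero even when the history wi-1 itself was never observed (the problem noted in the discussion slide), because the shorter n-gram estimate always contributes.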