
Smoothing
Bonnie Dorr, Christof Monz
CMSC 723: Introduction to Computational Linguistics
Lecture 5, October 6, 2004

The Sparse Data Problem
- Maximum likelihood estimation works fine for data that occur frequently in the training corpus.
- Problem 1: Low-frequency n-grams. If n-gram x occurs twice and n-gram y occurs once, is x really twice as likely as y?
- Problem 2: Zero counts. If n-gram y does not occur in the training data, does that mean it should have probability zero?

The Sparse Data Problem (cont.)
- Data sparseness is a serious and frequently occurring problem.
- The probability of a sequence is zero if it contains unseen n-grams.

Smoothing = Redistributing Probability Mass

Add-One Smoothing
- The simplest smoothing technique.
- For all n-grams, including unseen n-grams, add one to their counts.
- Unsmoothed probability:
      P(w) = \frac{c(w)}{N}
- Add-one probability:
      P_{+1}(w) = \frac{c(w) + 1}{N + V}

Add-One Smoothing (bigrams)
- Unsmoothed: P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}
- Add-one:    P_{+1}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}
- (A small code sketch of add-one smoothing follows these notes.)

Add-One Smoothing (adjusted counts)
- The smoothed probabilities correspond to the adjusted counts
      c_i' = (c_i + 1) \frac{N}{N + V},  where c_i = c(w_{i-1}, w_i).

Add-One Smoothing (pros and cons)
- Pro: very simple technique.
- Cons:
  - Too much probability mass is shifted towards unseen n-grams.
  - The probability of frequent n-grams is underestimated.
  - The probability of rare (or unseen) n-grams is overestimated.
  - All unseen n-grams are smoothed in the same way.
  - Using a smaller added count does not solve this problem in principle.

Witten-Bell Discounting
- Probability mass is shifted around depending on the context of words.
- If P(w_i \mid w_{i-1}, \ldots, w_{i-m}) = 0, then the smoothed probability P_{WB}(w_i \mid w_{i-1}, \ldots, w_{i-m}) is higher if the sequence w_{i-1}, \ldots, w_{i-m} occurs with many different words w_i.

Witten-Bell Smoothing
- Let's consider bigrams:
  - T(w_{i-1}) is the number of different words (types) that occur to the right of w_{i-1}.
  - N(w_{i-1}) is the number of all word occurrences (tokens) to the right of w_{i-1}.
  - Z(w_{i-1}) is the number of bigrams in the current data set starting with w_{i-1} that do not occur in the training data.

Witten-Bell Smoothing (formulas)
- If c(w_{i-1}, w_i) = 0:
      P_{WB}(w_i \mid w_{i-1}) = \frac{T(w_{i-1})}{Z(w_{i-1}) \, (N(w_{i-1}) + T(w_{i-1}))}
- If c(w_{i-1}, w_i) > 0:
      P_{WB}(w_i \mid w_{i-1}) = \frac{c(w_{i-1} w_i)}{N(w_{i-1}) + T(w_{i-1})}

Witten-Bell Smoothing (adjusted counts)
- Add-one adjusted counts:
      c_i' = (c_i + 1) \frac{N}{N + V}
- Witten-Bell adjusted counts:
      c_i' = \frac{T}{Z} \cdot \frac{N}{N + T}   if c_i = 0
      c_i' = c_i \cdot \frac{N}{N + T}           otherwise
- (A small code sketch of Witten-Bell smoothing follows these notes.)

Witten-Bell Smoothing (discussion)
- Witten-Bell smoothing is more conservative when subtracting probability mass.
- It gives rather good estimates.
- Problem: if w_{i-1} and w_i did not occur in the training data, the smoothed probability is still zero.

Backoff Smoothing
- Deleted interpolation: if the n-gram w_{i-n}, \ldots, w_i is not in the training data, use w_{i-(n-1)}, \ldots, w_i.
- More generally, combine evidence from different n-grams:
      P_{DI}(w_i \mid w_{i-n}, \ldots, w_{i-1}) = \lambda P(w_i \mid w_{i-n}, \ldots, w_{i-1}) + (1 - \lambda) P_{DI}(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1})
- Lambda is the 'confidence' weight for the longer n-gram.
- The lambda parameters are computed from held-out data.
- Lambdas can be n-gram specific.
- (A small code sketch of deleted interpolation follows these notes.)

Other Smoothing Approaches
- Good-Turing Discounting: re-estimates the amount of probability mass for zero (or low-count) n-grams by looking at n-grams with higher counts.
- Kneser-Ney Smoothing: similar to Witten-Bell smoothing, but considers the number of word types preceding a word.
- Katz Backoff Smoothing: reverts to shorter n-gram contexts if the count for the current n-gram is lower than some threshold.
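To make the add-one formula concrete, here is a minimal Python sketch for bigrams. It is only an illustration under simple assumptions: a toy whitespace-tokenized corpus and plain Counter dictionaries; the function names (train_counts, p_add_one) are made up for this sketch and are not from the lecture.

```python
from collections import Counter

def train_counts(tokens):
    """Collect unigram and bigram counts from a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def p_add_one(w_prev, w, unigrams, bigrams, vocab_size):
    """Add-one estimate: P_{+1}(w | w_prev) = (C(w_prev, w) + 1) / (C(w_prev) + V)."""
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + vocab_size)

tokens = "the cat sat on the mat".split()
unigrams, bigrams = train_counts(tokens)
V = len(unigrams)  # vocabulary size of the toy corpus
print(p_add_one("the", "cat", unigrams, bigrams, V))  # seen bigram
print(p_add_one("the", "sat", unigrams, bigrams, V))  # unseen bigram, still nonzero
```

Note how the unseen bigram receives a nonzero probability, but at the cost of pulling mass away from the seen bigram, which is the overestimation/underestimation trade-off listed in the cons above.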

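A similar sketch for Witten-Bell smoothing of bigrams, using the T, N, and Z quantities defined in the notes. One assumption worth flagging: Z(w_prev) is approximated here as V - T(w_prev), i.e. the number of vocabulary words never seen after w_prev; the helper names are again illustrative rather than part of the lecture material.

```python
from collections import Counter, defaultdict

def wb_statistics(tokens):
    """Return bigram counts plus per-context type counts T and token counts N."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    followers = defaultdict(set)
    N = Counter()
    for (w_prev, w), c in bigrams.items():
        followers[w_prev].add(w)
        N[w_prev] += c
    T = {w_prev: len(s) for w_prev, s in followers.items()}
    return bigrams, T, N

def p_witten_bell(w_prev, w, bigrams, T, N, vocab_size):
    """P_WB(w | w_prev): seen bigrams get c/(N+T); unseen ones share T/(Z*(N+T))."""
    t, n = T.get(w_prev, 0), N.get(w_prev, 0)
    if t == 0:
        return 0.0  # unseen context: probability stays zero, as the notes point out
    c = bigrams[(w_prev, w)]
    if c > 0:
        return c / (n + t)
    z = vocab_size - t  # assumed approximation of Z(w_prev): follower types never observed
    return t / (z * (n + t))

tokens = "the cat sat on the mat".split()
bigrams, T, N = wb_statistics(tokens)
V = len(set(tokens))
print(p_witten_bell("the", "cat", bigrams, T, N, V))  # seen bigram
print(p_witten_bell("the", "sat", bigrams, T, N, V))  # unseen bigram, nonzero
```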

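Finally, a sketch of deleted interpolation for a trigram model. For simplicity it uses the expanded, non-recursive form of the formula with three fixed lambda weights that sum to one; as the notes state, in practice the lambdas would be estimated on held-out data rather than hard-coded, and the helper names here are invented for the example.

```python
from collections import Counter

def train(tokens):
    """Unigram, bigram, and trigram counts keyed by n-gram order."""
    return {
        1: Counter((w,) for w in tokens),
        2: Counter(zip(tokens, tokens[1:])),
        3: Counter(zip(tokens, tokens[1:], tokens[2:])),
    }

def p_ml(ngram, counts, total_tokens):
    """Maximum-likelihood estimate P(w | history) for an n-gram tuple."""
    if len(ngram) == 1:
        return counts[1][ngram] / total_tokens
    hist = counts[len(ngram) - 1][ngram[:-1]]
    return counts[len(ngram)][ngram] / hist if hist else 0.0

def p_deleted_interpolation(ngram, counts, total_tokens, lambdas=(0.6, 0.3, 0.1)):
    """Interpolate trigram, bigram, and unigram ML estimates with fixed weights."""
    l3, l2, l1 = lambdas  # placeholder weights; estimate on held-out data in practice
    return (l3 * p_ml(ngram, counts, total_tokens)
            + l2 * p_ml(ngram[-2:], counts, total_tokens)
            + l1 * p_ml(ngram[-1:], counts, total_tokens))

tokens = "the cat sat on the mat".split()
counts = train(tokens)
print(p_deleted_interpolation(("on", "the", "mat"), counts, len(tokens)))  # seen trigram
print(p_deleted_interpolation(("on", "the", "cat"), counts, len(tokens)))  # unseen trigram backed off to bigram/unigram evidence
```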