
Smoothing
Bonnie Dorr, Christof Monz
CMSC 723: Introduction to Computational Linguistics, Lecture 5, October 6, 2004

The Sparse Data Problem
- Maximum likelihood estimation works fine for data that occur frequently in the training corpus.
- Problem 1: Low-frequency n-grams. If n-gram x occurs twice and n-gram y occurs once, is x really twice as likely as y?
- Problem 2: Zero counts. If n-gram y does not occur in the training data, does that mean it should have probability zero?

The Sparse Data Problem (cont.)
- Data sparseness is a serious and frequently occurring problem.
- The probability of a sequence is zero if it contains unseen n-grams.

Smoothing = Redistributing Probability Mass

Add-One Smoothing
- Simplest smoothing technique.
- For all n-grams, including unseen n-grams, add one to their counts.
- Unsmoothed probability: P(w) = c(w) / N
- Add-one probability: P+1(w) = (c(w) + 1) / (N + V)
- For bigrams:
  P(wn | wn-1) = C(wn-1 wn) / C(wn-1)
  P+1(wn | wn-1) = [C(wn-1 wn) + 1] / [C(wn-1) + V]
- Equivalently, in terms of adjusted counts:
  ci' = (ci + 1) * N / (N + V), where ci = c(wi-1, wi)
  (see the first sketch after these notes)

Add-One Smoothing: Pros and Cons
- Pro: very simple technique.
- Cons:
  - Too much probability mass is shifted towards unseen n-grams.
  - The probability of frequent n-grams is underestimated.
  - The probability of rare (or unseen) n-grams is overestimated.
  - All unseen n-grams are smoothed in the same way.
  - Using a smaller added count does not solve this problem in principle.

Witten-Bell Discounting
- Probability mass is shifted around depending on the context of words.
- If P(wi | wi-1, ..., wi-m) = 0, then the smoothed probability PWB(wi | wi-1, ..., wi-m) is higher if the sequence wi-1, ..., wi-m occurs with many different words wi.

Witten-Bell Smoothing
- Consider bigrams:
  - T(wi-1) is the number of different words (types) that occur to the right of wi-1.
  - N(wi-1) is the number of all word occurrences (tokens) to the right of wi-1.
  - Z(wi-1) is the number of bigrams starting with wi-1 that do not occur in the training data.
- If c(wi-1, wi) = 0:
  PWB(wi | wi-1) = T(wi-1) / [Z(wi-1) * (N(wi-1) + T(wi-1))]
- If c(wi-1, wi) > 0:
  PWB(wi | wi-1) = c(wi-1 wi) / (N(wi-1) + T(wi-1))
- Compared to the add-one adjusted counts ci' = (ci + 1) * N / (N + V), the Witten-Bell adjusted counts are:
  ci' = (T / Z) * N / (N + T)   if ci = 0
  ci' = ci * N / (N + T)        otherwise
  (see the second sketch after these notes)

Witten-Bell Smoothing: Properties
- Witten-Bell smoothing is more conservative when subtracting probability mass.
- Gives rather good estimates.
- Problem: if wi-1 and wi did not occur in the training data, the smoothed probability is still zero.

Backoff Smoothing
- Deleted interpolation: if the n-gram wi-n, ..., wi is not in the training data, use wi-(n-1), ..., wi.
- More generally, combine evidence from different n-grams:
  PDI(wi | wi-n, ..., wi-1) = lambda * P(wi | wi-n, ..., wi-1) + (1 - lambda) * PDI(wi | wi-(n-1), ..., wi-1)
- Lambda is the 'confidence' weight for the longer n-gram.
- Lambda parameters are computed from held-out data.
- Lambdas can be n-gram specific.
  (see the third sketch after these notes)

Other Smoothing Approaches
- Good-Turing Discounting: re-estimates the amount of probability mass for zero-count (or low-count) n-grams by looking at n-grams with higher counts.
- Kneser-Ney Smoothing: similar to Witten-Bell smoothing, but considers the number of word types preceding a word.
- Katz Backoff Smoothing: reverts to shorter n-gram contexts if the count for the current n-gram is lower than some threshold.
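
First sketch: the add-one (Laplace) bigram formula P+1(wn | wn-1) = [C(wn-1 wn) + 1] / [C(wn-1) + V] is simple enough to illustrate in a few lines of Python. This is a minimal sketch, not part of the lecture; the toy corpus, the function name add_one_bigram_prob, and the variable names are illustrative assumptions.

```python
from collections import Counter

def add_one_bigram_prob(w_prev, w, bigram_counts, unigram_counts, vocab_size):
    """Add-one estimate P+1(w | w_prev) = [C(w_prev w) + 1] / [C(w_prev) + V]."""
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + vocab_size)

# Toy corpus just to produce some counts; in practice these come from a training corpus.
tokens = "the cat sat on the mat the cat ate".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
V = len(unigram_counts)  # vocabulary size

print(add_one_bigram_prob("the", "cat", bigram_counts, unigram_counts, V))  # seen bigram: (2+1)/(3+6)
print(add_one_bigram_prob("sat", "mat", bigram_counts, unigram_counts, V))  # unseen bigram still gets mass > 0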
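
Second sketch: the Witten-Bell bigram formulas from the slides. This sketch assumes a fixed vocabulary, so Z(wi-1) is taken to be V - T(wi-1), i.e. the number of vocabulary words never seen after wi-1; the helper names witten_bell_bigram_prob and followers are hypothetical, not from the lecture.

```python
from collections import Counter, defaultdict

def witten_bell_bigram_prob(w_prev, w, bigram_counts, followers, vocab):
    """Witten-Bell estimate P_WB(w | w_prev), following the slide formulas.

    T = number of distinct word types seen after w_prev
    N = number of word tokens seen after w_prev
    Z = number of vocabulary words never seen after w_prev (assumed Z = V - T)
    """
    T = len(followers[w_prev])                                       # continuation types
    N = sum(bigram_counts[(w_prev, v)] for v in followers[w_prev])   # continuation tokens
    Z = len(vocab) - T                                               # unseen continuations
    c = bigram_counts[(w_prev, w)]
    if c > 0:
        return c / (N + T)
    return T / (Z * (N + T))

tokens = "the cat sat on the mat the cat ate".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
followers = defaultdict(set)
for a, b in zip(tokens, tokens[1:]):
    followers[a].add(b)
vocab = set(tokens)

print(witten_bell_bigram_prob("the", "cat", bigram_counts, followers, vocab))  # seen bigram
print(witten_bell_bigram_prob("the", "sat", bigram_counts, followers, vocab))  # unseen bigram, nonzero mass
```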
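
Third sketch: the deleted-interpolation formula is naturally recursive, and the sketch below bottoms out at the unigram maximum-likelihood estimate. It uses one lambda per context length for simplicity (the slides note that lambdas can be n-gram specific and should be estimated from held-out data; the values here are illustrative only, and the function name interpolated_prob is an assumption).

```python
from collections import Counter

def interpolated_prob(w, history, counts, total_tokens, lambdas):
    """Deleted interpolation, as in the slide formula:
    P_DI(w | history) = lambda * P_ML(w | history) + (1 - lambda) * P_DI(w | shorter history).
    """
    if not history:
        return counts[(w,)] / total_tokens          # unigram base case
    lam = lambdas[len(history)]                     # one lambda per context length (simplification)
    context_count = counts[history]
    p_ml = counts[history + (w,)] / context_count if context_count > 0 else 0.0
    return lam * p_ml + (1 - lam) * interpolated_prob(w, history[1:], counts, total_tokens, lambdas)

tokens = "the cat sat on the mat the cat ate".split()
counts = Counter()
for n in (1, 2, 3):                                 # unigram, bigram, and trigram counts
    counts.update(zip(*(tokens[i:] for i in range(n))))
lambdas = {1: 0.6, 2: 0.8}                          # illustrative values, not estimated from held-out data

print(interpolated_prob("cat", ("on", "the"), counts, len(tokens), lambdas))
```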

