600.465 - Intro to NLP - J. Eisner

Smoothing

There are more principled smoothing methods, too. We'll look next at log-linear models, which are a good and popular general technique. But the traditional methods are easy to implement, run fast, and will give you intuitions about what you want from a smoothing method. This dark art is why NLP is taught in the engineering school.

Never trust a sample under 30

[Histograms estimated from samples of size 20, 200, 2000, and 2000000.]
Smooth out the bumpy histograms to look more like the truth (we hope!)

Smoothing reduces variance

[Four histograms, each estimated from a different sample of size 20.]
Different samples of size 20 vary considerably (though on average, they give the correct bell curve!)
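The variance claim can be checked with a quick simulation (a minimal sketch; the three-outcome distribution and its probabilities are invented for illustration): repeated samples of size 20 give widely varying relative-frequency estimates, yet their average sits near the true probability.

```python
import random

random.seed(0)

# Hypothetical "true" distribution over three outcomes.
true_p = {"a": 0.5, "d": 0.3, "z": 0.2}
outcomes = list(true_p)
weights = [true_p[o] for o in outcomes]

def mle_estimate(sample):
    """Maximum-likelihood estimate: relative frequency in the sample."""
    return {o: sample.count(o) / len(sample) for o in outcomes}

# Many different samples of size 20 give very different estimates of p(a) ...
estimates = []
for _ in range(1000):
    sample = random.choices(outcomes, weights=weights, k=20)
    estimates.append(mle_estimate(sample)["a"])

spread = max(estimates) - min(estimates)   # high variance across samples
mean = sum(estimates) / len(estimates)     # ... but unbiased on average
print(f"mean={mean:.3f}, spread={spread:.3f}")
```

The individual size-20 estimates are all over the place, but their mean is close to the true 0.5 — exactly the bias/variance picture on the slide.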
Parameter Estimation

p(x1=h, x2=o, x3=r, x4=s, x5=e, x6=s, …)
  ≈ p(h | BOS, BOS)   4470/52108
  * p(o | BOS, h)      395/4470
  * p(r | h, o)       1417/14765
  * p(s | o, r)       1573/26412
  * p(e | r, s)       1610/12253
  * p(s | s, e)       2044/21250
  * …
These are the trigram model's parameters; the fractions are the values of those parameters, as naively estimated from the Brown corpus.

Terminology: Types vs. Tokens

Word type = distinct vocabulary item. A dictionary is a list of types (once each).
Word token = occurrence of that type. A corpus is a list of tokens (each type has many tokens).
We'll estimate probabilities of the dictionary types by counting the corpus tokens (in context).

  a      100    100 tokens of this type
  b        0      0 tokens of this type
  c        0
  d      200    200 tokens of this type
  e        0
  …
  z        0
  Total  300    26 types, 300 tokens

How to Estimate?

p(z | xy) = ?
Suppose our training data includes
  … xya …
  … xyd …
  … xyd …
but never xyz.
Should we conclude
  p(a | xy) = 1/3?
  p(d | xy) = 2/3?
  p(z | xy) = 0/3?
NO! Absence of xyz might just be bad luck.

Smoothing the Estimates

Should we conclude
  p(a | xy) = 1/3?  reduce this
  p(d | xy) = 2/3?  reduce this
  p(z | xy) = 0/3?  increase this
Discount the positive counts somewhat, and reallocate that probability to the zeroes.
Especially if the denominator is small: 1/3 is probably too high, while 100/300 is probably about right.
Especially if the numerator is small: 1/300 is probably too high, while 100/300 is probably about right.

Add-One Smoothing

  xya        1   1/3     2    2/29
  xyb        0   0/3     1    1/29
  xyc        0   0/3     1    1/29
  xyd        2   2/3     3    3/29
  xye        0   0/3     1    1/29
  …
  xyz        0   0/3     1    1/29
  Total xy   3   3/3    29   29/29

Add-One Smoothing

  xya       100   100/300   101   101/326
  xyb         0     0/300     1     1/326
  xyc         0     0/300     1     1/326
  xyd       200   200/300   201   201/326
  xye         0     0/300     1     1/326
  …
  xyz         0     0/300     1     1/326
  Total xy  300   300/300   326   326/326

300 observations instead of 3 – better data, less smoothing.
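The add-one tables can be reproduced in a few lines (a sketch using the slides' toy counts: "xya" seen once, "xyd" twice, over a 26-letter alphabet):

```python
from string import ascii_lowercase

# Toy counts from the slides: after context "xy" we saw "a" once and "d" twice.
counts = dict.fromkeys(ascii_lowercase, 0)
counts["a"] = 1
counts["d"] = 2
total = sum(counts.values())   # 3 tokens observed after "xy"
V = len(counts)                # 26 letter types

def p_add_one(z):
    """Add-one smoothed p(z | xy) = (c(xyz) + 1) / (c(xy) + V)."""
    return (counts[z] + 1) / (total + V)

# Matches the slide's table: 2/29 for a, 3/29 for d, 1/29 for unseen z.
print(p_add_one("a"), p_add_one("d"), p_add_one("z"))
```

Note that the smoothed probabilities still sum to 1 over the 26-letter vocabulary, since we added exactly V = 26 pseudo-counts to the denominator.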
Problem with Add-One Smoothing

[Same 26-letter table as above.]
We've been considering just 26 letter types …

Problem with Add-One Smoothing

Suppose we're considering 20000 word types, not 26 letters:

  see the abacus   1   1/3       2       2/20003
  see the abbot    0   0/3       1       1/20003
  see the abduct   0   0/3       1       1/20003
  see the above    2   2/3       3       3/20003
  see the Abram    0   0/3       1       1/20003
  …
  see the zygote   0   0/3       1       1/20003
  Total            3   3/3   20003   20003/20003

"Novel event" = 0-count event (never happened in training data). Here: 19998 novel events, with total estimated probability 19998/20003. So add-one smoothing thinks we are extremely likely to see novel events, rather than words we've seen in training data. It thinks this only because we have a big dictionary: 20000 possible events. Is this a good reason?

Infinite Dictionary?

In fact, aren't there infinitely many possible word types?

  see the aaaaa    1   1/3       2       2/(∞+3)
  see the aaaab    0   0/3       1       1/(∞+3)
  see the aaaac    0   0/3       1       1/(∞+3)
  see the aaaad    2   2/3       3       3/(∞+3)
  see the aaaae    0   0/3       1       1/(∞+3)
  …
  see the zzzzz    0   0/3       1       1/(∞+3)
  Total            3   3/3   (∞+3)   (∞+3)/(∞+3)

Add-Lambda Smoothing

A large dictionary makes novel events too probable. To fix: instead of adding 1 to all counts, add λ = 0.01? This gives much less probability to novel events. But how to pick the best value for λ?
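Add-λ is a one-line generalization of add-one. The sketch below (vocabulary size and counts taken from the slides) compares the total probability the two settings set aside for the 19998 novel events:

```python
def p_add_lambda(count_xyz, count_xy, vocab_size, lam):
    """Add-lambda smoothed p(z | xy) = (c(xyz) + lam) / (c(xy) + lam * V)."""
    return (count_xyz + lam) / (count_xy + lam * vocab_size)

V = 20000     # word types, as on the slides
c_xy = 3      # total count of the context "see the"

# Total probability mass given to the 19998 novel (0-count) events:
novel_mass_add_one = 19998 * p_add_lambda(0, c_xy, V, lam=1.0)    # 19998/20003
novel_mass_add_lam = 19998 * p_add_lambda(0, c_xy, V, lam=0.01)   # 199.98/203

print(novel_mass_add_one, novel_mass_add_lam)
```

With λ = 1 (add-one), the novel events get almost all the probability; with λ = 0.01 they get noticeably less, though with only 3 observed tokens the novel mass is still large — which is exactly why the choice of λ matters.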
That is, how much should we smooth? E.g., how much probability should we "set aside" for novel events? That depends on how likely novel events really are! Which may depend on the type of text, the size of the training corpus, … Can we figure it out from the data? We'll look at a few methods for deciding how much to smooth.

Setting Smoothing Parameters

How to pick the best value for λ (in add-λ smoothing)? Try many values & report the one that gets best results?
How to measure whether a particular λ gets good results? Is it fair to measure that on test data (for setting λ)?
Story: Stock scam …
Moral: Selective reporting on test data can make a method look artificially good. So it is unethical.
Rule: Test data cannot influence system development. No peeking! Use it only to evaluate the final system(s). Report all results on it.
[Figure: corpus split into Training and Test portions.]
General Rule of Experimental Ethics: Never skew anything in your favor. Applies …
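One standard way to pick λ without touching the test data is a grid search on a held-out development set, choosing the λ that gives the dev tokens the highest log-likelihood. This is a minimal sketch; the toy train/dev tokens and the λ grid are invented for illustration:

```python
import math
from collections import Counter

def p_add_lambda(z, counts, total, vocab_size, lam):
    """Add-lambda smoothed unigram probability of token z."""
    return (counts[z] + lam) / (total + lam * vocab_size)

def dev_log_likelihood(train, dev, vocab, lam):
    """Log-likelihood of held-out dev tokens under the smoothed train model."""
    counts = Counter(train)
    total = len(train)
    return sum(math.log(p_add_lambda(z, counts, total, len(vocab), lam))
               for z in dev)

# Toy data (hypothetical): train and dev tokens over a 5-type vocabulary.
vocab = list("abcdz")
train = list("aadda" * 4)   # 20 tokens; "z" never occurs in training
dev   = list("aaddz")       # dev set contains one novel "z"

# Grid search: try many lambda values, keep the one with best dev likelihood.
best_lam = max([0.001, 0.01, 0.1, 0.5, 1.0, 5.0],
               key=lambda lam: dev_log_likelihood(train, dev, vocab, lam))
print("best lambda:", best_lam)
```

Tiny λ is punished because the novel "z" in the dev set gets nearly zero probability; huge λ is punished because it flattens the well-attested counts. The dev set plays the role of a stand-in for test data, so the test set itself is never peeked at.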