Natural Language Processing
Lecture 7, 9/19/2013
Jim Martin

Today
- More language modeling (N-grams)
  - Smoothing
    - Finish Good-Turing
    - Pretty good smoothing (Bayesian prior smoothing)
- Word classes
  - Part-of-speech tagging

01/14/19 Speech and Language Processing, Jurafsky and Martin

Smoothing: Dealing w/ Zero Counts
- Back to Shakespeare
  - Recall that Shakespeare produced 300,000 bigram types out of $V^2 = 844$ million possible bigrams.
  - So 99.96% of the possible bigrams were never seen (have zero entries in the table).
  - Does that mean that any sentence that contains one of those bigrams should have a probability of 0?
  - For generation (Shannon game) it means we'll never emit those bigrams.
  - But for analysis it's problematic, because if we run across a new bigram in the future then we have no choice but to assign it a probability of zero.

Zero Counts
- Some of those zeros are really zeros.
  - Things that really aren't ever going to happen.
  - Fewer of these than you might think.
- On the other hand, some of them are just rare events.
  - If the training corpus had been a little bigger they would have had a count.
  - What would that count be, in all likelihood?

Zero Counts
- Zipf's Law (long tail phenomenon):
  - A small number of events occur with high frequency.
  - A large number of events occur with low frequency.
  - You can quickly collect statistics on the high-frequency events.
  - You might have to wait an arbitrarily long time to get good statistics on low-frequency events.
- Result:
  - Our estimates are necessarily sparse! We have no counts at all for the vast number of events we want to estimate.
- Answer:
  - Estimate the likelihood of unseen (zero count) N-grams!

Laplace Smoothing
- Also called Add-One smoothing
- Just add one to all the counts!
- Very simple
- MLE estimate: $P(w_2 \mid w_1) = \frac{C(w_1 w_2)}{C(w_1)}$
- Laplace estimate: $P_{Laplace}(w_2 \mid w_1) = \frac{C(w_1 w_2) + 1}{C(w_1) + V}$, where $V$ is the vocabulary size
- Reconstructed counts: $c^*(w_1 w_2) = \frac{(C(w_1 w_2) + 1) \cdot C(w_1)}{C(w_1) + V}$

BERP Bigram Counts
  [table]

Laplace-Smoothed Bigram Counts
  [table]

Laplace-Smoothed Bigram Probabilities
  [table]

Reconstituted Counts
  [table]

Reconstituted Counts (2)
  [table]

Big Change to the Counts!
- C(want to) went from 608 to 238!
- P(to | want) went from .66 to .26!
- Discount $d = c^*/c$
  - $d$ for "chinese food" = .10! A 10x reduction.
  - So in general, Laplace is a blunt instrument.
  - Could use a more fine-grained method (add-k).
- But Laplace smoothing is not generally used for N-grams, as we have much better methods.
- Despite its flaws, Laplace (add-k) is still used to smooth other probabilistic models in NLP, especially:
  - For pilot studies
  - In document classification
  - In information retrieval
  - In domains where the number of zeros isn't so huge

Fun with Unix
- Thanks to Ken Church, "Unix for Poets"

Better Smoothing
- Intuition used by many smoothing algorithms:
  - Good-Turing
  - Kneser-Ney
  - Witten-Bell
- Use the count of things we've seen once to help estimate the count of things we've never seen.

One Fish Two Fish
- Imagine you are fishing.
  - There are 8 species: carp, perch, whitefish, trout, salmon, eel, catfish, bass.
  - (Not sure where this fishing hole is.)
- You have caught up to now:
  - 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish
- How likely is it that the next fish to be caught is an eel?
- How likely is it that the next fish caught will be a member of a newly seen species?
- Now how likely is it that the next fish caught will be an eel?
Slide adapted from Josh Goodman

Good-Turing
- Notation: $N_x$ is the frequency of frequency $x$.
  - So $N_{10} = 1$: the number of fish species seen 10 times is 1 (carp).
  - $N_1 = 3$: the number of fish species seen once is 3 (trout, salmon, eel).
- To estimate the probability of an unseen species:
  - Use the number of species (words) we've seen once.
  - $c_0^* = c_1$, $p_0 = N_1/N = 3/18$
- All other estimates are adjusted (downward) to account for unseen probabilities:
  - $c^* = (c + 1) \frac{N_{c+1}}{N_c}$
  - $c^*(\text{eel}) = c^*(1) = (1 + 1) \times \frac{1}{3} = \frac{2}{3}$
Slide from Josh Goodman

Bigram Frequencies of Frequencies and GT Re-estimates
  [table]

Bigram Frequencies of Frequencies and GT Re-estimates
  [table]
- $c^*(3) = 4 \times \frac{381}{642} = 4 \times 0.593 = 2.37$

GT Smoothed Bigram Probabilities
  [table]

GT Complications
- In practice, assume large counts ($c > k$ for some $k$) are reliable.
- Also: we need all the $N_k$ to be non-zero, so we need to smooth (interpolate) the $N_k$ counts before computing $c^*$ from them.

Pretty Good Smoothing
- Maximum Likelihood Estimation: $P(w_2 \mid w_1) = \frac{C(w_1 w_2)}{C(w_1)}$
- Laplace Smoothing: $P_{Laplace}(w_2 \mid w_1) = \frac{C(w_1 w_2) + 1}{C(w_1) + V}$
- Bayesian prior Smoothing: $P_{Prior}(w_2 \mid w_1) = \frac{C(w_1 w_2) + P(w_2)}{C(w_1) + 1}$

Pretty Good Smoothing
- Bayesian prior smoothing: $P_{Prior}(w_2 \mid w_1) = \frac{C(w_1 w_2) + P(w_2)}{C(w_1) + 1}$
- Why is there a 1 here (in the denominator)?

Toolkits
- With FSAs/FSTs
  - Openfst.org
- For language modeling
  - SRILM: SRI Language Modeling Toolkit
  - All the bells and whistles you can imagine

Break
- HW questions?

Break
- Quiz is Thursday, Oct 3
  - Chapters 1 to 6
  - I'll post specific readings when enough people remind/nag me.

Back to Some Linguistics

Word Classes: Parts of Speech
- 8 (ish) traditional parts of speech
  - Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.
  - Also known as: parts of speech, lexical categories, word classes, morphological classes, lexical tags
  - Lots of debate within the linguistics and cognitive science community about the number, nature, and universality of these
  - We'll completely ignore this debate.

POS examples
- N (noun): chair, bandwidth, pacing
- V (verb): study, debate, munch
- ADJ (adjective): purple, tall
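
Returning to the smoothing half of the lecture for a moment: the One Fish Two Fish numbers are small enough to check by script. What follows is only a minimal sketch in Python. It assumes the usual Good-Turing adjusted count c* = (c + 1) N_{c+1} / N_c and, for add-one, treats the 8 species as the vocabulary. The function names (mle, laplace, gt_unseen_mass, gt_count, gt_prob) are made up for illustration and do not come from the slides or from any toolkit.

```python
from collections import Counter
from fractions import Fraction

# Catch so far, copied from the One Fish Two Fish slide.
catch = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
SPECIES = 8  # carp, perch, whitefish, trout, salmon, eel, catfish, bass

N = sum(catch.values())        # 18 fish in total
N_c = Counter(catch.values())  # frequency of frequencies: N_c[1] == 3, N_c[10] == 1


def mle(species):
    """Unsmoothed relative frequency c / N (zero for unseen species)."""
    return Fraction(catch.get(species, 0), N)


def laplace(species):
    """Add-one estimate (c + 1) / (N + V), assuming V = number of species."""
    return Fraction(catch.get(species, 0) + 1, N + SPECIES)


def gt_unseen_mass():
    """Good-Turing mass reserved for all unseen species together: N_1 / N."""
    return Fraction(N_c[1], N)


def gt_count(c):
    """Good-Turing adjusted count c* = (c + 1) * N_{c+1} / N_c.

    Raw N_c values are used here, so this only works when N_c and N_{c+1}
    are both non-zero (the issue raised under GT Complications).
    """
    if N_c[c] == 0 or N_c[c + 1] == 0:
        raise ValueError(f"N_{c} or N_{c + 1} is zero; smooth the N_c counts first")
    return Fraction((c + 1) * N_c[c + 1], N_c[c])


def gt_prob(species):
    """Good-Turing probability c* / N for a species that has been seen."""
    return gt_count(catch[species]) / N


if __name__ == "__main__":
    print("MLE      P(eel)     =", mle("eel"))          # 1/18
    print("Laplace  P(eel)     =", laplace("eel"))      # 1/13
    print("Laplace  P(catfish) =", laplace("catfish"))  # 1/26
    print("GT       P(unseen)  =", gt_unseen_mass())    # 1/6
    print("GT       c*(1)      =", gt_count(1))         # 2/3
    print("GT       P(eel)     =", gt_prob("eel"))      # 1/27
```

Under these assumptions the eel estimate drops from 1/18 to 1/27 once 3/18 of the mass is set aside for species not yet seen. Calling gt_count(3) raises an error because no species was caught exactly 4 times, which is the zero N_k problem that motivates smoothing the N_k counts first.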