CSCI 5832 Natural Language Processing Jim Martin Lecture 8 2 7 08 1 Today 2 7 Finish remaining LM issues Smoothing Backoff and Interpolation Parts of Speech POS Tagging HMMs and Viterbi 2 2 7 08 Laplace smoothing Also called add one smoothing Just add one to all the counts Very simple MLE estimate Laplace estimate Reconstructed counts 3 2 7 08 1 Laplace smoothed bigram counts 4 2 7 08 Laplace smoothed bigrams 5 2 7 08 Reconstituted counts 6 2 7 08 2 Big Changes to Counts C count to went from 608 to 238 P to want from 66 to 26 Discount d c c d for chinese food 10 A 10x reduction So in general Laplace is a blunt instrument Could use more fine grained method add k Despite its flaws Laplace add k is however still used to smooth other probabilistic models in NLP especially For pilot studies in domains where the number of zeros isn t so huge 7 2 7 08 Better Discounting Methods Intuition used by many smoothing algorithms Good Turing Kneser Ney Witten Bell Is to use the count of things we ve seen once to help estimate the count of things we ve never seen 8 2 7 08 Good Turing Imagine you are fishing There are 8 species carp perch whitefish trout salmon eel catfish bass You have caught 10 carp 3 perch 2 whitefish 1 trout 1 salmon 1 eel 18 fish tokens 6 species types How likely is it that you ll next see another trout 9 2 7 08 3 Good Turing Now how likely is it that next species is new i e catfish or bass There were 18 distinct events 3 of those represent singleton species 3 18 10 2 7 08 Good Turing But that 3 18s isn t represented in our probability mass Certainly not the one we used for estimating another trout 11 2 7 08 Good Turing Intuition Notation Nx is the frequency of frequency x So N10 1 N1 3 etc To estimate total number of unseen species Use number of species words we ve seen once c0 c1 p0 N1 N All other estimates are adjusted down to give probabilities for unseen 2 7 08 Slide from Josh Goodman 12 4 Good Turing Intuition Notation Nx is the frequency of frequency x So N10 1 N1 3 etc To estimate total number of unseen species Use number of species words we ve seen once c0 c1 p0 N1 N p 0 N1 N 3 18 All other estimates are adjusted down to give probabilities for unseen P eel c 1 1 1 1 3 2 3 2 7 08 Slide from Josh Goodman 13 Bigram frequencies of frequencies and GT re estimates 14 2 7 08 GT smoothed bigram probs 15 2 7 08 5 Backoff and Interpolation Another really useful source of knowledge If we are estimating trigram p z xy but c xyz is zero Use info from Bigram p z y Or even Unigram p z How to combine the trigram bigram unigram info 16 2 7 08 Backoff versus interpolation Backoff use trigram if you have it otherwise bigram otherwise unigram Interpolation mix all three 17 2 7 08 Interpolation Simple interpolation Lambdas conditional on context 18 2 7 08 6 How to set the lambdas Use a held out corpus Choose lambdas which maximize the probability of some held out data I e fix the N gram probabilities Then search for lambda values That when plugged into previous equation Give largest probability for held out set Can use EM to do this search 19 2 7 08 Practical Issues We do everything in log space Avoid underflow also adding is faster than multiplying 20 2 7 08 Language Modeling Toolkits SRILM CMU Cambridge LM Toolkit 21 2 7 08 7 Google N Gram Release 22 2 7 08 Google N Gram Release serve serve serve serve serve serve serve serve serve serve as as as as as as as as as as the the the the the the the the the the incoming 92 incubator 99 independent 794 index 223 indication 72 indicator 120 indicators 45 indispensable 111 indispensible 40 individual 234 23 2 7 08 LM Summary Probability Basic probability Conditional probability Bayes Rule Language Modeling N grams N gram Intro The Chain Rule Perplexity Smoothing Add 1 Good Turing 24 2 7 08 8 Break Moving quiz to Thursday 2 14 Readings Chapter 2 All Chapter 3 Skip 3 4 1 and 3 12 Chapter 4 Skip 4 7 4 9 4 10 and 4 11 Chapter 5 Read 5 1 through 5 5 25 2 7 08 Outline Probability Part of speech tagging Parts of speech Tag sets Rule based tagging Statistical tagging Simple most frequent tag baseline Important Ideas Training sets and test sets Unknown words Error analysis HMM tagging 26 2 7 08 Part of Speech tagging Part of speech tagging Parts of speech What s POS tagging good for anyhow Tag sets Rule based tagging Statistical tagging Simple most frequent tag baseline Important Ideas Training sets and test sets Unknown words HMM tagging 27 2 7 08 9 Parts of Speech 8 ish traditional parts of speech Noun verb adjective preposition adverb article interjection pronoun conjunction etc Called parts of speech lexical category word classes morphological classes lexical tags POS Lots of debate in linguistics about the number nature and universality of these We ll completely ignore this debate 28 2 7 08 POS examples N V ADJ ADV P PRO DET noun chair bandwidth pacing verb study debate munch adjective purple tall ridiculous adverb unfortunately slowly preposition of by to pronoun I me mine determiner the a that those 29 2 7 08 POS Tagging Definition The process of assigning a part of speech or lexical class marker to each word in a corpus WORDS TAGS the koala put the keys on the table N V P DET 30 2 7 08 10 POS Tagging example WORD tag the koala put the keys on the table DET N V DET N P DET N 31 2 7 08 What is POS tagging good for First step of a vast number of practical tasks Speech synthesis How to pronounce lead INsult inSULT OBject obJECT OVERflow overFLOW DIScount disCOUNT CONtent conTENT Parsing Need to know if a word is an N or V before you can parse Information extraction Finding names relations etc Machine Translation 32 2 7 08 Open and Closed Classes Closed class a relatively fixed membership Prepositions of in by Auxiliaries may can will had been Pronouns I you she mine his them Usually function words short common words which play a role in grammar Open class new ones can be created all the time English has 4 Nouns Verbs Adjectives Adverbs Many languages have these 4 but not all 33 2 7 08 11 Open class words Nouns Proper nouns Boulder Granby Eli Manning English capitalizes these Common nouns the rest Count nouns and mass nouns Count have plurals get counted goat goats one goat two goats Mass don t get counted snow salt …
View Full Document
Unlocking...