CSCI 5832 Natural Language Processing
Lecture 8
Jim Martin
01/14/19 CSCI 5832 Spring 2007

Today 2/8
- Review N-Grams
- Entropy
- Models
- Parts of Speech and Tagging

N-Gram Models
Assigning probabilities to sequences by:
- Using the chain rule to decompose the problem
- Making some conditional independence assumptions to simplify things
- Using smoothing and backoff to massage the counts into something that works

Back to Fish
That 3/18 is the main thing to remember about Good-Turing. It's the probability mass we're reserving for the zero counts.

Good-Turing
But the basic Good-Turing approach is pretty broken when it comes to:
- The other (bigger) buckets
- How to redistribute the mass among the zero counts

Katz Backoff: Trigram Case

What Makes a Good Model?
Two answers:
- Models that make your end application run better (in vivo evaluation)
- Models that predict well the nature of unseen representative texts

Information Theory
Who is going to win the World Series next year?
- Well, there are 30 teams. Each has a chance, so there's a 1/30 chance for any team...? No.
- Rockies? Big surprise, lots of information.
- Yankees? No surprise, not much information.

Information Theory
- How much uncertainty is there when you don't know the outcome of some event (the answer to some question)?
- How much information is to be gained by knowing the outcome of some event (the answer to some question)?

Information Theory
This stuff is usually explained either in terms of betting or in terms of communication codes (the number of bits needed to communicate messages on average). Neither of which is terribly illuminating for language applications.

Aside on logs
- Base doesn't matter. Unless I say otherwise, I mean base 2.
- Probabilities lie between 0 and 1, so log probabilities are negative and range from 0 (log 1) to -infinity (log 0).
- The minus sign is a pain, so at some point we'll make it go away by multiplying by -1.

Entropy
Let's start with a simple case: the probability of word sequences with a unigram model.
Example: S = "One fish two fish red fish blue fish"
\[ P(S) = P(\text{One})\,P(\text{fish})\,P(\text{two})\,P(\text{fish})\,P(\text{red})\,P(\text{fish})\,P(\text{blue})\,P(\text{fish}) \]
\[ \log P(S) = \log P(\text{One}) + \log P(\text{fish}) + \dots + \log P(\text{fish}) \]

Entropy cont.
In general that's
\[ \log P(S) = \sum_{w \in S} \log P(w) \]
But note that the order doesn't matter, that words can occur multiple times, and that they always contribute the same amount each time. So, rearranging:
\[ \log P(S) = \sum_{w \in V} \mathrm{Count}(w) \log P(w) \]

Entropy cont.
One fish two fish red fish blue fish
Fish fish fish fish one two red blue
\[ \log P(S) = 4 \log P(\text{fish}) + 1 \log P(\text{one}) + 1 \log P(\text{two}) + 1 \log P(\text{red}) + 1 \log P(\text{blue}) \]

Entropy cont.
Now let's divide both sides by N, the length of the sequence:
\[ \frac{1}{N} \log P(S) = \frac{1}{N} \sum_{w \in V} \mathrm{Count}(w) \log P(w) \]
That's basically a per-word average of the log probabilities.

Entropy
Now assume the sequence is really, really long. Moving the N into the summation you get
\[ \frac{1}{N} \log P(S) = \sum_{w \in V} \frac{\mathrm{Count}(w)}{N} \log P(w) \]
For a long sequence, Count(w)/N approaches P(w). Rewriting and getting rid of the minus sign:
\[ H(S) = -\sum_{w \in V} P(w) \log P(w) \]

Entropy
Think about this in terms of uncertainty or surprise. The more likely a sequence is, the lower the entropy. Why?
\[ H(S) = -\sum_{w \in V} P(w) \log P(w) \]

Entropy
Note that the sum is over the types of the elements of the model being used (unigrams, bigrams, trigrams, etc.), not over the words in the sequence.

Model Evaluation
Remember, the name of the game is to come up with statistical models that capture something useful in some body of text or speech. There are precisely a gazillion ways to do this:
- N-grams of various sizes
- Smoothing
- Backoff

Model Evaluation
Given a collection of text and a couple of models, how can we tell which model is best?
- Intuition: the model that assigns the highest probability (lowest entropy) to a set of withheld text.
- Withheld text? Text drawn from the same distribution (corpus), but not used in the creation of the model being evaluated.

Model Evaluation
- The more you're surprised at some event that actually happens, the worse your model was.
- We want models that minimize your surprise at observed outcomes.
- Given two models, some training data, and some withheld test data, which is better? The model where you're not surprised to see the test data.

Break
- Quiz is Thursday
- Next HW details coming soon
- Shifting to Chapter 5

Parts of Speech
- Start with eight basic categories: noun, verb, pronoun, preposition, adjective, adverb, article, conjunction.
- These categories are based on morphological and distributional properties (not semantics).
- Some cases are easy, others are murky.

Parts of Speech
What are some possible parts of speech for "building"?

Parts of Speech
- A quarantine in the Boca Raton building contaminated by deadly anthrax is set to be lifted.
- Dialogue is one of the powerful tools to building an understanding across differences and thereby leading to negotiation. It is an easy way to recognise
- The building project, which would be spread out over five years with schools most in need getting work first, would cost taxpayers with a
- Building for Independence, as its name indicates, demonstrates exactly what Canada's New Government is doing to support Canadians who are homeless or at
- The last time house building reached such high levels was in 1989, when over 191,800 new homes were built.
- State lawmakers are considering building up a trust fund for schools so it will earn more money in the coming decades.

Tagging
State/NN lawmakers/NNS are/VBP considering/VBG building/VBG up/RP a/DT trust/NN fund/NN for/IN schools/NNS so/IN it/PRP will/MD earn/VB more/JJR money/NN in/IN the/DT coming/VBG decades/NNS

Parts of …
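
The unigram entropy calculation and the withheld-text comparison from the Entropy and Model Evaluation slides can be sketched in a few lines of Python. This is only an illustration, not code from the course: the function names (unigram_mle, log2_prob, per_word_entropy, entropy), the uniform comparison model, and the tiny held-out sentence are all assumptions made for the example. The models are unsmoothed, so the held-out text is chosen to contain only words seen in training.

```python
import math
from collections import Counter

def unigram_mle(tokens):
    """Unsmoothed maximum-likelihood unigram model: P(w) = Count(w) / N."""
    counts = Counter(tokens)
    n = len(tokens)
    return {w: c / n for w, c in counts.items()}

def log2_prob(tokens, model):
    """log2 P(S), computed as the sum over types of Count(w) * log2 P(w)."""
    counts = Counter(tokens)
    return sum(c * math.log2(model[w]) for w, c in counts.items())

def per_word_entropy(tokens, model):
    """-(1/N) * log2 P(S): average surprise per word, in bits."""
    return -log2_prob(tokens, model) / len(tokens)

def entropy(model):
    """H = -sum over types of P(w) * log2 P(w)."""
    return -sum(p * math.log2(p) for p in model.values())

# The example sequence from the slides (lowercased for simplicity).
tokens = "one fish two fish red fish blue fish".split()
model = unigram_mle(tokens)   # P(fish) = 4/8, every other word = 1/8

total_log_prob = log2_prob(tokens, model)   # 4*log2(1/2) + 4*log2(1/8) = -16.0
avg_bits = per_word_entropy(tokens, model)  # 16 / 8 = 2.0 bits per word
model_entropy = entropy(model)              # 2.0 bits

# Model evaluation on withheld text (hypothetical): compare the MLE model
# with a uniform model over the same five word types. Lower per-word
# entropy means the model was less surprised by the withheld text.
uniform = {w: 1 / len(model) for w in model}
held_out = "red fish blue fish".split()
mle_bits = per_word_entropy(held_out, model)        # (3 + 1 + 3 + 1) / 4 = 2.0
uniform_bits = per_word_entropy(held_out, uniform)  # log2(5), about 2.32

print(total_log_prob, avg_bits, model_entropy)
print(mle_bits, uniform_bits)
```

Because the model here is estimated from the same sequence it scores, Count(w)/N equals P(w) exactly, so the per-word average and the entropy formula agree at 2.0 bits. On the withheld sentence the MLE model scores lower than the uniform one, which is the sense in which it is the better model of this toy data. A realistic setup would need smoothing to handle withheld words with zero training counts.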