CSE 842 Natural Language Processing
Lecture 4: Part-of-Speech Tagging and HMM
1/24/2011, CSE 842, Spring 2011, MSU

Reminder
Homework 1 will be posted under ~cse842/Homework/hw1 later today.
– Due: Feb. 7
– Written part due at the beginning of class
– Programming part due at 11:59pm on the due date via "handin"

Part of Speech Tagging
• What is POS tagging?
• Associate with each word a lexical tag
  – 45 classes from the Penn Treebank
  – 87 classes from the Brown Corpus
  – 146 classes from the C7 tagset (CLAWS system)

Why is POS Tagging Useful?
• It is a first step in many practical tasks.
• Speech synthesis
  – How to pronounce "lead"?
  – INsult vs. inSULT
  – OBject vs. obJECT
  – OVERflow vs. overFLOW
  – DIScount vs. disCOUNT
  – CONtent vs. conTENT
• Parsing
  – Need to know whether a word is an N or a V before you can parse
• Information extraction
  – Finding names, relations, etc.

POS Tagging
We have:
• A set of tags (tagset)
• A dictionary with all possible tags for each word
• Input text
POS tagging is the task of automatically assigning a sequence of tags to the input text, e.g.,
The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.

Tagging Methods
• Rule-based tagging
• Transformation-based tagging
• Statistical tagging based on a Hidden Markov Model

Statistical Tagging
• Using an HMM to do POS tagging is a special case of Bayesian inference
• It is also related to the "noisy channel" model that is the basis for automatic speech recognition (ASR), optical character recognition (OCR), and machine translation (MT)

HMM Tagger
• HMM tagger: choose the most probable sequence of tags for each sentence
  – Given the sentence W = w1, w2, …, wn
  – Compute the most probable sequence of tags T = t1, t2, …, tn, i.e., the one that maximizes
    argmax P(t1, t2, …, tn | w1, w2, …, wn)
• Apply Bayes' law to rewrite it as:
    P(t1, …, tn | w1, …, wn) = P(w1, …, wn | t1, …, tn) P(t1, …, tn) / P(w1, …, wn)

HMM Tagger (Cont.)
• Since P(w1, …, wn) is the same for every candidate tag sequence, it becomes finding t1, …, tn that maximizes:
    P(w1, …, wn | t1, …, tn) P(t1, …, tn)
• Approximation 1: conditional independence assumption
    P(w1, …, wn | t1, …, tn) ≈ ∏i=1..n P(wi | ti)
• Approximation 2: bigram
    P(t1, …, tn) ≈ ∏i=1..n P(ti | ti-1)
• Now it becomes finding t1, …, tn that maximizes:
    ∏i=1..n P(ti | ti-1) * P(wi | ti)

An Example
Example: suppose wi = race. Is it a verb (VB) or a noun (NN)?
Assume that some other mechanism has already assigned the best tags to the preceding words.
1) Secretariat/NNP is/VBZ expected/VBN to/TO race/? tomorrow
2) People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/? for outer space
Using the bigram model, simplify the problem to:
  to/TO race/???
  the/DT race/???
  ti = argmax_tj P(tj | ti-1) P(wi | tj)
so the two hypotheses to compare are:
  P(VB | TO) P(race | VB)
  P(NN | TO) P(race | NN)
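To make this scoring concrete, here is a minimal Python sketch of the bigram computation above. The two small dictionaries hold only the corpus estimates quoted on the next slide; the function name and start-tag convention are illustrative rather than taken from the lecture.

# Score a candidate tagging as prod_i P(t_i | t_{i-1}) * P(w_i | t_i).
# The tables below contain only the estimates quoted on the next slide.
TRANSITION = {("TO", "VB"): 0.34, ("TO", "NN"): 0.021}         # P(tag | previous tag)
EMISSION = {("race", "VB"): 0.00003, ("race", "NN"): 0.00041}  # P(word | tag)

def sequence_score(words, tags, prev="<s>"):
    """Bigram HMM score of one candidate tag sequence for the given words."""
    score = 1.0
    for word, tag in zip(words, tags):
        score *= TRANSITION.get((prev, tag), 0.0)  # unseen transitions scored as 0 here
        score *= EMISSION.get((word, tag), 0.0)
        prev = tag
    return score

# Disambiguate "race" when the previous tag is TO:
print(sequence_score(["race"], ["VB"], prev="TO"))  # P(VB | TO) * P(race | VB)
print(sequence_score(["race"], ["NN"], prev="TO"))  # P(NN | TO) * P(race | NN)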
Where is the Data?
Look at the Brown and Switchboard corpora:
  P(NN | TO) = 0.021
  P(VB | TO) = 0.34
If we are expecting a verb (or a noun), how likely is it to be "race"?
  P(race | NN) = 0.00041
  P(race | VB) = 0.00003
Finally:
  P(NN | TO) P(race | NN) = 0.000007
  P(VB | TO) P(race | VB) = 0.00001

POS Tagging as Sequence Classification
• We are given a sentence (an "observation" or "sequence of observations")
  – Secretariat is expected to race tomorrow
• What is the best sequence of tags that corresponds to this sequence of observations?
• Probabilistic view:
  – Consider all possible sequences of tags
  – Out of this universe of sequences, choose the tag sequence that is most probable given the observation sequence of n words w1 … wn

An Example
Given:
  Word sequence w1, …, wN
  POS tags t1, …, tM, with ti ∈ {V, N, P, ART}
Find:
  The most likely sequence of POS tags T1, …, TN for the word sequence, i.e., the one that maximizes P(w1, …, wN | T1, …, TN) P(T1, …, TN)
Example: flies like a flower

Example: Bigram of Tags from a Corpus
(∅ denotes the start of a sentence.)

  Category ti   Count of ti   Bigram ti, ti+1   Count of bigram   Estimate
  ∅             300           ∅, ART            213               P(ART | ∅) = 0.71
  ∅             300           ∅, N              87                P(N | ∅) = 0.29
  ART           558           ART, N            558               P(N | ART) = 1
  N             833           N, V              358               P(V | N) = 0.43
  N             833           N, N              108               P(N | N) = 0.13
  N             833           N, P              366               P(P | N) = 0.44
  V             300           V, N              75                P(N | V) = 0.35
  V             300           V, ART            194               P(ART | V) = 0.65
  P             307           P, ART            226               P(ART | P) = 0.74
  P             307           P, N              81                P(N | P) = 0.26

A Markov Chain
[Figure: a Markov chain over the states ∅, ART, N, V, and P, with arcs labeled by the transition probabilities from the table above (0.71, 0.29, 1, 0.43, 0.13, 0.44, 0.35, 0.65, 0.74, 0.26).]

Sparse Data
What about bigrams that are not seen in the data?
In practice, we should use smoothing (to be introduced next week). In this example, for illustration purposes, we use a simplification: any bigram not listed in the table is assumed to have a transition probability of 0.0001.

Word Counts

  Word      N     V     ART   P     Total
  flies     21    23    0     0     44
  fruit     49    5     1     0     55
  like      10    30    0     21    61
  a         1     0     201   0     202
  the       1     0     300   2     303
  flower    53    15    0     0     68
  flowers   42    16    0     0     58
  bird      64    1     0     0     65
  others    592   210   56    284   1142
  Total     833   300   558   307   1998

P(the | ART) = ?

The Emission Probability
Some examples:
  P(the | ART) = 300/558 = 0.54
  P(flies | N) = 0.025
  P(flies | V) = 0.076
  P(like | V) = 0.1
  P(like | P) = 0.068
  P(like | N) = 0.012
  P(a | ART) = 0.360
  P(a | N) = 0.001
  P(flower | N) = 0.063
  P(flower | V) = 0.05
  P(bird | N) = 0.076

Viterbi Algorithm
Given:
  Word sequence w1, …, wN
  Set of POS tags t1, …, tM
  Emission probabilities P(wi | tj)
  Transition probabilities P(ti | tj)
Find:
  The most likely sequence of POS tags T1, …, TN for the word sequence

Viterbi Algorithm
Initialization step:
  For i = 1 to M do
    VSCORE(i, 1) = P(w1 | ti) * P(ti | ∅)
    BACKPTR(i, 1) = 0
Iteration step:
  For j = 2 to N
    For i = 1 to M
      VSCORE(i, j) = Max_{k=1..M} [ VSCORE(k, j-1) * P(ti | tk) ] * P(wj | ti)
      BACKPTR(i, j) = the index k that gave the max above
Sequence identification step:
  C(N) = the i that maximizes VSCORE(i, N)
  For j = N-1 down to 1 do
    C(j) = BACKPTR(C(j+1), j+1)

Viterbi Algorithm - Example
Iteration 1 (first word, "flies", starting from NULL/∅):
[Trellis figure; node scores:]
  flies/N = 0.00725
  flies/V = 7.6 * 10^-6
  flies/ART = 0
  flies/P = 0

Viterbi Algorithm - Example
Iteration 2:
[Trellis figure: arcs from flies/N and flies/V to like/N, like/V, like/ART, and like/P.]

Viterbi Algorithm - Example
Iteration 2 node scores:
  like/V = 0.00031
  like/P = 0.00022
  like/N = 1.13 * 10^-5
  like/ART = 0
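The whole example can be put together in a short Viterbi sketch. The Python below is an illustrative implementation, not the course's reference code: it plugs in the transition estimates from the bigram table, the emission probabilities listed above, and the 0.0001 floor for unseen transitions; names such as viterbi, TRANS, and EMIT are made up for the sketch.

TAGS = ["N", "V", "ART", "P"]
UNSEEN = 0.0001   # floor for transition bigrams missing from the table

# P(tag | previous tag); "0" stands for the sentence-start state (the ∅ above).
TRANS = {
    ("0", "ART"): 0.71, ("0", "N"): 0.29,
    ("ART", "N"): 1.0,
    ("N", "V"): 0.43, ("N", "N"): 0.13, ("N", "P"): 0.44,
    ("V", "N"): 0.35, ("V", "ART"): 0.65,
    ("P", "ART"): 0.74, ("P", "N"): 0.26,
}

# P(word | tag), from the emission-probability slide.
EMIT = {
    ("flies", "N"): 0.025, ("flies", "V"): 0.076,
    ("like", "V"): 0.1, ("like", "P"): 0.068, ("like", "N"): 0.012,
    ("a", "ART"): 0.36, ("a", "N"): 0.001,
    ("flower", "N"): 0.063, ("flower", "V"): 0.05,
}

def trans(prev, tag):
    return TRANS.get((prev, tag), UNSEEN)

def emit(word, tag):
    return EMIT.get((word, tag), 0.0)  # words never seen with a tag get probability 0

def viterbi(words):
    """Return (best tag sequence, trellis scores) under the bigram HMM above."""
    # Initialization: VSCORE(i, 1) = P(w1 | ti) * P(ti | start)
    vscore = [{t: emit(words[0], t) * trans("0", t) for t in TAGS}]
    backptr = [{t: None for t in TAGS}]
    # Iteration: VSCORE(i, j) = max_k [ VSCORE(k, j-1) * P(ti | tk) ] * P(wj | ti)
    for word in words[1:]:
        col, back = {}, {}
        for t in TAGS:
            best_prev = max(TAGS, key=lambda k: vscore[-1][k] * trans(k, t))
            col[t] = vscore[-1][best_prev] * trans(best_prev, t) * emit(word, t)
            back[t] = best_prev
        vscore.append(col)
        backptr.append(back)
    # Sequence identification: pick the best final tag, then follow back-pointers.
    tags = [max(TAGS, key=lambda t: vscore[-1][t])]
    for j in range(len(words) - 1, 0, -1):
        tags.append(backptr[j][tags[-1]])
    return list(reversed(tags)), vscore

best_tags, scores = viterbi("flies like a flower".split())
print(best_tags)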
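With these tables, the first two trellis columns produced by the sketch should reproduce (up to rounding) the Iteration 1 and Iteration 2 scores shown above, and the back-trace should come out as flies/N like/V a/ART flower/N:

# Inspect the first two trellis columns computed by the sketch above;
# after rounding they should match the scores on the two example slides.
for word, column in zip(["flies", "like"], scores[:2]):
    print(word, column)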