Modeling Sequence Data CS4780 5780 Machine Learning Fall 2011 Thorsten Joachims Cornell University Reading Manning Schuetze Sections 9 1 9 3 except 9 3 1 Leeds Online HMM Tutorial except Forward and Forward Backward Algorithm http www comp leeds ac uk roger HiddenMarkovModels html dev main html Outline Markov Models in Classification A less na ve Bayes for text classification Hidden Markov Models Part of speech tagging Viterbi Algorithm Estimation with fully observed training data Less Na ve Bayes Classifier Example Classify sentences as insulting not insulting Assumption l words in document Decision Rule Markov Model Definition Set of States s1 sk Start probabilities P S s Transition probabilities P S s Sprev s Random walk on graph Start in state s with probability P S s Move to next state with probability P S s Sprev s Assumptions Limited dependence Next state depends only on previous state but no other state i e first order Markov model Stationary P S s Sprev s does not change Part of Speech Tagging Task Assign the correct part of speech word class to each word in a document The DT planet NN Jupiter NNP and CC its PRP moons NNS are VBP in IN effect NN a DT mini solar JJ system NN and CC Jupiter NNP itself PRP is VBZ often RB called VBN a DT star NN that IN never RB caught VBN fire NN Needed as an initial processing step for a number of language technology applications Information extraction Answer extraction in QA Base step in identifying syntactic phrases for IR systems Critical for word sense disambiguation WordNet apps Why is POS Tagging Hard Ambiguity He will race VB the car When will the race NN end I bank VB at CFCU Go to the bank NN Average of 2 parts of speech for each word The number of tags used by different systems varies a lot Some systems use 20 tags while others use 400 The POS Learning Problem Example Hidden Markov Model for POS Tagging States Think about as nodes of a graph One for each POS tag special start state and maybe end state Transitions Think about as directed edges in a graph Edges have transition probabilities Output Each state also produces a word of the sequence Sentence is generated by a walk through the graph Hidden Markov Model States y s1 sk Outputs symbols x o1 om Starting probability P Y1 y1 Specifies where the sequence starts Transition probability P Y yi Yi 1 yi 1 Probability that one states succeeds another Output Emission probability P Xi xi Yi yi Probability that word is generated in this state Every output state sequence has a probability Estimating the Probabilities Given Fully observed data Pairs of output sequence with their state sequence Estimating transition probabilities P Yt Yt 1 Estimating emission probabilities P Xt Yt Smoothing the estimates Laplace smoothing uniform prior See na ve Bayes for text classification Partially observed data Expectation Maximization EM Viterbi Example P X x Y y I bank at CFCU go to the DET 0 01 0 01 0 01 0 01 0 01 0 01 0 94 PRP 0 94 0 01 0 01 0 01 0 01 0 01 0 01 N 0 01 0 4 0 01 0 4 0 16 0 01 0 01 PREP 0 01 0 01 0 48 0 01 0 01 0 47 0 01 V 0 01 0 4 0 01 0 01 0 55 0 01 0 01 P Y y P Y Yprev DET PRP N PREP V DET 0 3 DET 0 01 0 01 0 96 0 01 0 01 PRP 0 3 PRP 0 01 0 01 0 01 0 2 0 77 N 0 1 N 0 01 0 2 0 3 0 3 0 19 PREP 0 1 PREP 0 3 0 2 0 3 0 19 0 01 V 0 2 V 0 2 0 19 0 3 0 3 0 01 HMM Decoding Viterbi Algorithm Question What is the most likely state sequence given an output sequence Given fully specified HMM P Y1 y1 P Yi yi Yi 1 yi 1 P Xi xi Yi yi Find Viterbi algorithm has runtime linear in length of sequence Example find the most likely tag sequence for a given sequence of words HMM s for POS Tagging Design HMM structure vanilla States one state per POS tag Transitions fully connected Emissions all words observed in training corpus Estimate probabilities Use corpus e g Treebank Smoothing Unseen words Tagging new sentences Use Viterbi to find most likely tag sequence Experimental Results Tagger Accuracy Training time Prediction time HMM 96 80 20 sec 18 000 words s TBL Rules 96 47 9 days 750 words s Experiment setup WSJ Corpus Trigram HMM model Lexicalized from Pla and Molina 2001 Discriminative vs Generative Bayes Rule Generative Make assumptions about Estimate parameters of the two distributions Discriminative Define set of prediction rules i e hypotheses H Find h in H that best approximates Question Can we train HMM s discriminately Later in semester discriminative training of HMM and general structured prediction
View Full Document
Unlocking...