Hidden Markov Models
Bonnie Dorr, Christof Monz
CMSC 723: Introduction to Computational Linguistics
Lecture 5, October 6, 2004

Hidden Markov Model (HMM)
- HMMs allow you to estimate the probabilities of unobserved events.
- Given plain text, which underlying parameters generated the surface?
- E.g., in speech recognition, the observed data is the acoustic signal and the words are the hidden parameters.

HMMs and their Usage
- HMMs are very common in Computational Linguistics:
  - Speech recognition (observed: acoustic signal, hidden: words)
  - Handwriting recognition (observed: image, hidden: words)
  - Part-of-speech tagging (observed: words, hidden: part-of-speech tags)
  - Machine translation (observed: foreign words, hidden: words in the target language)

Noisy Channel Model
- In speech recognition you observe an acoustic signal (A = a_1, ..., a_n) and you want to determine the most likely sequence of words (W = w_1, ..., w_n): P(W | A).
- Problem: A and W are too specific for reliable counts on observed data, and are very unlikely to occur in unseen data.

Noisy Channel Model
- Assume that the acoustic signal (A) is already segmented with respect to word boundaries.
- P(W | A) could then be computed as
    P(W \mid A) = \max \prod_i P(w_i \mid a_i)
- Problem: finding the most likely word corresponding to an acoustic representation depends on the context.
- E.g., /'pre-z&ns/ could mean "presents" or "presence" depending on the context!

Noisy Channel Model
- Given a candidate sequence W, we need to compute P(W) and combine it with P(W | A).
- Applying Bayes' rule:
    \operatorname{argmax}_W P(W \mid A) = \operatorname{argmax}_W \frac{P(A \mid W)\, P(W)}{P(A)}
- The denominator P(A) can be dropped, because it is constant for all W.

Noisy Channel in a Picture
[figure: the noisy channel model as a diagram]

Decoding
- The decoder combines evidence from:
- The prior P(W), which can be approximated with a bigram model:
    P(W) \approx P(w_1) \prod_{i=2}^{n} P(w_i \mid w_{i-1})
- The likelihood P(A | W), which can be approximated as:
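To make the Bayes decomposition concrete, here is a toy sketch of noisy-channel decoding. The slides give no numbers, so every probability below is an invented assumption: `likelihood` stands in for P(a | w) and `bigram` for the bigram prior P(w_i | w_{i-1}).

```python
# Hypothetical toy tables for the /'pre-z&ns/ example; all numbers are invented.
likelihood = {            # P(a | w): probability of the acoustic segment given the word
    "presents": 0.5,
    "presence": 0.5,      # the acoustic signal alone cannot disambiguate
}
bigram = {                # P(w | previous word): the bigram prior
    ("his", "presents"): 0.02,
    ("his", "presence"): 0.001,
}

def decode(prev_word, candidates):
    """argmax_w P(a | w) * P(w | prev): Bayes' rule with the constant P(A) dropped."""
    return max(candidates, key=lambda w: likelihood[w] * bigram[(prev_word, w)])

print(decode("his", ["presents", "presence"]))
```

With these invented numbers the context word "his" breaks the tie that the likelihood alone cannot.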
P(A |W ) " P(aii=1n#| wi)3Search Space Given a word-segmented acoustic sequencelist all candidates Compute the most likely pathpresentsexpressiveboldpressinactiveboughtpresenceexpensivebaldpresidentsexcessiveboat'pre-z&nsik-'spen-siv'bot! P(inactive | bald)! P('bot | bald)Markov Assumption The Markov assumption states thatprobability of the occurrence of word wi attime t depends only on occurrence ofword wi-1 at time t-1 Chain rule: Markov assumption:! P(w1,...,wn) " P(wi| wi#1)i= 2n$! P(w1,...,wn) = P(wi| w1,...,wi"1)i= 2n#The TrellisParameters of an HMM States: A set of states S=s1,…,sn Transition probabilities: A= a1,1,a1,2,…,an,n Eachai,j represents the probability of transitioningfrom state si to sj. Emission probabilities: A set B of functions ofthe form bi(ot) which is the probability ofobservation ot being emitted by si Initial state distribution: is the probability thatsi is a start state! "i4The Three Basic HMM Problems Problem 1 (Evaluation): Given the observationsequence O=o1,…,oT and an HMM model , how do we compute theprobability of O given the model? Problem 2 (Decoding): Given the observationsequence O=o1,…,oT and an HMM model , how do we find the statesequence that best explains the observations?! "= (A,B,#)! "= (A,B,#) Problem 3 (Learning): How do we adjustthe model parameters , tomaximize ?The Three Basic HMM Problems! "= (A,B,#)! P(O |")Problem 1: Probability of an ObservationSequence What is ? The probability of a observation sequence isthe sum of the probabilities of all possible statesequences in the HMM. Naïve computation is very expensive. Given Tobservations and N states, there are NTpossible state sequences. Even small HMMs, e.g. T=10 and N=10,contain 10 billion different paths Solution to this and problem 2 is to usedynamic programming! P(O |")Forward Probabilities What is the probability that, given anHMM , at time t the state is i and thepartial observation o1 … ot has beengenerated?! "t(i) = P(o1... ot, qt= si|#)! 
"5Forward Probabilities! "t( j) ="t#1(i) aiji=1N$% & ' ( ) * bj(ot)! "t(i) = P(o1...ot, qt= si|#)Forward Algorithm Initialization: Induction: Termination:! "t( j) ="t#1(i) aiji=1N$% & ' ( ) * bj(ot) 2 + t + T,1 + j + N! "1(i) =#ibi(o1) 1 $ i $ N! P(O |") =#T(i)i=1N$Forward Algorithm Complexity In the naïve approach to solving problem1 it takes on the order of 2T*NTcomputations The forward algorithm takes on the orderof N2T computationsBackward Probabilities Analogous to the forward probability, justin the other direction What is the probability that given anHMM and given the state at time t is i,the partial observation ot+1 … oT isgenerated?! "t(i) = P(ot +1...oT| qt= si,#)! "6Backward Probabilities! "t(i) = aijbj(ot +1)"t +1( j)j=1N#$ % & & ' ( ) ) ! "t(i) = P(ot +1...oT| qt= si,#)Backward Algorithm Initialization: Induction: Termination:! "T(i) = 1, 1 # i # N! "t(i) = aijbj(ot +1)"t +1( j)j=1N#$ % & & ' ( ) ) t = T *1...1,1 + i + N! P(O |") =#i$1(i)i=1N%Problem 2: Decoding The solution to Problem 1 (Evaluation) gives usthe sum of all paths through an HMM efficiently. For Problem 2, we wan to find the path with thehighest probability. We want to find the state sequence Q=q1…qT,such that! Q = argmaxQ'P(Q'| O,")Viterbi Algorithm Similar to computing the forwardprobabilities, but instead of summingover transitions from incoming states,compute the maximum Forward: Viterbi Recursion:! "t( j) ="t#1(i) aiji=1N$% & ' ( ) * bj(ot)! "t( j) = max1#i# N"t$1(i) aij[ ]bj(ot)7Viterbi Algorithm Initialization: Induction: Termination: Read out path:! "1(i) =#ibj(o1) 1 $ i $ N! "t( j) = max1#i# N"t$1(i) aij[ ]bj(ot)! "t( j) = argmax1#i#N$t%1(i) aij& ' ( ) * + 2 # t # T,1 # j # N! p*= max1"i" N#T(i)! qT*= argmax1"i" N#T(i)! 
qt*="t +1(qt +1*) t = T #1,...,1Problem 3: Learning Up to now we’ve assumed that we know theunderlying model Often these parameters are estimated onannotated training data, which has twodrawbacks: Annotation is difficult and/or expensive Training data is different from the current data We want to maximize the parameters withrespect to the current data, i.e., we’re lookingfor a model , such that! "= (A,B,#)! "'! "'= argmax"P(O |")Problem 3: Learning Unfortunately, there is no known way toanalytically find a global maximum, i.e., amodel , such that But it is possible to find a local maximum Given an initial model , we can always find amodel , such that! "'! "'=