Hidden Markov Models
Bonnie Dorr, Christof Monz
CMSC 723: Introduction to Computational Linguistics
Lecture 5, October 6, 2004

Hidden Markov Model (HMM)
- HMMs allow you to estimate the probabilities of unobserved events.
- Given plain text, which underlying parameters generated the surface?
- E.g., in speech recognition, the observed data is the acoustic signal and the words are the hidden parameters.

HMMs and their Usage
- HMMs are very common in Computational Linguistics:
  - Speech recognition (observed: acoustic signal, hidden: words)
  - Handwriting recognition (observed: image, hidden: words)
  - Part-of-speech tagging (observed: words, hidden: part-of-speech tags)
  - Machine translation (observed: foreign words, hidden: words in the target language)

Noisy Channel Model
- In speech recognition you observe an acoustic signal (A = a_1, ..., a_n) and you want to determine the most likely sequence of words (W = w_1, ..., w_n): P(W | A).
- Problem: A and W are too specific for reliable counts on observed data, and are very unlikely to occur in unseen data.

Noisy Channel Model
- Assume that the acoustic signal (A) is already segmented with respect to word boundaries.
- P(W | A) could then be computed as
    P(W \mid A) = \max \prod_i P(w_i \mid a_i)
- Problem: finding the most likely word corresponding to an acoustic representation depends on the context.
- E.g., /'pre-z&ns/ could mean "presents" or "presence" depending on the context!

Noisy Channel Model
- Given a candidate sequence W, we need to compute P(W) and combine it with P(W | A).
- Applying Bayes' rule:
    \operatorname{argmax}_W P(W \mid A) = \operatorname{argmax}_W \frac{P(A \mid W)\, P(W)}{P(A)}
- The denominator P(A) can be dropped, because it is constant for all W.

Noisy Channel in a Picture
[figure: the noisy channel model as a diagram]

Decoding
- The decoder combines evidence from:
- The prior P(W), which can be approximated with a bigram model:
    P(W) \approx P(w_1) \prod_{i=2}^{n} P(w_i \mid w_{i-1})
- The likelihood P(A | W), which can be approximated as:
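To make the Bayes decomposition concrete, here is a toy sketch of noisy-channel decoding. The slides give no numbers, so every probability below is an invented assumption: `likelihood` stands in for P(a | w) and `bigram` for the bigram prior P(w_i | w_{i-1}).

```python
# Hypothetical toy tables for the /'pre-z&ns/ example; all numbers are invented.
likelihood = {            # P(a | w): probability of the acoustic segment given the word
    "presents": 0.5,
    "presence": 0.5,      # the acoustic signal alone cannot disambiguate
}
bigram = {                # P(w | previous word): the bigram prior
    ("his", "presents"): 0.02,
    ("his", "presence"): 0.001,
}

def decode(prev_word, candidates):
    """argmax_w P(a | w) * P(w | prev): Bayes' rule with the constant P(A) dropped."""
    return max(candidates, key=lambda w: likelihood[w] * bigram[(prev_word, w)])

print(decode("his", ["presents", "presence"]))
```

With these invented numbers the context word "his" breaks the tie that the likelihood alone cannot.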
P(A |W ) " P(aii=1n#| wi)3Search Space Given a word-segmented acoustic sequencelist all candidates Compute the most likely pathpresentsexpressiveboldpressinactiveboughtpresenceexpensivebaldpresidentsexcessiveboat'pre-z&nsik-'spen-siv'bot! P(inactive | bald)! P('bot | bald)Markov Assumption The Markov assumption states thatprobability of the occurrence of word wi attime t depends only on occurrence ofword wi-1 at time t-1 Chain rule: Markov assumption:! P(w1,...,wn) " P(wi| wi#1)i= 2n$! P(w1,...,wn) = P(wi| w1,...,wi"1)i= 2n#The TrellisParameters of an HMM States: A set of states S=s1,…,sn Transition probabilities: A= a1,1,a1,2,…,an,n Eachai,j represents the probability of transitioningfrom state si to sj. Emission probabilities: A set B of functions ofthe form bi(ot) which is the probability ofobservation ot being emitted by si Initial state distribution: is the probability thatsi is a start state! "i4The Three Basic HMM Problems Problem 1 (Evaluation): Given the observationsequence O=o1,…,oT and an HMM model , how do we compute theprobability of O given the model? Problem 2 (Decoding): Given the observationsequence O=o1,…,oT and an HMM model , how do we find the statesequence that best explains the observations?! "= (A,B,#)! "= (A,B,#) Problem 3 (Learning): How do we adjustthe model parameters , tomaximize ?The Three Basic HMM Problems! "= (A,B,#)! P(O |")Problem 1: Probability of an ObservationSequence What is ? The probability of a observation sequence isthe sum of the probabilities of all possible statesequences in the HMM. Naïve computation is very expensive. Given Tobservations and N states, there are NTpossible state sequences. Even small HMMs, e.g. T=10 and N=10,contain 10 billion different paths Solution to this and problem 2 is to usedynamic programming! P(O |")Forward Probabilities What is the probability that, given anHMM , at time t the state is i and thepartial observation o1 … ot has beengenerated?! "t(i) = P(o1... ot, qt= si|#)! 
"5Forward Probabilities! "t( j) ="t#1(i) aiji=1N$% & ' ( ) * bj(ot)! "t(i) = P(o1...ot, qt= si|#)Forward Algorithm Initialization: Induction: Termination:! "t( j) ="t#1(i) aiji=1N$% & ' ( ) * bj(ot) 2 + t + T,1 + j + N! "1(i) =#ibi(o1) 1 $ i $ N! P(O |") =#T(i)i=1N$Forward Algorithm Complexity In the naïve approach to solving problem1 it takes on the order of 2T*NTcomputations The forward algorithm takes on the orderof N2T computationsBackward Probabilities Analogous to the forward probability, justin the other direction What is the probability that given anHMM and given the state at time t is i,the partial observation ot+1 … oT isgenerated?! "t(i) = P(ot +1...oT| qt= si,#)! "6Backward Probabilities! "t(i) = aijbj(ot +1)"t +1( j)j=1N#$ % & & ' ( ) ) ! "t(i) = P(ot +1...oT| qt= si,#)Backward Algorithm Initialization: Induction: Termination:! "T(i) = 1, 1 # i # N! "t(i) = aijbj(ot +1)"t +1( j)j=1N#$ % & & ' ( ) ) t = T *1...1,1 + i + N! P(O |") =#i$1(i)i=1N%Problem 2: Decoding The solution to Problem 1 (Evaluation) gives usthe sum of all paths through an HMM efficiently. For Problem 2, we wan to find the path with thehighest probability. We want to find the state sequence Q=q1…qT,such that! Q = argmaxQ'P(Q'| O,")Viterbi Algorithm Similar to computing the forwardprobabilities, but instead of summingover transitions from incoming states,compute the maximum Forward: Viterbi Recursion:! "t( j) ="t#1(i) aiji=1N$% & ' ( ) * bj(ot)! "t( j) = max1#i# N"t$1(i) aij[ ]bj(ot)7Viterbi Algorithm Initialization: Induction: Termination: Read out path:! "1(i) =#ibj(o1) 1 $ i $ N! "t( j) = max1#i# N"t$1(i) aij[ ]bj(ot)! "t( j) = argmax1#i#N$t%1(i) aij& ' ( ) * + 2 # t # T,1 # j # N! p*= max1"i" N#T(i)! qT*= argmax1"i" N#T(i)! 
qt*="t +1(qt +1*) t = T #1,...,1Problem 3: Learning Up to now we’ve assumed that we know theunderlying model Often these parameters are estimated onannotated training data, which has twodrawbacks: Annotation is difficult and/or expensive Training data is different from the current data We want to maximize the parameters withrespect to the current data, i.e., we’re lookingfor a model , such that! "= (A,B,#)! "'! "'= argmax"P(O |")Problem 3: Learning Unfortunately, there is no known way toanalytically find a global maximum, i.e., amodel , such that But it is possible to find a local maximum Given an initial model , we can always find amodel , such that! "'! "'=