Hidden Markov Models
Bonnie Dorr, Christof Monz
CMSC 723: Introduction to Computational Linguistics
Lecture 5, October 6, 2004

Hidden Markov Model (HMM)
- HMMs allow you to estimate probabilities of unobserved events.
- Given plain text, which underlying parameters generated the surface?
- E.g., in speech recognition, the observed data is the acoustic signal and the words are the hidden parameters.

HMMs and their Usage
- HMMs are very common in Computational Linguistics:
  - Speech recognition (observed: acoustic signal, hidden: words)
  - Handwriting recognition (observed: image, hidden: words)
  - Part-of-speech tagging (observed: words, hidden: part-of-speech tags)
  - Machine translation (observed: foreign words, hidden: words in the target language)

Noisy Channel Model
- In speech recognition you observe an acoustic signal ($A = a_1, \ldots, a_n$) and you want to determine the most likely sequence of words ($W = w_1, \ldots, w_n$): $P(W \mid A)$.
- Problem: A and W are too specific for reliable counts on observed data, and are very unlikely to occur in unseen data.

Noisy Channel Model (cont.)
- Assume that the acoustic signal (A) is already segmented with respect to word boundaries.
- $P(W \mid A)$ could be computed as
  $$P(W \mid A) = \max \prod_{w_i,\, a_i} P(w_i \mid a_i)$$
- Problem: Finding the most likely word corresponding to an acoustic representation depends on the context.
- E.g., /'pre-z&ns/ could mean "presents" or "presence" depending on the context!

Noisy Channel Model (cont.)
- Given a candidate sequence W we need to compute P(W) and combine it with P(W | A).
- Applying Bayes' rule:
  $$\operatorname*{argmax}_W P(W \mid A) = \operatorname*{argmax}_W \frac{P(A \mid W)\, P(W)}{P(A)}$$
- The denominator P(A) can be dropped, because it is constant for all W!

Noisy Channel in a Picture
- (Slide shows a diagram of the noisy channel model.)

Decoding
- The decoder combines evidence from:
  - The likelihood $P(A \mid W)$, approximated as $P(A \mid W) \approx \prod_{i=1}^{n} P(a_i \mid w_i)$
  - The prior $P(W)$, approximated as $P(W) \approx P(w_1) \prod_{i=2}^{n} P(w_i \mid w_{i-1})$

Search Space
- Given a word-segmented acoustic sequence, list all candidates and compute the most likely path.
- (Slide shows a candidate lattice: /'pre-z&ns/: presents, press, presence, presidents; /ik-'spen-siv/: expressive, inactive, expensive, excessive; /'bot/: bold, bought, bald, boat; with edges weighted by probabilities such as P(inactive | bald) and P('bot | bald).)

Markov Assumption
- The Markov assumption states that the probability of the occurrence of word $w_i$ at time t depends only on the occurrence of word $w_{i-1}$ at time t-1.
- Chain rule: $P(w_1, \ldots, w_n) = P(w_1) \prod_{i=2}^{n} P(w_i \mid w_1, \ldots, w_{i-1})$
- Markov assumption: $P(w_1, \ldots, w_n) \approx P(w_1) \prod_{i=2}^{n} P(w_i \mid w_{i-1})$

The Trellis
- (Slide shows the trellis of states unrolled over time.)

Parameters of an HMM
- States: a set of states $S = s_1, \ldots, s_N$.
- Transition probabilities: $A = a_{1,1}, a_{1,2}, \ldots, a_{N,N}$. Each $a_{i,j}$ represents the probability of transitioning from state $s_i$ to state $s_j$.
- Emission probabilities: a set B of functions of the form $b_i(o_t)$, the probability of observation $o_t$ being emitted by state $s_i$.
- Initial state distribution: $\pi_i$ is the probability that $s_i$ is a start state.

The Three Basic HMM Problems
- Problem 1 (Evaluation): Given the observation sequence $O = o_1, \ldots, o_T$ and an HMM model $\lambda = (A, B, \pi)$, how do we compute the probability of O given the model, $P(O \mid \lambda)$?
- Problem 2 (Decoding): Given the observation sequence $O = o_1, \ldots, o_T$ and an HMM model $\lambda = (A, B, \pi)$, how do we find the state sequence that best explains the observations?
- Problem 3 (Learning): How do we adjust the model parameters $\lambda = (A, B, \pi)$ to maximize $P(O \mid \lambda)$?

Problem 1: Probability of an Observation Sequence
- What is $P(O \mid \lambda)$?
- The probability of an observation sequence is the sum of the probabilities of all possible state sequences in the HMM.
- Naïve computation is very expensive: given T observations and N states, there are $N^T$ possible state sequences. Even small HMMs, e.g. T = 10 and N = 10, contain 10 billion different paths.
- The solution to this problem, and to Problem 2, is to use dynamic programming (see the sketch below).
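To make the parameters $(A, B, \pi)$ and the cost of the naïve evaluation concrete, here is a minimal Python sketch. The toy two-state model, its observation symbols, and all variable names are illustrative assumptions, not part of the original slides; the function simply enumerates all $N^T$ state sequences to compute $P(O \mid \lambda)$.

```python
# A minimal sketch (not from the slides): a toy two-state HMM and the naive
# evaluation of P(O | lambda) by summing over every possible state sequence.
from itertools import product

states = ["s1", "s2"]                        # S = s_1, ..., s_N
pi = {"s1": 0.6, "s2": 0.4}                  # initial distribution pi_i
A = {"s1": {"s1": 0.7, "s2": 0.3},           # transition probabilities a_ij
     "s2": {"s1": 0.4, "s2": 0.6}}
B = {"s1": {"x": 0.5, "y": 0.5},             # emission probabilities b_i(o_t)
     "s2": {"x": 0.1, "y": 0.9}}

def naive_evaluation(O):
    """Sum P(O, Q | lambda) over all N^T state sequences Q (exponential)."""
    total = 0.0
    for Q in product(states, repeat=len(O)):
        p = pi[Q[0]] * B[Q[0]][O[0]]
        for t in range(1, len(O)):
            p *= A[Q[t - 1]][Q[t]] * B[Q[t]][O[t]]
        total += p
    return total

print(naive_evaluation(["x", "y", "y"]))     # P(O | lambda) for a toy sequence
```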
Forward Probabilities
- What is the probability that, given an HMM $\lambda$, at time t the state is i and the partial observation $o_1 \ldots o_t$ has been generated?
- $\alpha_t(i) = P(o_1 \ldots o_t,\, q_t = s_i \mid \lambda)$
- The forward probabilities can be computed recursively:
  $$\alpha_t(j) = \left[ \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij} \right] b_j(o_t)$$

Forward Algorithm
- Initialization: $\alpha_1(i) = \pi_i\, b_i(o_1)$, $1 \le i \le N$
- Induction: $\alpha_t(j) = \left[ \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij} \right] b_j(o_t)$, $2 \le t \le T$, $1 \le j \le N$
- Termination: $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$
- (A code sketch of this algorithm is given at the end of these notes.)

Forward Algorithm Complexity
- In the naïve approach to solving Problem 1 it takes on the order of $2T \cdot N^T$ computations.
- The forward algorithm takes on the order of $N^2 T$ computations.

Backward Probabilities
- Analogous to the forward probability, just in the other direction.
- What is the probability that, given an HMM $\lambda$ and given that the state at time t is i, the partial observation $o_{t+1} \ldots o_T$ is generated?
- $\beta_t(i) = P(o_{t+1} \ldots o_T \mid q_t = s_i, \lambda)$
- The backward probabilities can be computed recursively:
  $$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$$

Backward Algorithm
- Initialization: $\beta_T(i) = 1$, $1 \le i \le N$
- Induction: $\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$, $t = T-1, \ldots, 1$, $1 \le i \le N$
- Termination: $P(O \mid \lambda) = \sum_{i=1}^{N} \pi_i\, b_i(o_1)\, \beta_1(i)$

Problem 2: Decoding
- The solution to Problem 1 (Evaluation) gives us the sum of all paths through an HMM efficiently.
- For Problem 2, we want to find the single path with the highest probability.
- We want to find the state sequence $Q = q_1 \ldots q_T$ such that
  $$Q = \operatorname*{argmax}_{Q'} P(Q' \mid O, \lambda)$$

Viterbi Algorithm
- Similar to computing the forward probabilities, but instead of summing over transitions from incoming states, compute the maximum.
- Forward: $\alpha_t(j) = \left[ \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij} \right] b_j(o_t)$
- Viterbi recursion: $\delta_t(j) = \max_{1 \le i \le N} \left[ \delta_{t-1}(i)\, a_{ij} \right] b_j(o_t)$

Viterbi Algorithm (cont.)
- Initialization: $\delta_1(i) = \pi_i\, b_i(o_1)$, $1 \le i \le N$
- Induction: $\delta_t(j) = \max_{1 \le i \le N} \left[ \delta_{t-1}(i)\, a_{ij} \right] b_j(o_t)$ and $\psi_t(j) = \operatorname*{argmax}_{1 \le i \le N} \left[ \delta_{t-1}(i)\, a_{ij} \right]$, for $2 \le t \le T$, $1 \le j \le N$
- Termination: $p^* = \max_{1 \le i \le N} \delta_T(i)$, $q_T^* = \operatorname*{argmax}_{1 \le i \le N} \delta_T(i)$
- Read out path: $q_t^* = \psi_{t+1}(q_{t+1}^*)$, $t = T-1, \ldots, 1$
- (A code sketch of the Viterbi algorithm is also given at the end of these notes.)

Problem 3: Learning
- Up to now we've assumed that we know the underlying model $\lambda = (A, B, \pi)$.
- Often these parameters are estimated on annotated training data, which has two drawbacks:
  - Annotation is difficult and/or expensive.
  - Training data is different from the current data.
- We want to maximize the parameters with respect to the current data, i.e., we're looking for a model $\lambda'$ such that $\lambda' = \operatorname*{argmax}_{\lambda} P(O \mid \lambda)$.

Problem 3: Learning (cont.)
- Unfortunately, there is no known way to analytically find a global maximum, i.e., a model $\lambda'$ such that $\lambda' = \operatorname*{argmax}_{\lambda} P(O \mid \lambda)$.
- But it is possible to find a local maximum.
- Given an initial model $\lambda$, we can always find a model $\lambda'$ such that $P(O \mid \lambda') \ge P(O \mid \lambda)$.
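As referenced above, here is a small Python sketch of the forward algorithm. It reuses the toy `states`, `pi`, `A`, and `B` from the earlier sketch (illustrative assumptions, not from the slides) and follows the initialization, induction, and termination steps on the Forward Algorithm slide.

```python
def forward(O):
    """Forward algorithm: compute P(O | lambda) in O(N^2 * T) time."""
    # Initialization: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [{i: pi[i] * B[i][O[0]] for i in states}]
    # Induction: alpha_t(j) = (sum_i alpha_{t-1}(i) * a_ij) * b_j(o_t)
    for t in range(1, len(O)):
        alpha.append({j: sum(alpha[t - 1][i] * A[i][j] for i in states)
                         * B[j][O[t]]
                      for j in states})
    # Termination: P(O | lambda) = sum_i alpha_T(i)
    return sum(alpha[-1].values())

# Should agree with naive_evaluation(...) up to floating-point error.
print(forward(["x", "y", "y"]))
```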
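And a matching sketch of the Viterbi algorithm on the same assumed toy model: the only change from the forward pass is replacing the sum over incoming states with a max, plus back-pointers so the best path can be read out.

```python
def viterbi(O):
    """Viterbi algorithm: best state sequence q_1*...q_T* and its probability p*."""
    # Initialization: delta_1(i) = pi_i * b_i(o_1)
    delta = [{i: pi[i] * B[i][O[0]] for i in states}]
    psi = [{}]                                   # back-pointers psi_t(j)
    for t in range(1, len(O)):
        delta.append({})
        psi.append({})
        for j in states:
            # Induction: delta_t(j) = max_i [delta_{t-1}(i) * a_ij] * b_j(o_t)
            best_i = max(states, key=lambda i: delta[t - 1][i] * A[i][j])
            delta[t][j] = delta[t - 1][best_i] * A[best_i][j] * B[j][O[t]]
            psi[t][j] = best_i
    # Termination: q_T* = argmax_i delta_T(i), p* = max_i delta_T(i)
    q_T = max(states, key=lambda i: delta[-1][i])
    p_star = delta[-1][q_T]
    # Read out the path: q_t* = psi_{t+1}(q_{t+1}*)
    path = [q_T]
    for t in range(len(O) - 1, 0, -1):
        path.insert(0, psi[t][path[0]])
    return path, p_star

print(viterbi(["x", "y", "y"]))   # best path and p* for the toy sequence
```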