15-780: Graduate Artificial Intelligence
Hidden Markov Models (HMMs)

What's wrong with Bayesian networks
• Bayesian networks are very useful for modeling joint distributions
• But they have their limitations:
  - Cannot account for temporal / sequence models
  - Must be DAGs (no self-loops or any other loops)
[Figure: a graph containing a cycle. This is not a valid Bayesian network!]

Hidden Markov models
• Model a set of observations with a set of hidden states:
  - Robot movement. Observations: range sensor, visual sensor. Hidden states: location (on a map)
  - Speech processing. Observations: sound signals. Hidden states: parts of speech, words
  - Biology. Observations: DNA base pairs. Hidden states: genes
• Two ingredients:
  1. Hidden states generate observations
  2. Hidden states transition to other hidden states

Example: Speech processing
[Figure: speech signal annotated with words and phonemes]

Example: Biological data
ATGAAGCTACTGTCTTCTATCGAACAAGCATGCGATATTTGCCGACTTAAAAAGCTCAAGTGCTCCAAAGAAAAACCGAAGTGCGCCAAGTGTCTGAAGAACAACTGGGAGTGTCGCTACTCTCCCAAAACCAAAAGGTCTCCGCTGACTAGGGCACATCTGACAGAAGTGGAATCAAGGCTAGAAAGACTGGAACAGCTATTTCTACTGATTTTTCCTCGAGAAGACCTTGACATGATT

Example: Gambling on dice outcome
• Two dice, both skewed (output model).
• Can either stay with the same die or switch to the second die (transition model).
[Figure: two states A and B; each stays with probability 0.8 and switches with probability 0.2]

A Hidden Markov model
• A set of states {s_1 ... s_n}
  - At each time point we are in exactly one of these states, denoted by q_t
• Pi_i, the probability that we start at state s_i
• A transition probability model, P(q_t = s_i | q_{t-1} = s_j)
• A set of possible outputs Sigma
  - At time point t we emit a symbol o_t in Sigma
• An emission probability model, P(o_t = sigma | s_i)

The Markov property
• An important aspect of this definition is the Markov property: q_{t+1} is conditionally independent of q_{t-1} (and any earlier time points) given q_t
• More formally: P(q_{t+1} = s_i | q_t = s_j) = P(q_{t+1} = s_i | q_t = s_j, q_{t-1} = s_k)

What can we ask when using an HMM? A few examples:
• "Which die is currently being used?"
• "What is the probability of a 6 in the next roll?"
• "What is the probability of a 6 in any of the next 3 rolls?"

Inference in HMMs
• Computing P(Q) and P(q_t = s_i)
  - If we cannot look at observations
• Computing P(Q | O) and P(q_t = s_i | O)
  - When we have observations and care about the last state only
• Computing argmax_Q P(Q | O)
  - When we care about the entire path

Which die is currently being used?
• There were t rounds so far
• We want to determine P(q_t = A)
• Let's assume for now that we cannot observe any outputs (we are blindfolded)
• How can we compute this?

P(q_t = A)?
• Simple answer:
  1. Determine P(Q), where Q is any path that ends in A: Q = q_1, ..., q_{t-1}, A
     P(Q) = P(q_1, ..., q_{t-1}, A)
          = P(A | q_1, ..., q_{t-1}) P(q_1, ..., q_{t-1})
          = P(A | q_{t-1}) P(q_1, ..., q_{t-1})            (Markov property!)
          = ... = P(A | q_{t-1}) ... P(q_2 | q_1) P(q_1)   (initial probability)
  2. P(q_t = A) = sum of P(Q), where the sum is over all sequences of t states that end in A
• Q: How many sequences Q are there? A: A lot!
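The brute-force sum above can be sketched in code for the dice example. The 0.8 / 0.2 transition probabilities come from the slides; the initial distribution (0.3 for A, 0.7 for B) is an assumed value for illustration, since the deck does not state Pi explicitly.

```python
from itertools import product

# Brute-force computation of P(q_t = A) by enumerating every path.
# Transitions (stay 0.8, switch 0.2) are from the dice example;
# the initial distribution pi is an ASSUMED value for illustration.
states = ["A", "B"]
pi = {"A": 0.3, "B": 0.7}                    # P(q_1 = s_i), assumed
trans = {("A", "A"): 0.8, ("A", "B"): 0.2,   # trans[(j, i)] = P(q_t = i | q_{t-1} = j)
         ("B", "A"): 0.2, ("B", "B"): 0.8}

def p_qt_brute_force(t, end_state):
    """Sum P(Q) over all 2^(t-1) state sequences that end in end_state."""
    total = 0.0
    for prefix in product(states, repeat=t - 1):   # all choices of q_1 .. q_{t-1}
        path = prefix + (end_state,)
        p = pi[path[0]]                            # initial probability
        for prev, cur in zip(path, path[1:]):      # chain rule + Markov property
            p *= trans[(prev, cur)]
        total += p
    return total

print(p_qt_brute_force(5, "A"))   # enumerates 2^4 = 16 paths
```

Note how the inner loop only ever conditions on the previous state: that is the Markov property doing the work in the derivation above. The cost, however, grows exponentially in t.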
  There are 2^(t-1) of them, so this is not a feasible solution.

P(q_t = A), the smart way
• Let's define p_t(i) = probability of being in state i at time t = P(q_t = s_i)
• We can determine p_t(i) by induction:
  1. p_1(i) = Pi_i
  2. p_t(i) = sum over j of P(q_t = s_i | q_{t-1} = s_j) p_{t-1}(j)
• This type of computation is called dynamic programming
• Complexity: O(n^2 * t), where n is the number of states in our HMM

[Table: p_t(i) values, states by time]
  state \ time |  t1  |  t2  |  t3
  s1           |  .3  |      |
  s2           |  .7  |      |

Inference in HMMs
• Computing P(Q) and P(q_t = s_i)   [done]
• Computing P(Q | O) and P(q_t = s_i | O)
• Computing argmax_Q P(Q | O)

But what if we observe outputs?
• So far, we assumed that we could not observe the outputs
• In reality, we almost always can.

  v | P(v | A) | P(v | B)
  1 |    .3    |    .1
  2 |    .2    |    .1
  3 |    .2    |    .1
  4 |    .1    |    .2
  5 |    .1    |    .2
  6 |    .1    |    .3

• Does observing the sequence 5, 6, 4, 5, 6, 6 change our belief about the state?
• HMMs are often represented by the following structure:
[Figure: a chain of hidden states q_1 -> q_2 -> ... -> q_t, each emitting an observation o_t]

P(q_t = A) when outputs are observed
• We want to compute P(q_t = A | o_1 ... o_t)
• For ease of writing we will use the following notation (common in the literature):
  - a_ij = P(q_t = s_i | q_{t-1} = s_j)
  - b_i(o_t) = P(o_t | s_i)
• Let's start with a simpler question. Given a sequence of states Q, what is P(Q | o_1 ... o_t) = P(Q | O)?
  - It is pretty simple to move from P(Q) to P(q_t = A)
  - In some cases P(Q) is the more important question:
    - Speech processing
    - NLP

P(Q | O)
• We can use Bayes' rule:
  P(Q | O) = P(O | Q) P(Q) / P(O)
• Easy: P(O | Q) = P(o_1 | q_1) P(o_2 | q_2) ... P(o_t | q_t)
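The Bayes-rule decomposition can be sketched for the dice example. The emission probabilities come from the P(v | A), P(v | B) table; the initial distribution (0.3 / 0.7) is an assumed value for illustration.

```python
# Sketch of the numerator of Bayes' rule: P(O | Q) * P(Q).
# Emissions come from the P(v | A), P(v | B) table in the slides;
# the initial distribution pi is an ASSUMED value for illustration.
pi = {"A": 0.3, "B": 0.7}
trans = {("A", "A"): 0.8, ("A", "B"): 0.2,
         ("B", "A"): 0.2, ("B", "B"): 0.8}
emit = {"A": {1: .3, 2: .2, 3: .2, 4: .1, 5: .1, 6: .1},
        "B": {1: .1, 2: .1, 3: .1, 4: .2, 5: .2, 6: .3}}

def p_obs_given_path(O, Q):
    """P(O | Q) = P(o_1 | q_1) P(o_2 | q_2) ... P(o_t | q_t)."""
    p = 1.0
    for o, q in zip(O, Q):
        p *= emit[q][o]
    return p

def p_path(Q):
    """P(Q) = Pi_{q_1} * P(q_2 | q_1) * ... * P(q_t | q_{t-1})."""
    p = pi[Q[0]]
    for prev, cur in zip(Q, Q[1:]):
        p *= trans[(prev, cur)]
    return p

O = [5, 6, 4, 5, 6, 6]     # the observed rolls from the slide
Q = ["B"] * 6              # one candidate state sequence
joint = p_obs_given_path(O, Q) * p_path(Q)   # = P(Q | O) * P(O)
print(joint)
```

To get the actual posterior P(Q | O), this joint value would be divided by P(O), which is the same product summed over all candidate sequences Q.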
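For completeness, the "smart way" dynamic-programming recursion derived earlier (p_1(i) = Pi_i, then p_t(i) = sum over j of P(q_t = s_i | q_{t-1} = s_j) p_{t-1}(j)) can be sketched as follows; as before, the 0.3 / 0.7 initial distribution is an assumed value matching the t1 column of the table.

```python
# O(n^2 * t) dynamic-programming recursion for P(q_t = s_i):
#   p_1(i) = Pi_i
#   p_t(i) = sum_j P(q_t = s_i | q_{t-1} = s_j) * p_{t-1}(j)
# The 0.3 / 0.7 initial distribution is an ASSUMED value for illustration.
states = ["A", "B"]
pi = {"A": 0.3, "B": 0.7}
trans = {("A", "A"): 0.8, ("A", "B"): 0.2,   # trans[(j, i)] = P(q_t = i | q_{t-1} = j)
         ("B", "A"): 0.2, ("B", "B"): 0.8}

def p_qt_dynamic(t):
    """Return {s_i: P(q_t = s_i)} using the forward recursion."""
    p = dict(pi)                              # base case: p_1(i) = Pi_i
    for _ in range(t - 1):                    # one n^2 update per time step
        p = {i: sum(trans[(j, i)] * p[j] for j in states)
             for i in states}
    return p

# P(q_5 = A) matches the brute-force sum over all paths, but in
# linear time in t instead of exponential.
print(p_qt_dynamic(5))
```

Each step reuses the previous step's n numbers instead of re-enumerating paths, which is exactly what turns the 2^(t-1) sum into an O(n^2 * t) computation.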