Hidden Markov models (HMMs)
10-601 Machine Learning

What's wrong with Bayesian networks
• Bayesian networks are very useful for modeling joint distributions
• But they have their limitations:
- Cannot account for temporal / sequence models
- Must be DAGs (no self loops or any other loops), so a model with cycles is not a valid Bayesian network

Hidden Markov models
• Model a set of observations with a set of hidden states:
- Robot movement
  Observations: range sensor, visual sensor
  Hidden states: location (on a map)
- Speech processing
  Observations: sound signals
  Hidden states: parts of speech, words
- Biology
  Observations: DNA base pairs
  Hidden states: genes
• Two key ideas:
1. Hidden states generate observations
2. Hidden states transition to other hidden states

Example: Speech processing

Example: Biological data
ATGAAGCTACTGTCTTCTATCGAACAAGCATGCGATATTTGCCGACTTAAAAAGCTCAAG
TGCTCCAAAGAAAAACCGAAGTGCGCCAAGTGTCTGAAGAACAACTGGGAGTGTCGCTAC
TCTCCCAAAACCAAAAGGTCTCCGCTGACTAGGGCACATCTGACAGAAGTGGAATCAAGG
CTAGAAAGACTGGAACAGCTATTTCTACTGATTTTTCCTCGAGAAGACCTTGACATGATT

Example: Gambling on dice outcome
• Two dice, both skewed (output model).
• Can either stay with the same die or switch to the other die (transition model).
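The gambling example can be written out as a small generative program. This is a minimal sketch: the transition probabilities (stay with probability 0.8, switch with probability 0.2), the uniform start, and the skewed face probabilities are the numbers used later in the lecture; the function name `sample_path` is our own.

```python
import random

# Two-state HMM for the gambling example: hidden state = which die is in play.
STATES = ["A", "B"]
INIT = {"A": 0.5, "B": 0.5}                      # start with either die
TRANS = {"A": {"A": 0.8, "B": 0.2},              # stay w.p. 0.8, switch w.p. 0.2
         "B": {"A": 0.2, "B": 0.8}}
EMIT = {"A": {1: .3, 2: .2, 3: .2, 4: .1, 5: .1, 6: .1},   # die A favors low faces
        "B": {1: .1, 2: .1, 3: .1, 4: .2, 5: .2, 6: .3}}   # die B favors high faces

def sample_path(T, seed=0):
    """Sample T rounds: the hidden die sequence and the observed rolls."""
    rng = random.Random(seed)
    def draw(dist):
        r, acc = rng.random(), 0.0
        for outcome, p in dist.items():
            acc += p
            if r < acc:
                return outcome
        return outcome  # guard against floating-point rounding
    q = draw(INIT)
    dice, rolls = [], []
    for _ in range(T):
        dice.append(q)                # hidden state for this round
        rolls.append(draw(EMIT[q]))   # hidden state generates an observation
        q = draw(TRANS[q])            # hidden state transitions
    return dice, rolls
```

The two numbered ideas above appear directly in the loop: each hidden state emits a roll, then transitions. The inference questions in the rest of the lecture ask what the rolls alone tell us about the hidden dice sequence.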
[State diagram: two states A and B; P(stay) = 0.8, P(switch) = 0.2; initial probabilities 0.5 each]

A Hidden Markov model
• A set of states {s1 ... sn}
- At each time point we are in exactly one of these states, denoted by qt
• πi, the probability that we start at state si
• A transition probability model, P(qt = si | qt-1 = sj)
• A set of possible outputs
- At time point t we emit a symbol ot
• An emission probability model, p(ot | si)

The Markov property
• An important aspect of this definition is the Markov property: qt+1 is conditionally independent of qt-1 (and any earlier time points) given qt
• More formally, P(qt+1 = si | qt = sj) = P(qt+1 = si | qt = sj, qt-1 = sk)

What can we ask when using an HMM? A few examples:
• "Which die is currently being used?"
• "What is the probability of a 6 in the next roll?"
• "What is the probability of a 6 in any of the next 3 rolls?"

Inference in HMMs
• Computing P(Q) and P(qt = si)
- If we cannot look at observations
• Computing P(Q | O) and P(qt = si | O)
- When we have observations and care about the last state only
• Computing argmaxQ P(Q | O)
- When we care about the entire path

Which die is currently being used?
• We played t rounds so far
• We want to determine P(qt = A)
• Let's assume for now that we cannot observe any outputs (we are blindfolded)
• How can we compute this?

P(qt = A)?
• Simple answer:
1.
Determine P(Q), where Q is any path that ends in A:
Q = q1, ..., qt-1, A
P(Q) = P(q1, ..., qt-1, A)
     = P(A | q1, ..., qt-1) P(q1, ..., qt-1)
     = P(A | qt-1) P(q1, ..., qt-1)        (Markov property!)
     = ...
     = P(A | qt-1) ... P(q2 | q1) P(q1)    (ends with the initial probability)
2. P(qt = A) = ΣQ P(Q),
where the sum is over all paths of t states that end in A

Q: How many such paths Q are there?
A: A lot! (2^(t-1) of them for two states)
Not a feasible solution.

P(qt = A), the smart way
• Define pt(i) as the probability of being in state i at time t: pt(i) = p(qt = si)
• We can determine pt(i) by induction:
1. p1(i) = πi
2.
pt(i) = Σj p(qt = si | qt-1 = sj) pt-1(j)
• This type of computation is called dynamic programming
• Complexity: O(n^2 · t), where n is the number of states in our HMM

Time / state | t1 | t2 | t3
s1           | .3 |    |
s2           | .7 |    |

Inference in HMMs
• Computing P(Q) and P(qt = si)
• Computing P(Q | O) and P(qt = si | O)
• Computing argmaxQ P(Q | O)

But what if we observe outputs?
• So far, we assumed that we could not observe the outputs
• In reality, we almost always can.

v | P(v | A) | P(v | B)
1 |    .3    |    .1
2 |    .2    |    .1
3 |    .2    |    .1
4 |    .1    |    .2
5 |    .1    |    .2
6 |    .1    |    .3

• Does observing the sequence 5, 6, 4, 5, 6, 6 change our belief about the state?

P(qt = A) when outputs are observed
• We want to compute P(qt = A | o1 ... ot)
• For ease of writing we will use the following notation (commonly used in the literature):
- aj,i = P(qt = si | qt-1 = sj)    (transition probability)
- bi(ot) = P(ot | si)              (emission probability)
• Let's start with a simpler question. Given a sequence of states Q, what is P(Q | o1 ... ot) = P(Q | O)?
- It is pretty simple to move from P(Q) to P(qt = A)
- In some cases P(Q) is the more important question:
- Speech processing
- NLP

P(Q | O)
• We can use Bayes' rule:
P(Q | O) = P(O | Q) P(Q) / P(O)
- P(O | Q) is easy: P(O | Q) = P(o1 | q1) P(o2 | q2) ... P(ot | qt)
- P(Q) is easy: P(Q) = P(q1) P(q2 | q1) ... P(qt | qt-1)
- P(O) is hard!

P(O)
• What is the probability of seeing a set of observations O?
- An important question in its own right, for example for classification using two HMMs
• Define αt(i) = P(o1, o2, ..., ot, qt = si)
• αt(i) is the probability that we:
1. Observe o1, o2, ..., ot, and
2. End at state si at time t
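The quantity αt(i) can be computed with the same kind of dynamic programming as pt(i), now weighting each step by the emission probability bi(ot). A minimal sketch, using the transition and emission numbers from this lecture (variable and function names are our own):

```python
# Forward algorithm: alpha[t][i] = P(o1..ot, qt = si).
INIT = {"A": 0.5, "B": 0.5}
TRANS = {"A": {"A": 0.8, "B": 0.2}, "B": {"A": 0.2, "B": 0.8}}
EMIT = {"A": {1: .3, 2: .2, 3: .2, 4: .1, 5: .1, 6: .1},
        "B": {1: .1, 2: .1, 3: .1, 4: .2, 5: .2, 6: .3}}

def forward(obs):
    """Return the list of alpha vectors, one dict {state: prob} per time step."""
    alphas = []
    # Base case: alpha_1(i) = pi_i * b_i(o1)
    a = {s: INIT[s] * EMIT[s][obs[0]] for s in INIT}
    alphas.append(a)
    # Induction: alpha_t(i) = b_i(o_t) * sum_j a_{j,i} * alpha_{t-1}(j)
    for o in obs[1:]:
        a = {s: EMIT[s][o] * sum(TRANS[j][s] * a[j] for j in a) for s in a}
        alphas.append(a)
    return alphas

obs = [5, 6, 4, 5, 6, 6]               # the roll sequence from the slides
alphas = forward(obs)
p_obs = sum(alphas[-1].values())       # P(O) = sum_i alpha_T(i)
posterior_A = alphas[-1]["A"] / p_obs  # P(qT = A | O), by Bayes' rule
```

Because the observed rolls are mostly high faces, which die B favors, the posterior on die A comes out below 0.5: observing outputs really does change our belief about the state.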
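Returning to the blindfolded setting from earlier in the lecture: the pt(i) recursion can be checked against brute-force enumeration over all paths, which is exactly the infeasible "simple answer". A sketch with our own function names, using the lecture's transition model:

```python
from itertools import product

INIT = {"A": 0.5, "B": 0.5}
TRANS = {"A": {"A": 0.8, "B": 0.2}, "B": {"A": 0.2, "B": 0.8}}

def p_state_dp(t, init=INIT):
    """pt(i) by dynamic programming: O(n^2 * t)."""
    p = dict(init)                                   # base case: p1(i) = pi_i
    for _ in range(t - 1):
        # induction: pt(i) = sum_j P(qt = si | qt-1 = sj) * pt-1(j)
        p = {s: sum(TRANS[j][s] * p[j] for j in p) for s in p}
    return p

def p_state_brute(t, init=INIT):
    """pt(i) by summing P(Q) over every length-t path: exponentially many terms."""
    out = {s: 0.0 for s in init}
    for path in product(init, repeat=t):
        prob = init[path[0]]                         # P(q1)
        for a, b in zip(path, path[1:]):
            prob *= TRANS[a][b]                      # Markov chain factorization
        out[path[-1]] += prob                        # group paths by final state
    return out
```

With the uniform 0.5/0.5 start the chain is already at its stationary distribution, so pt stays at (0.5, 0.5) for every t; with a skewed start (a hypothetical choice for illustration), the two methods still agree while the DP version stays cheap.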