Learning in HMMs
10-601 Machine Learning

A Hidden Markov Model
• A set of states {s_1, ..., s_n}
  - At each time point we are in exactly one of these states, denoted by q_t
• π_i, the probability that we start at state s_i
• A transition probability model, P(q_t = s_i | q_{t-1} = s_j)
• A set of possible outputs
  - At time t we emit a symbol o_t
• An emission probability model, p(o_t | s_i)
[Figure: a two-state model with states A and B, self-transition probability 0.8, probability 0.2 of switching states, and initial probabilities 0.5 / 0.5]

Inference in HMMs
• Computing P(Q) and P(q_t = s_i)
• Computing P(Q | O) and P(q_t = s_i | O)
• Computing argmax_Q P(Q)

Most Probable Path
• We are almost done ...
• One final question remains: how do we find the most probable path, that is, Q* = argmax_Q P(Q | O)?
• This is an important path
  - the words in speech processing
  - the set of genes in the genome
  - etc.

Example
• What is the most probable set of states leading to the sequence 1, 2, 2, 5, 6, 5, 1, 2, 3?
• Model: the two-state HMM (A, B) above, with initial probabilities π_A = 0.7, π_B = 0.3 and emission probabilities

  v   P(v | A)   P(v | B)
  1     .3         .1
  2     .2         .1
  3     .2         .1
  4     .1         .2
  5     .1         .2
  6     .1         .3

Most Probable Path
  argmax_Q P(Q | O) = argmax_Q P(O | Q) P(Q) / P(O) = argmax_Q P(O | Q) P(Q)
We will use the following definition:
  δ_t(i) = max_{q_1 ... q_{t-1}} p(q_1, ..., q_{t-1}, q_t = s_i, O_1, ..., O_t)
In other words, we are interested in the most likely path from 1 to t that
1. ends in s_i, and
2. produces outputs O_1 ... O_t.

Computing δ_t(i)
Base case:
  δ_1(i) = p(q_1 = s_i, O_1) = p(q_1 = s_i) p(O_1 | q_1 = s_i) = π_i b_i(O_1)
Q: Given δ_t(i), how can we compute δ_{t+1}(i)?
A: To get from δ_t(i) to δ_{t+1}(i) we need to (1) add an emission for time t+1 (O_{t+1}) and (2) transition to state s_i:
  δ_{t+1}(i) = max_{q_1 ... q_t} p(q_1, ..., q_t, q_{t+1} = s_i, O_1, ..., O_{t+1})
             = max_j δ_t(j) p(q_{t+1} = s_i | q_t = s_j) p(O_{t+1} | q_{t+1} = s_i)
             = max_j δ_t(j) a_{j,i} b_i(O_{t+1})

The Viterbi Algorithm
• Once again we use dynamic programming, filling in δ_t(i) for all t and i with the recursion above
• Once we have the δ's, we can recover Q* = argmax_Q P(Q | O): it is the path that ends in the state maximizing δ at the final time step, recovered by backtracking through the maximizing choices
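The lecture itself contains no code, but a minimal Python sketch of the Viterbi recursion may help. It is written for the two-state example above, reading the diagram as self-transition probability 0.8, switching probability 0.2, and π_A = 0.7, π_B = 0.3; the function and variable names (viterbi, pi, A, B, obs) are my own, not from the slides.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """delta_1(i) = pi_i b_i(O_1); delta_{t+1}(i) = max_j delta_t(j) a_{j,i} b_i(O_{t+1})."""
    n_states, T = len(pi), len(obs)
    delta = np.zeros((T, n_states))
    back = np.zeros((T, n_states), dtype=int)    # argmax pointers for backtracking
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        for i in range(n_states):
            scores = delta[t - 1] * A[:, i]      # delta_t(j) * a_{j,i} for every j
            back[t, i] = np.argmax(scores)
            delta[t, i] = scores[back[t, i]] * B[i, obs[t]]
    # Q*: start from the best final state and follow the pointers backwards
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(delta[-1].max())

# States A=0, B=1; die faces 1..6 mapped to columns 0..5
pi = np.array([0.7, 0.3])
A = np.array([[0.8, 0.2],
              [0.2, 0.8]])
B = np.array([[0.3, 0.2, 0.2, 0.1, 0.1, 0.1],   # P(v | A)
              [0.1, 0.1, 0.1, 0.2, 0.2, 0.3]])  # P(v | B)
obs = [v - 1 for v in [1, 2, 2, 5, 6, 5, 1, 2, 3]]
print(viterbi(pi, A, B, obs))
```

The returned value is the joint probability max_Q P(Q, O); dividing by P(O) (from the forward algorithm) gives P(Q* | O), and for long sequences one would work in log space to avoid underflow.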
Inference in HMMs (recap)
• Computing P(Q) and P(q_t = s_i)
• Computing P(Q | O) and P(q_t = s_i | O)
• Computing argmax_Q P(Q)

Learning HMMs
• Until now we assumed that the emission and transition probabilities are known
• This is usually not the case
  - How is "AI" pronounced by different individuals?
  - What is the probability of hearing "class" after "AI"?
• While we will discuss learning the transition and emission models, we will not discuss selecting the states. This is often the most important task and is heavily dependent on domain knowledge.

Example
• Assume the two-state model (A, B) from before
• We also observe the following sequence: 1, 2, 2, 5, 6, 5, 1, 2, 3, 3, 5, 3, 3, 2, ...
• How can we determine the initial, transition and emission probabilities?

Initial Probabilities
Q: Assume we can observe the following sets of states:
  AAABB   AAAAB   BBBBB   AABBAB
How can we learn the initial probabilities?
A: Maximum likelihood estimation. Find the initial probabilities π such that
  π* = argmax_π ∏_k [ π(q_1^k) ∏_{t=2}^{T} p(q_t^k | q_{t-1}^k) ] = argmax_π ∏_k π(q_1^k)
where k indexes the sequences available for training. For the two-state model this gives
  π_A = #A / (#A + #B)
where #A and #B count how often each state appears as the first state of a sequence.

Transition Probabilities
Q: Assume we can observe the following set of states:
  AAABBAAAABBBBBAAAABBBB
How can we learn the transition probabilities?
A: Maximum likelihood estimation. Find a transition matrix a such that
  a* = argmax_a π(q_1) ∏_{t=2}^{T} p(q_t | q_{t-1}) = argmax_a ∏_{t=2}^{T} p(q_t | q_{t-1})
Remember that we defined a_{i,j} = p(q_t = s_j | q_{t-1} = s_i). For the two-state model this gives, e.g.,
  a_{A,B} = #AB / (#AB + #AA)
i.e. the number of A-to-B transitions divided by the total number of transitions out of A.

Emission Probabilities
Q: Assume we can observe the set of states
  A A A B B A A A A B B B B B A A
and the corresponding set of dice values
  1 2 3 5 6 3 2 1 1 3 4 5 6 5 2 3
How can we learn the emission probabilities?
A: Maximum likelihood estimation, e.g.
  b_A(5) = #A5 / (#A1 + #A2 + ... + #A6)
i.e. the number of times state A emitted a 5, divided by the total number of emissions from A.

Learning HMMs
• In most cases we do not know which states generated each of the outputs (fully unsupervised)
• For these cases we can use expectation maximization (EM) to learn the HMM parameters

Expectation Maximization (EM)
• Appropriate for problems with "missing values" for some of the variables; for example, in HMMs we usually do not observe the states
• Two steps:
  - E step: fill in the expected values for the missing variables
  - M step: regular maximum likelihood estimation (MLE) using the values computed in the E step and the values of the other variables
• The two steps alternate: the E step produces values for the (missing) variables, the M step produces parameters
• Guaranteed to converge (though possibly only to a local maximum of the likelihood)
• EM is another of these very general and highly popular algorithms. The key computational issue is how to derive the expectations in the E step.

Forward-Backward
• We already defined a forward-looking variable:
  α_t(i) = P(O_1 ... O_t, q_t = s_i)
• We also need to define a backward-looking variable:
  β_t(i) = P(O_{t+1}, ..., O_n | q_t = s_i)
• It can be computed recursively, backwards in time (with β_n(i) = 1):
  β_t(i) = Σ_j a_{i,j} b_j(O_{t+1}) β_{t+1}(j)
• Using these two definitions and P(A | B) = P(A, B) / P(B), we can show
  P(q_t = s_i | O_1, ..., O_n) = α_t(i) β_t(i) / Σ_j α_t(j) β_t(j), which we denote S_t(i)

State and Transition Probabilities
• Probability of being in a state:
  S_t(i) = P(q_t = s_i | O_1, ..., O_n) = α_t(i) β_t(i) / Σ_j α_t(j) β_t(j)
• We can also derive a transition probability:
  S_t(i, j) = P(q_t = s_i, q_{t+1} = s_j | o_1, ..., o_n)
            = α_t(i) P(q_{t+1} = s_j | q_t = s_i) P(o_{t+1} | q_{t+1} = s_j) β_{t+1}(j) / Σ_j α_t(j) β_t(j)

E Step
• Compute S_t(i) and S_t(i, j) for all t, i and j (1 ≤ t ≤ n, 1 ≤ i, j ≤ k):
  S_t(i) = P(q_t = s_i | O_1, ..., O_n)
  S_t(i, j) = P(q_t = s_i, q_{t+1} = s_j | o_1, ..., o_n)

M Step (1)
Compute transition probabilities:
  a_{i,j} = n(i, j) / Σ_{j'} n(i, j')
where
  n(i, j) = Σ_t S_t(i, j)

M Step (2)
Compute emission probabilities (here we assume a multinomial distribution). Define
  n_i(v) = Σ_{t : O_t = v} S_t(i)
then
  b_i(v) = n_i(v) / Σ_{v'} n_i(v')
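To make the E step concrete, here is a minimal numpy sketch of the forward-backward computation of S_t(i) and S_t(i, j) for a single observation sequence. It is not from the lecture; it assumes the same (pi, A, B) representation as the Viterbi sketch above, and the names e_step, alpha, beta, S, S2 are mine.

```python
import numpy as np

def e_step(pi, A, B, obs):
    """Return S[t, i] = P(q_t=s_i | O) and S2[t, i, j] = P(q_t=s_i, q_{t+1}=s_j | O)."""
    n_states, T = len(pi), len(obs)
    alpha = np.zeros((T, n_states))   # alpha_t(i) = P(O_1..O_t, q_t = s_i)
    beta = np.zeros((T, n_states))    # beta_t(i)  = P(O_{t+1}..O_T | q_t = s_i)
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]      # sum_j alpha_{t-1}(j) a_{j,i} b_i(O_t)
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])    # sum_j a_{i,j} b_j(O_{t+1}) beta_{t+1}(j)
    likelihood = alpha[-1].sum()                          # P(O_1..O_T) = sum_j alpha_T(j) beta_T(j)
    S = alpha * beta / likelihood
    S2 = np.zeros((T - 1, n_states, n_states))
    for t in range(T - 1):
        S2[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :] / likelihood
    return S, S2, likelihood
```

For long sequences the alpha and beta values underflow, so practical implementations rescale them at every step or work in log space.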
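Continuing the same hypothetical sketch, the M step re-normalizes the expected counts from the slides (n(i, j) = Σ_t S_t(i, j) for transitions, n_i(v) = Σ_{t : O_t = v} S_t(i) for emissions); alternating it with e_step above gives the Baum-Welch form of EM. Again, the names and the single-sequence setup are assumptions, not the lecture's code.

```python
import numpy as np

def m_step(S, S2, obs, n_symbols):
    """Re-estimate (pi, A, B) from the E-step posteriors of one sequence."""
    obs = np.asarray(obs)
    pi_hat = S[0]                                  # P(q_1 = s_i | O)
    counts = S2.sum(axis=0)                        # n(i, j) = sum_t S_t(i, j)
    A_hat = counts / counts.sum(axis=1, keepdims=True)
    B_hat = np.zeros((S.shape[1], n_symbols))
    for v in range(n_symbols):
        B_hat[:, v] = S[obs == v].sum(axis=0)      # n_i(v) = sum over t with O_t = v of S_t(i)
    B_hat /= B_hat.sum(axis=1, keepdims=True)
    return pi_hat, A_hat, B_hat

# One possible EM loop, reusing e_step and the (pi, A, B) arrays defined in the earlier sketches
obs = [v - 1 for v in [1, 2, 2, 5, 6, 5, 1, 2, 3, 3, 5, 3, 3, 2]]
for _ in range(20):
    S, S2, likelihood = e_step(pi, A, B, obs)
    pi, A, B = m_step(S, S2, obs, n_symbols=6)
```

A full implementation would sum the expected counts over all training sequences before normalizing and would stop when the likelihood stops improving; the 20 iterations here are an arbitrary choice.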