Computational Genomics and Molecular Biology, Fall 2009

HMM Lecture Notes, Thursday, November 5th
Dannie Durand and Rose Hoberman

1 Notation

1. N states, S_1, ..., S_N
2. M symbols in the alphabet, Σ
3. Parameters, λ:
   (a) initial distribution of states, π(i)
   (b) transition probabilities, a_ij = P(q_t = S_j | q_{t-1} = S_i). Note that Σ_{j=1}^N a_ij = 1, for all i
   (c) emission probabilities, e_i(a), the probability that state i emits symbol a
4. Sequence of symbols: O = O_1, O_2, ..., O_T
5. Sequence of states: Q = q_1, q_2, ..., q_T

2 Using HMMs for pattern discovery

So far, we have focused on how to use an HMM to ask questions. Now we will consider how to construct a new HMM. There are two main issues: choosing the topology and inferring the parameters for that topology. We will postpone the question of topology until the next lecture and first consider parameter estimation.

2.1 Parameter estimation

Once the topology of the HMM has been determined, the parameters of the model, λ = {a_ij, e_i(·), π_i}, are estimated from observed sequences, O^1, O^2, ..., O^k. There are two cases to consider: labeled and unlabeled data. In both cases, we obtain parameter values using maximum likelihood estimation (MLE).

Labeled data: If the sequences are labeled, the main problem is to design the topology. Once the states and connectivity have been chosen, the transition and emission probabilities can be estimated easily using MLE, as follows:

    π_i = P_i / k,    a_ij = A_ij / Σ_{j'} A_{ij'},    e_i(a) = E_i(a) / Σ_b E_i(b),

where P_i is the number of sequences in which the first symbol is labeled with state i, A_ij is the number of adjacent symbol pairs labeled with states i and j, respectively, and E_i(a) is the number of instances of symbol a labeled with state i. Note that for some models, we may want to use modified versions of the above equations that incorporate pseudocounts in the calculation of the parameters.

Unlabeled data: If the sequences are unlabeled, then it is necessary both to design the topology and to learn the motif and the model parameters.
Given sequences O^1, O^2, ..., we wish to determine λ*, the parameter values that maximize the likelihood of the sequences:

    λ* = argmax_λ Π_d P(O^d | λ) = argmax_λ Π_d Σ_Q P(O^d, Q | λ)

However, finding a global maximum is intractable: we would have to enumerate all parameter sets, λ, as well as all state paths, Q, for each sequence, O^d. Instead, people settle for heuristics that are guaranteed to find at least a local maximum. Since these are heuristics, evaluation is usually done empirically by withholding some of the training data for testing, but we will not discuss this further.

The algorithm most commonly used for estimating the parameter values, the Baum-Welch algorithm, belongs to a family of algorithms called Expectation Maximization (EM) algorithms. They all work by guessing initial parameter values, then estimating the likelihood of the data under the current parameters. These likelihoods are then used to re-estimate the parameters, iteratively, until a local maximum is reached. The Baum-Welch algorithm learns the parameters from the data and, implicitly, also discovers the motif. To determine the motif explicitly, we use the Viterbi algorithm on the new HMM to label the states of each input sequence.

The intuition behind the algorithm is as follows. We are given the observed, unlabeled sequences (denoted O^d = O^d_1, O^d_2, ...). Let L(λ) = Π_d P(O^d | λ) be the score of the parameters λ.

    Choose initial values, λ = (π, a_ij, e_i(·)).
    Repeat:
        For each sequence, O^d:
            Label O^d using posterior decoding, P(q_t = S_i | O^d) = α_t(i) · β_{t+1}(i) / P(O^d)
        Estimate new parameter values, λ, using the method for labeled data described above
    until L(λ) is stable.

The process of labeling the sequences using posterior decoding and re-estimating the parameters is achieved by repeated applications of the Forward and Backward algorithms to obtain α_t(i) and β_{t+1}(i).
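A single pass of this re-estimation loop can be sketched as follows. This is a minimal illustration under my own assumptions (NumPy, integer-coded symbols), not the formal statement from the notes; the code's beta[t, i] corresponds to the notes' β_{t+1}(i), and a practical implementation would rescale α and β (or work in log space) to avoid underflow on long sequences:

```python
import numpy as np

def baum_welch_step(seqs, pi, a, e):
    """One Baum-Welch re-estimation step for a discrete HMM.

    seqs: list of observation sequences, each a list of symbol indices.
    pi[i], a[i, j], e[i, sym] follow the notation of the notes; here
    beta[t, i] = P(O_{t+1}..O_T | q_t = S_i), i.e. the notes' beta_{t+1}(i).
    Returns re-estimated (pi, a, e); iterate until the likelihood is stable.
    """
    N, M = e.shape
    P = np.zeros(N)        # expected number of sequences starting in state i
    A = np.zeros((N, N))   # expected transition counts A_ij
    E = np.zeros((N, M))   # expected emission counts E_i(sigma)
    for O in seqs:
        T = len(O)
        # Forward: alpha[t, i] = P(O_1..O_t, q_t = S_i | lambda)
        alpha = np.zeros((T, N))
        alpha[0] = pi * e[:, O[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ a) * e[:, O[t]]
        # Backward: beta[t, i] = P(O_{t+1}..O_T | q_t = S_i, lambda)
        beta = np.ones((T, N))
        for t in range(T - 2, -1, -1):
            beta[t] = a @ (e[:, O[t + 1]] * beta[t + 1])
        prob = alpha[-1].sum()  # P(O_d | lambda), weights this sequence's counts
        P += alpha[0] * beta[0] / prob
        for t in range(T - 1):
            # xi[i, j] = P(q_t = S_i, q_{t+1} = S_j | O_d, lambda)
            A += np.outer(alpha[t], e[:, O[t + 1]] * beta[t + 1]) * a / prob
        for t in range(T):
            # posterior decoding: P(q_t = S_i | O_d, lambda)
            E[:, O[t]] += alpha[t] * beta[t] / prob
    # Renormalize the expected counts, exactly as in the labeled-data MLE.
    return (P / P.sum(),
            A / A.sum(axis=1, keepdims=True),
            E / E.sum(axis=1, keepdims=True))
```

Iterating baum_welch_step until the likelihood stabilizes yields a local maximum; real implementations also add pseudocounts so that no transition or emission probability is driven to zero.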
The contribution of each sequence to the new parameter values should be weighted by 1/P(O^d | λ), the inverse of the probability of the sequence under the current model parameters. This probability can be computed using either the Forward or the Backward algorithm.

More formally, for a given set of parameter values, λ, and a given sequence, O^d, the probability of transiting from state i to state j at time t is

    P(q^d_t = S_i, q^d_{t+1} = S_j | O^d, λ) = P(q^d_t = S_i, q^d_{t+1} = S_j, O^d | λ) / P(O^d | λ)
                                             = α_t(i) a_ij e_j(O^d_{t+1}) β_{t+2}(j) / P(O^d | λ)

The term α_t(i) is the probability that the model has emitted symbols O^d_1 ... O^d_t and is in state S_i at time t. This probability can be obtained using the Forward algorithm. Similarly, the Backward algorithm yields β_{t+2}(j), the probability of emitting the rest of the sequence, O^d_{t+2} ... O^d_T, given that the model is in state S_j at time t+1. The remaining two terms, a_ij and e_j(O^d_{t+1}), give the probability of making the transition from i to j and emitting the (t+1)st character. From this we can estimate

    A_ij = Σ_d (1 / P(O^d)) Σ_t α_t(i) a_ij e_j(O^d_{t+1}) β_{t+2}(j)    (1)

The probability P(O^d) can be computed under the current parameter values using the Forward algorithm. Similarly,

    E_i(σ) = Σ_d (1 / P(O^d)) Σ_{t : O^d_t = σ} α_t(i) β_{t+1}(i)    (2)

P_i, the expected number of sequences starting in state i, is given by Σ_d α_1(i) β_2(i) / P(O^d), where α_1(i) = π_i e_i(O^d_1). From P_i, A_ij, and E_i(σ), we re-estimate the parameters as in the labeled case. A formal statement of the Baum-Welch algorithm is given at the end of these notes.

Convergence: It can be proven that if the current estimates are replaced by these new estimates, the likelihood of the data will not decrease (i.e., it will increase unless the parameters are already at a local maximum or critical point). See Durbin, Section 11.6, for a discussion of avoiding local maxima and other typical pitfalls of this algorithm.

2.2 Topology

Finally, we must consider how to design the topology of the HMM. This includes the set of states, S_1, ...
, S_N, and how they are connected; in other words, we must specify which states will be connected by edges with non-zero values of a_ij. We must also choose the alphabet and decide which states will emit symbols.

We could simply choose a fully connected graph, but this has too many parameters to estimate. Instead, we can exploit domain knowledge: choose a topology that limits the number of states and edges while still being expressive enough to represent the relationships believed to exist.

The transmembrane models we discussed at the end of October illustrate some of the issues to consider. Recall that the goal was to model a sequence that started and ended in the