Computational Genomics and Molecular Biology, Fall 2010

HMM Lecture Notes, Thursday, November 11th
Dannie Durand and Rose Hoberman

1 Notation

1. $N$ states, $S_1, \ldots, S_N$
2. $M$ symbols in the alphabet, $\Sigma$
3. Parameters, $\lambda$:
   (a) initial distribution of states, $\pi(i)$
   (b) transition probabilities, $a_{ij} = P(q_t = S_j \mid q_{t-1} = S_i)$. Note that $\sum_{j=1}^{N} a_{ij} = 1$ for all $i$
   (c) emission probabilities, $e_i(a)$, the probability that state $i$ emits symbol $a$
4. Sequence of symbols: $O = O_1, O_2, \ldots, O_T$
5. Sequence of states: $Q = q_1, q_2, \ldots, q_T$

2 Using HMMs for pattern discovery

So far, we have focused on how to use an HMM to ask questions. Now we will consider how to construct a new HMM. There are two main issues: choosing the topology and inferring the parameters for that topology. We will postpone the question of topology until the next lecture and first consider parameter estimation.

2.1 Parameter estimation

Once the topology of the HMM has been determined, the parameters of the model, $\lambda = \{a_{ij}, e_i(\cdot), \pi_i\}$, are estimated from observed sequences, $O^1, O^2, \ldots, O^k$. There are two cases to consider: labeled and unlabeled data. In both cases, we obtain parameter values by maximum likelihood estimation (MLE).

Labeled data: If the sequences are labeled, the main problem is to design the topology. Once the states and connectivity have been chosen, the transition and emission probabilities can be estimated easily by MLE, as follows:

\[ \pi_i = \frac{P_i}{k}, \qquad a_{ij} = \frac{A_{ij}}{\sum_{j'} A_{ij'}}, \qquad e_i(a) = \frac{E_i(a)}{\sum_b E_i(b)}, \]

where $P_i$ is the number of sequences in which the first symbol is labeled with state $i$, $A_{ij}$ is the number of pairs of adjacent symbols labeled with states $i$ and $j$, respectively, and $E_i(a)$ is the number of instances of symbol $a$ labeled with state $i$. Note that for some models, we may want to use modified versions of the above equations that incorporate pseudocounts in the calculation of the parameters.
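To make the counting concrete, here is a minimal Python sketch of the labeled-data case. It is not from the notes: the function names and the (symbol, state) input encoding are our own, and the pseudocount argument implements the kind of modified, smoothed equations alluded to above.

```python
def normalize(counts):
    """Turn a dictionary of counts into a probability distribution."""
    total = sum(counts.values())
    return {key: c / total for key, c in counts.items()} if total > 0 else dict(counts)

def estimate_labeled(sequences, states, alphabet, pseudocount=0.0):
    """MLE of HMM parameters from labeled sequences.

    sequences: list of sequences, each a list of (symbol, state) pairs.
    Implements pi_i = P_i / k, a_ij = A_ij / sum_j' A_ij', and
    e_i(a) = E_i(a) / sum_b E_i(b) by direct counting.
    """
    k = len(sequences)
    P = {i: 0.0 for i in states}                                  # P_i
    A = {i: {j: pseudocount for j in states} for i in states}     # A_ij
    E = {i: {s: pseudocount for s in alphabet} for i in states}   # E_i(a)

    for seq in sequences:
        P[seq[0][1]] += 1                        # state labeling the first symbol
        for sym, state in seq:
            E[state][sym] += 1                   # emission counts
        for (_, s1), (_, s2) in zip(seq, seq[1:]):
            A[s1][s2] += 1                       # adjacent label pairs

    pi = {i: P[i] / k for i in states}
    a = {i: normalize(A[i]) for i in states}
    e = {i: normalize(E[i]) for i in states}
    return pi, a, e
```

With pseudocount=0 this is the plain MLE; with pseudocount=1 it applies Laplace smoothing to the transition and emission counts.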
Unlabeled data: If the sequences are unlabeled, then it is necessary both to design the topology and to learn the motif and the model parameters. Given sequences $O^1, O^2, \ldots$, we wish to determine $\lambda^*$, the parameter values that maximize the likelihood of the sequences:

\[ \lambda^* = \operatorname{argmax}_{\lambda} \sum_d P(O^d \mid \lambda) = \operatorname{argmax}_{\lambda} \sum_d \sum_Q P(O^d, Q \mid \lambda). \]

However, finding a global maximum is intractable: we would have to enumerate all parameter sets, $\lambda$, as well as all state paths, $Q$, for each sequence, $O^d$. Instead, people settle for heuristics that are guaranteed to find at least a local maximum. Since these are heuristics, evaluation is usually done empirically, by withholding some of the training data for testing, but we will not discuss this further.

The algorithm most commonly used for estimating the parameter values, the Baum-Welch algorithm, belongs to a family of algorithms called Expectation Maximization (EM) algorithms. They all work by guessing initial parameter values, estimating the likelihood of the data under the current parameters, and then using those likelihoods to re-estimate the parameters, iterating until a local maximum is reached. The Baum-Welch algorithm learns the parameters from the data and, implicitly, also discovers the motif. To determine the motif explicitly, we run the Viterbi algorithm with the new HMM to label the states of each input sequence.

The intuition behind the algorithm is as follows. We are given the observed, unlabeled sequences (denoted $O^d = O^d_1, O^d_2, \ldots$). For a given choice of state paths, $Q$, let $\mathcal{L}(\lambda) = \sum_d P(O^d \mid \lambda, Q)$ be the score of the parameters $\lambda$.

    Choose initial values, $\lambda = (\pi, a_{ij}, e_i(\cdot))$.
    Repeat:
        For each sequence, $O^d$, label $O^d$ using posterior decoding,
            $P(q_t = S_i \mid O^d) = \alpha_t(i)\,\beta_t(i)/P(O^d)$.
        Estimate new parameter values, $\lambda$, using the method for labeled data described above.
    Until $\mathcal{L}(\lambda)$ is stable.

The process of labeling the sequences by posterior decoding and re-estimating the parameters is carried out by repeated applications of the Forward and Backward algorithms to obtain $\alpha_t(i)$ and $\beta_t(i)$. The contribution of each sequence to the new parameter values is normalized by $P(O^d \mid \lambda)$, the probability of the sequence under the current model parameters. This probability can be computed using either the Forward or the Backward algorithm.

More formally, for a given set of parameter values, $\lambda$, and a given sequence, $O^d$, the posterior probability of a transition from state $i$ to state $j$ at time $t$ is

\[ P(q^d_t = i, q^d_{t+1} = j \mid O^d, \lambda) = P(O^d \mid \lambda)^{-1} \cdot P(q^d_t = i, q^d_{t+1} = j, O^d \mid \lambda) = P(O^d \mid \lambda)^{-1} \cdot \alpha_t(i)\, a_{ij}\, e_j(O^d_{t+1})\, \beta_{t+1}(j). \]

The term $\alpha_t(i)$ is the probability that the model has emitted symbols $O^d_1 \ldots O^d_t$ and is in state $S_i$ at time $t$; it is obtained using the Forward algorithm. Similarly, the Backward algorithm yields $\beta_{t+1}(j)$, the probability of emitting the rest of the sequence, $O^d_{t+2}, \ldots, O^d_T$, given that the model is in state $S_j$ at time $t+1$. The remaining two terms, $a_{ij}$ and $e_j(O^d_{t+1})$, give the probability of making the transition from $i$ to $j$ and emitting the $(t+1)$st character. From this we can estimate the expected transition counts,

\[ A_{ij} = \sum_d P(O^d)^{-1} \sum_t \alpha_t(i)\, a_{ij}\, e_j(O^d_{t+1})\, \beta_{t+1}(j). \tag{1} \]

The probability $P(O^d)$ is computed under the current parameter values using the Forward algorithm. Similarly, the expected emission counts are

\[ E_i(\sigma) = \sum_d P(O^d)^{-1} \sum_{\{t \mid O^d_t = \sigma\}} \alpha_t(i)\, \beta_t(i). \tag{2} \]

Finally, $P_i$, the expected number of sequences starting in state $i$, is given by $\sum_d \pi_i\, e_i(O^d_1)\, \beta_1(i)/P(O^d)$. From $P_i$, $A_{ij}$, and $E_i(\sigma)$ we re-estimate the parameters using the labeled-data formulas above, with $k$ the total number of sequences. A formal statement of the Baum-Welch algorithm is given at the end of these notes.

Convergence: It can be proven that if the current estimates are replaced by these new estimates, the likelihood of the data will not decrease (i.e., it will increase unless the parameters are already at a local maximum or other critical point). See Durbin, Section 11.6, for a discussion of avoiding local maxima and other typical pitfalls of this algorithm.
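As a companion to the derivation, here is a Python sketch of a single Baum-Welch iteration. It assumes the conventions used above ($\alpha_t(i) = P(O_1 \ldots O_t, q_t = S_i \mid \lambda)$ and $\beta_t(i) = P(O_{t+1} \ldots O_T \mid q_t = S_i, \lambda)$) and that each sequence is an array of integer symbol indices; the function names and array layout are illustrative, not from the notes.

```python
import numpy as np

def forward(pi, a, e, obs):
    """alpha[t, i] = P(O_1..O_t, q_t = S_i | lambda); also returns P(O | lambda)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * e[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ a) * e[:, obs[t]]
    return alpha, alpha[-1].sum()

def backward(a, e, obs):
    """beta[t, i] = P(O_{t+1}..O_T | q_t = S_i, lambda)."""
    T, N = len(obs), a.shape[0]
    beta = np.zeros((T, N))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = a @ (e[:, obs[t + 1]] * beta[t + 1])
    return beta

def baum_welch_step(pi, a, e, seqs):
    """One EM iteration: accumulate the expected counts P_i, A_ij, and E_i(sigma)
    of equations (1) and (2), then re-estimate with the labeled-data formulas."""
    N, M = e.shape
    P, A, E = np.zeros(N), np.zeros((N, N)), np.zeros((N, M))
    for obs in seqs:
        alpha, prob = forward(pi, a, e, obs)
        beta = backward(a, e, obs)
        gamma = alpha * beta / prob        # gamma[t, i] = P(q_t = S_i | O, lambda)
        P += gamma[0]                      # pi_i e_i(O_1) beta_1(i) / P(O)
        for t, sym in enumerate(obs):
            E[:, sym] += gamma[t]          # equation (2)
        for t in range(len(obs) - 1):      # equation (1), one term per position
            A += np.outer(alpha[t], e[:, obs[t + 1]] * beta[t + 1]) * a / prob
    return (P / len(seqs),
            A / A.sum(axis=1, keepdims=True),
            E / E.sum(axis=1, keepdims=True))
```

Iterating baum_welch_step and monitoring $\sum_d \log P(O^d \mid \lambda)$ until it stabilizes implements the loop sketched above. A production implementation would rescale $\alpha$ and $\beta$, or work in log space, to avoid numerical underflow on long sequences; this sketch omits that for clarity.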


2.2 Topology

Finally, we must consider how to design the topology of the HMM. This includes the set of states, $S_1, \ldots, S_N$, and how they are connected; in other words, we must specify which states will be connected by edges with non-zero values of $a_{ij}$. We must also choose the alphabet and decide which states will emit symbols.

We could simply choose a fully connected graph, but such a model has too many parameters to estimate. Instead, we can exploit domain knowledge and choose a topology that limits the number of states and edges while remaining expressive enough to represent the relationships we believe to exist (one way to encode such a restriction is sketched below).

The transmembrane models we discussed at the end of October illustrate some of the issues to consider. Recall that the goal was to model a sequence that started and ...
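A simple way to pin down a restricted topology is a binary adjacency mask that keeps forbidden transitions at exactly zero. Because the expected counts in equation (1) are proportional to $a_{ij}$, Baum-Welch re-estimation (without pseudocounts on the forbidden edges) preserves these zeros, so the topology survives training. The three-state layout below is a toy illustration of the idea, not the transmembrane model from the October lecture.

```python
import numpy as np

# Toy topology: S_0 <-> S_1 <-> S_2, with no direct S_0 <-> S_2 edge.
# allowed[i, j] = 1 if the topology includes an edge from S_i to S_j.
allowed = np.array([[1, 1, 0],
                    [1, 1, 1],
                    [0, 1, 1]], dtype=float)

rng = np.random.default_rng(0)
a = rng.random((3, 3)) * allowed      # random initial weights on allowed edges only
a /= a.sum(axis=1, keepdims=True)     # normalize rows so that sum_j a_ij = 1
```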
