Computational Genomics and Molecular Biology, Fall 2011
HMM Lecture Notes, October 11
Dannie Durand and Rose Hoberman

1 Hidden Markov Models

In the last few lectures, we have focused on three problems related to local multiple sequence alignments:

• Discovery
• Modeling
• Recognition

We have also discussed the use of Hidden Markov Models (HMMs) to model motifs. In today's lecture, we discuss two algorithms that are used to solve the recognition problem with HMMs: the Viterbi algorithm and the forward algorithm. The problem of motif discovery is left for future lectures.

HMMs are defined formally as follows:

1. N states, S_1, ..., S_N
2. M symbols in the alphabet
3. A transition probability matrix, a_ij
4. Emission probabilities, e_i(a): the probability that state i emits character a
5. An initial distribution vector, π = (π_i)

We refer to the transition probabilities, the emission probabilities, and the initial distribution collectively as the parameters, designated λ = (a_ij, e_i(a), π). Following the notation used in Durbin, we will refer to the sequence of observed symbols as O = O_1, O_2, O_3, ... and the sequence of states visited as Q = q_1, q_2, q_3, ... (the "state path"). When considering more than one sequence or state path, we will use superscripts to distinguish them: O^i = O^i_1, O^i_2, O^i_3, ... and Q^j = q^j_1, q^j_2, q^j_3, ...

The parameters of an HMM can be learned from labeled data:

    a_{ij} = \frac{A_{ij}}{\sum_{j'} A_{ij'}}, \qquad e_i(x) = \frac{E_i(x)}{\sum_{x'} E_i(x')},

where A_ij is the number of transitions from state S_i to S_j in the training data and E_i(x) is the number of instances of the symbol x that are labeled with state S_i.

In an HMM, each state emits symbols from a fixed alphabet each time the state is visited. Emission probabilities are state-dependent, but not time-dependent. A symbol may be emitted by more than one state; similarly, a state can emit more than one symbol.

Note that an HMM is a generative model: it gives the probability of generating a particular sequence (hence, the emission probabilities). This allows us to ask: given an observed sequence, O, and an HMM, what is the probability that O was generated by this HMM?

In an HMM there may be more than one, and often very many, state paths associated with O. Therefore, the "true" sequence of states that generated the observed sequence is unknown, or hidden; hence the term "hidden" Markov model. The state path is hidden because it is not possible to tell the state merely from the output symbol. This hidden sequence of states corresponds to what we want to know, namely the classification of each symbol.
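As a concrete illustration of the counting formulas above, here is a minimal Python sketch that estimates a_ij and e_i(x) from labeled training sequences. The function name, the toy labeled sequence, and its state and symbol names are hypothetical.

```python
from collections import defaultdict

def estimate_parameters(labeled_seqs):
    """Estimate HMM parameters from labeled data.

    labeled_seqs: list of sequences, each a list of (state, symbol) pairs.
    Returns (a, e): a[i][j] = P(next state j | current state i),
                    e[i][x] = P(state i emits symbol x).
    """
    A = defaultdict(lambda: defaultdict(int))  # A[i][j]: transition counts
    E = defaultdict(lambda: defaultdict(int))  # E[i][x]: emission counts
    for seq in labeled_seqs:
        for t, (state, symbol) in enumerate(seq):
            E[state][symbol] += 1
            if t + 1 < len(seq):
                A[state][seq[t + 1][0]] += 1
    # Normalize counts into probabilities: a_ij = A_ij / sum_j' A_ij'
    a = {i: {j: n / sum(row.values()) for j, n in row.items()} for i, row in A.items()}
    e = {i: {x: n / sum(row.values()) for x, n in row.items()} for i, row in E.items()}
    return a, e

# Hypothetical labeled data: states C/M/E, symbols are amino acids.
train = [[("C", "K"), ("C", "R"), ("M", "L"), ("M", "V"), ("E", "N")]]
a, e = estimate_parameters(train)
print(a["C"])  # {'C': 0.5, 'M': 0.5}
print(e["M"])  # {'L': 0.5, 'V': 0.5}
```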
1.1 HMMs and probabilities

An HMM emits every sequence O^i ∈ Σ* with probability P(O^i) ≥ 0. Every time an HMM is executed, it emits some sequence; therefore,

    \sum_i P(O^i) = 1.

Since each sequence can, potentially, be generated by several state paths, P(O) = \sum_j P(O \mid Q^j) \cdot P(Q^j). Fig. 1 shows P(O, Q) over the set of all possible state paths, Q. From this, we obtain

    \sum_i P(O^i) = 1
    \sum_i \sum_j P(O^i \mid Q^j) \cdot P(Q^j) = 1
    \sum_i \sum_j P(O^i, Q^j) = 1

Figure 1: The probability of sequence O and state path Q. The area under this curve is P(O). The maximum point on the curve is the most likely path, Q*.

2 Recognition questions

Given a sequence, O, there are several questions we may wish to ask:

1. What is the true path? Otherwise stated, we wish to assign labels to an unlabeled sequence.
   Example: Identify the cytosolic, transmembrane, and extracellular regions in the sequence. In this case, we wish to assign the labels E, M, or C to the unlabeled data.

2. What is the probability that a given sequence, O, was generated by the HMM?
   Example: Is the sequence a transmembrane protein?

For the first question, we use the Viterbi algorithm to find the most probable path through the HMM, given O, which we take as an estimate of the true path. For the second question, we use the forward algorithm to calculate P(O).

2.1 The Viterbi algorithm

In the transmembrane example, the amino acid sequence is known, but the subcellular location of each residue is hidden. In the HMM, the subcellular location of each residue is represented by the (hidden) state that emitted the symbol associated with that residue. We infer the subcellular locations of the residues by inferring the sequence of states visited by the HMM. The boundaries between the transmembrane, extracellular, and cytosolic regions are defined by the transitions between the C, M, and E states along this state path.

We approach this problem by assuming that the most likely path, Q* = argmax_Q P(Q | O), is a good estimate of the sequence of states that generated the observed sequence O. This process is called "decoding" because we decode the sequence of symbols to determine the hidden sequence of states. HMMs were developed in the field of speech recognition, where recorded speech is "decoded" into words or phonemes to determine the meaning of the utterance.

We could use brute force and calculate P(Q | O) for every path, but this becomes intractable as soon as the number of states grows, since the number of state sequences grows exponentially (N^T). Instead, we calculate argmax_Q P(Q, O) using a dynamic programming algorithm called the Viterbi algorithm. Note that this still gives us the most probable path, because the path that maximizes P(Q, O) also maximizes P(Q | O):

    \operatorname{argmax}_Q P(Q \mid O) = \operatorname{argmax}_Q \frac{P(Q, O)}{P(O)} = \operatorname{argmax}_Q P(Q, O).

Let δ(t, i) be the probability of observing the first t residues and ending up in state S_i via the most probable path. We calculate δ(t, i) as follows:

Initialization:
    \delta(1, i) = \pi_i \, e_i(O_1)

Recursion:
    \delta(t, i) = \max_{1 \le j \le N} \delta(t-1, j) \cdot a_{ji} \cdot e_i(O_t)

Final: the probability of the most probable path is
    P(Q^*) = \max_{1 \le i \le N} \delta(T, i)

The final state, q_T, is the state that maximizes δ(T, i). The state path can be reconstructed by tracing back through the dynamic programming matrix, similar to the traceback in pairwise sequence alignment.

Running time: O(TN^2). There are TN entries in the dynamic programming matrix, and each entry requires calculating N terms.
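To make the recursion concrete, here is a minimal Python sketch of the Viterbi algorithm with traceback. The two-state toy model (the states M and E, the symbols, and all parameter values) is hypothetical.

```python
def viterbi(obs, states, pi, a, e):
    """Most probable state path for the observed sequence obs.

    pi[i]: initial probability of state i
    a[i][j]: transition probability from state i to state j
    e[i][x]: probability that state i emits symbol x
    Returns (best_path, P(Q*, O)).
    """
    # delta[t][i]: probability of the best path for obs[:t+1] ending in state i
    delta = [{i: pi[i] * e[i][obs[0]] for i in states}]
    back = [{}]  # back[t][i]: predecessor of state i on the best path
    for t in range(1, len(obs)):
        delta.append({})
        back.append({})
        for i in states:
            # Recursion: delta(t, i) = max_j delta(t-1, j) * a_ji * e_i(O_t)
            j_best = max(states, key=lambda j: delta[t - 1][j] * a[j][i])
            delta[t][i] = delta[t - 1][j_best] * a[j_best][i] * e[i][obs[t]]
            back[t][i] = j_best
    # Final step and traceback through the dynamic programming matrix
    last = max(states, key=lambda i: delta[-1][i])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1], delta[-1][last]

# Hypothetical two-state model: M = membrane, E = extracellular;
# symbols L (hydrophobic) and N (polar).
states = ["M", "E"]
pi = {"M": 0.5, "E": 0.5}
a = {"M": {"M": 0.8, "E": 0.2}, "E": {"M": 0.2, "E": 0.8}}
e = {"M": {"L": 0.8, "N": 0.2}, "E": {"L": 0.3, "N": 0.7}}
path, p = viterbi("LLNN", states, pi, a, e)
print(path, p)  # ['M', 'M', 'E', 'E'] with joint probability P(Q*, O) ~ 0.02
```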


2.2 The Forward Algorithm

We may also wish to determine the probability that a particular sequence, O, was generated by the model, in order to compare the likelihoods of different sequences, or to compare the likelihood of one sequence under different models. For example, suppose we want to compute the probability of observing …
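The standard forward recursion has the same shape as the Viterbi recursion, with the max over predecessor states replaced by a sum, so that α(t, i) accumulates the probability of all paths that emit the first t symbols and end in state S_i. A minimal sketch, reusing the hypothetical toy model from the Viterbi example (the α notation and the parameters are assumptions, not definitions from these notes):

```python
def forward(obs, states, pi, a, e):
    """P(O): probability of obs summed over all state paths.

    alpha[i] = P(O_1..O_t, q_t = i): probability of emitting the first t
    symbols and ending in state i, summed over all paths to that state.
    """
    alpha = {i: pi[i] * e[i][obs[0]] for i in states}
    for t in range(1, len(obs)):
        # Recursion: alpha(t, i) = [sum_j alpha(t-1, j) * a_ji] * e_i(O_t)
        alpha = {i: sum(alpha[j] * a[j][i] for j in states) * e[i][obs[t]]
                 for i in states}
    return sum(alpha.values())  # P(O) = sum_i alpha(T, i)

# Hypothetical two-state toy model, as in the Viterbi sketch.
states = ["M", "E"]
pi = {"M": 0.5, "E": 0.5}
a = {"M": {"M": 0.8, "E": 0.2}, "E": {"M": 0.2, "E": 0.8}}
e = {"M": {"L": 0.8, "N": 0.2}, "E": {"L": 0.3, "N": 0.7}}
print(forward("LLNN", states, pi, a, e))  # P(O) ~ 0.058, at least P(Q*, O) from Viterbi
```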
