Computational Genomics and Molecular Biology, Fall 2011
HMM Lecture Notes, October 11
Dannie Durand and Rose Hoberman

1 Hidden Markov Models

In the last few lectures, we have focused on three problems related to local multiple sequence alignments:

• Discovery
• Modeling
• Recognition

We have also discussed the use of Hidden Markov Models (HMMs) to model motifs. In today's lecture, we discuss two algorithms that are used to solve the recognition problem with HMMs: the Viterbi algorithm and the forward algorithm. The problem of motif discovery is left for future lectures.

HMMs are defined formally as follows:

1. N states, S_1, ..., S_N
2. M symbols in the alphabet
3. A transition probability matrix, a_ij
4. Emission probabilities, e_i(a): the probability that state i emits character a
5. An initial distribution vector, π = (π_i)

We refer to the transition probabilities, the emission probabilities, and the initial distribution collectively as the parameters, designated λ = (a_ij, e_i(a), π). Following the notation used in Durbin, we will refer to the sequence of observed symbols as O = O_1, O_2, O_3, ... and the sequence of states visited as Q = q_1, q_2, q_3, ... (the "state path"). When considering more than one sequence or state path, we will use superscripts to distinguish them: O^i = O^i_1, O^i_2, O^i_3, ... and Q^j = q^j_1, q^j_2, q^j_3, ...

The parameters of an HMM can be learned from labeled data:

    a_{ij} = \frac{A_{ij}}{\sum_{j'} A_{ij'}}, \qquad e_i(x) = \frac{E_i(x)}{\sum_{x'} E_i(x')},

where A_ij is the number of transitions from state S_i to S_j in the training data and E_i(x) is the number of instances of the symbol x that are labeled with state S_i.

In an HMM, each state emits symbols from a fixed alphabet each time the state is visited. Emission probabilities are state-dependent, but not time-dependent. A symbol may be emitted by more than one state; similarly, a state can emit more than one symbol.

Note that an HMM is a generative model: it gives the probability of generating a particular sequence (hence, the emission probabilities). This allows us to ask: given an observed sequence, O, and an HMM, what is the probability that O was generated by this HMM?

In an HMM there may be more than one, and often very many, state paths associated with O. Therefore, the "true" sequence of states that generated the observed sequence is unknown, or hidden; hence the term "hidden" Markov model. The state path is hidden because it is not possible to tell the state merely from the output symbol. This hidden sequence of states corresponds to what we want to know, namely the classification of each symbol.
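As a concrete illustration of the counting formulas above, here is a minimal Python sketch that estimates a_ij and e_i(x) from labeled training sequences. The function name, the toy labeled sequence, and its state and symbol names are hypothetical.

```python
from collections import defaultdict

def estimate_parameters(labeled_seqs):
    """Estimate HMM parameters from labeled data.

    labeled_seqs: list of sequences, each a list of (state, symbol) pairs.
    Returns (a, e): a[i][j] = P(next state j | current state i),
                    e[i][x] = P(state i emits symbol x).
    """
    A = defaultdict(lambda: defaultdict(int))  # A[i][j]: transition counts
    E = defaultdict(lambda: defaultdict(int))  # E[i][x]: emission counts
    for seq in labeled_seqs:
        for t, (state, symbol) in enumerate(seq):
            E[state][symbol] += 1
            if t + 1 < len(seq):
                A[state][seq[t + 1][0]] += 1
    # Normalize counts into probabilities: a_ij = A_ij / sum_j' A_ij'
    a = {i: {j: n / sum(row.values()) for j, n in row.items()} for i, row in A.items()}
    e = {i: {x: n / sum(row.values()) for x, n in row.items()} for i, row in E.items()}
    return a, e

# Hypothetical labeled data: states C/M/E, symbols are amino acids.
train = [[("C", "K"), ("C", "R"), ("M", "L"), ("M", "V"), ("E", "N")]]
a, e = estimate_parameters(train)
print(a["C"])  # {'C': 0.5, 'M': 0.5}
print(e["M"])  # {'L': 0.5, 'V': 0.5}
```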
1.1 HMMs and probabilities

An HMM emits every sequence O^i ∈ Σ* with probability P(O^i) ≥ 0. Every time an HMM is executed, it emits some sequence; therefore,

    \sum_i P(O^i) = 1.

Since each sequence can, potentially, be generated by several state paths, P(O) = \sum_j P(O \mid Q^j) \cdot P(Q^j). Fig. 1 shows P(O, Q) over the set of all possible state paths, Q. From this, we obtain

    \sum_i P(O^i) = 1
    \sum_i \sum_j P(O^i \mid Q^j) \cdot P(Q^j) = 1
    \sum_i \sum_j P(O^i, Q^j) = 1

Figure 1: The probability of sequence O and state path Q. The area under this curve is P(O). The maximum point on the curve is the most likely path, Q*.

2 Recognition questions

Given a sequence, O, there are several questions we may wish to ask:

1. What is the true path? Otherwise stated, we wish to assign labels to an unlabeled sequence.
   Example: Identify the cytosolic, transmembrane, and extracellular regions in the sequence. In this case, we wish to assign the labels E, M, or C to the unlabeled data.

2. What is the probability that a given sequence, O, was generated by the HMM?
   Example: Is the sequence a transmembrane protein?

For the first question, we use the Viterbi algorithm to find the most probable path through the HMM, given O, which we take as an estimate of the true path. For the second question, we use the forward algorithm to calculate P(O).

2.1 The Viterbi algorithm

In the transmembrane example, the amino acid sequence is known, but the subcellular location of each residue is hidden. In the HMM, the subcellular location of each residue is represented by the (hidden) state that emitted the symbol associated with that residue. We infer the subcellular locations of the residues by inferring the sequence of states visited by the HMM. The boundaries between the transmembrane, extracellular, and cytosolic regions are defined by the transitions between the C, M, and E states along this state path.

We approach this problem by assuming that the most likely path, Q* = argmax_Q P(Q | O), is a good estimate of the sequence of states that generated the observed sequence O. This process is called "decoding" because we decode the sequence of symbols to determine the hidden sequence of states. HMMs were developed in the field of speech recognition, where recorded speech is "decoded" into words or phonemes to determine the meaning of the utterance.

We could use brute force and calculate P(Q | O) for every path, but this becomes intractable as soon as the number of states grows, since the number of state sequences grows exponentially (N^T). Instead, we calculate argmax_Q P(Q, O) using a dynamic programming algorithm called the Viterbi algorithm. Note that this still gives us the most probable path, because the path that maximizes P(Q, O) also maximizes P(Q | O):

    \operatorname{argmax}_Q P(Q \mid O) = \operatorname{argmax}_Q \frac{P(Q, O)}{P(O)} = \operatorname{argmax}_Q P(Q, O).

Let δ(t, i) be the probability of observing the first t residues and ending up in state S_i via the most probable path. We calculate δ(t, i) as follows:

Initialization:
    \delta(1, i) = \pi_i \, e_i(O_1)

Recursion:
    \delta(t, i) = \max_{1 \le j \le N} \delta(t-1, j) \cdot a_{ji} \cdot e_i(O_t)

Final: the probability of the most probable path is
    P(Q^*) = \max_{1 \le i \le N} \delta(T, i)

The final state, q_T, is the state that maximizes δ(T, i). The state path can be reconstructed by tracing back through the dynamic programming matrix, similar to the traceback in pairwise sequence alignment.

Running time: O(TN^2). There are TN entries in the dynamic programming matrix, and each entry requires calculating N terms.
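To make the recursion concrete, here is a minimal Python sketch of the Viterbi algorithm with traceback. The two-state toy model (the states M and E, the symbols, and all parameter values) is hypothetical.

```python
def viterbi(obs, states, pi, a, e):
    """Most probable state path for the observed sequence obs.

    pi[i]: initial probability of state i
    a[i][j]: transition probability from state i to state j
    e[i][x]: probability that state i emits symbol x
    Returns (best_path, P(Q*, O)).
    """
    # delta[t][i]: probability of the best path for obs[:t+1] ending in state i
    delta = [{i: pi[i] * e[i][obs[0]] for i in states}]
    back = [{}]  # back[t][i]: predecessor of state i on the best path
    for t in range(1, len(obs)):
        delta.append({})
        back.append({})
        for i in states:
            # Recursion: delta(t, i) = max_j delta(t-1, j) * a_ji * e_i(O_t)
            j_best = max(states, key=lambda j: delta[t - 1][j] * a[j][i])
            delta[t][i] = delta[t - 1][j_best] * a[j_best][i] * e[i][obs[t]]
            back[t][i] = j_best
    # Final step and traceback through the dynamic programming matrix
    last = max(states, key=lambda i: delta[-1][i])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1], delta[-1][last]

# Hypothetical two-state model: M = membrane, E = extracellular;
# symbols L (hydrophobic) and N (polar).
states = ["M", "E"]
pi = {"M": 0.5, "E": 0.5}
a = {"M": {"M": 0.8, "E": 0.2}, "E": {"M": 0.2, "E": 0.8}}
e = {"M": {"L": 0.8, "N": 0.2}, "E": {"L": 0.3, "N": 0.7}}
path, p = viterbi("LLNN", states, pi, a, e)
print(path, p)  # ['M', 'M', 'E', 'E'] with joint probability P(Q*, O) ~ 0.02
```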


2.2 The Forward Algorithm

We may also wish to determine the probability that a particular sequence, O, was generated by the model, in order to compare the likelihoods of different sequences, or to compare the likelihood of one sequence under different models. For example, suppose we want to compute the probability of observing …
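The standard forward recursion has the same shape as the Viterbi recursion, with the max over predecessor states replaced by a sum, so that α(t, i) accumulates the probability of all paths that emit the first t symbols and end in state S_i. A minimal sketch, reusing the hypothetical toy model from the Viterbi example (the α notation and the parameters are assumptions, not definitions from these notes):

```python
def forward(obs, states, pi, a, e):
    """P(O): probability of obs summed over all state paths.

    alpha[i] = P(O_1..O_t, q_t = i): probability of emitting the first t
    symbols and ending in state i, summed over all paths to that state.
    """
    alpha = {i: pi[i] * e[i][obs[0]] for i in states}
    for t in range(1, len(obs)):
        # Recursion: alpha(t, i) = [sum_j alpha(t-1, j) * a_ji] * e_i(O_t)
        alpha = {i: sum(alpha[j] * a[j][i] for j in states) * e[i][obs[t]]
                 for i in states}
    return sum(alpha.values())  # P(O) = sum_i alpha(T, i)

# Hypothetical two-state toy model, as in the Viterbi sketch.
states = ["M", "E"]
pi = {"M": 0.5, "E": 0.5}
a = {"M": {"M": 0.8, "E": 0.2}, "E": {"M": 0.2, "E": 0.8}}
e = {"M": {"L": 0.8, "N": 0.2}, "E": {"L": 0.3, "N": 0.7}}
print(forward("LLNN", states, pi, a, e))  # P(O) ~ 0.058, at least P(Q*, O) from Viterbi
```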
