CMU BSC 03711 - lecture

Computational Genomics and Molecular Biology, Fall 2006
HMM Lecture Notes, October 26th
Dannie Durand and Rose Hoberman

Outline

1. Review
2. Hidden Markov Models
3. The Viterbi algorithm
4. The Forward algorithm
5. The Backward algorithm

1 Review

Last Tuesday, we introduced a boundary detection problem: we are given a sequence and we wish to label each symbol in the sequence according to its class (e.g., introns and exons; alpha helices and beta sheets; genes and non-coding regions). It is difficult to identify sharp boundaries between classes by scoring windows with a fixed-size model. Instead, we use Hidden Markov Models.

As an example, we considered the problem of recognizing transmembrane regions in a transmembrane protein. Initially, we considered a simpler problem: supposing we are given a sequence fragment that is either a transmembrane (TM) region or an extracellular/cytosolic (E/C) region, can we determine which it is?

To do this we constructed two Markov models, a TM model and an E/C model. In these models, each amino acid is encoded as either hydrophobic (H) or hydrophilic (L).

Transmembrane model: (state diagram not reproduced in this extraction)

Extracellular/cytosol model: (state diagram not reproduced in this extraction)

First, given labeled sequences (transmembrane or not transmembrane), we determine the transition probabilities of the two models. We use maximum likelihood to learn the parameters from data. Let A_{ij} be the number of transitions from state i to state j in the training data [1]:

    a_{ij} = A_{ij} / \sum_{j'} A_{ij'}

In this example, we must learn four transition probabilities: a_{HH}, a_{HL}, a_{LH}, and a_{LL}. Given a sequence HHHLLHLHLL..., we count the number of HH pairs to determine A_{HH} and normalize by the number of pairs of the form H*.
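This counting scheme can be sketched in Python. This is a minimal sketch, not code from the notes; the function name is invented, and the single training sequence is the example sequence from the text:

```python
from collections import Counter

def learn_transition_probs(sequences):
    """Maximum-likelihood estimates a_ij = A_ij / sum_j' A_ij',
    where A_ij counts i -> j transitions in the training data."""
    counts = Counter()
    states = set()
    for seq in sequences:
        states.update(seq)
        for i, j in zip(seq, seq[1:]):  # consecutive symbol pairs
            counts[(i, j)] += 1
    probs = {}
    for i in states:
        total = sum(counts[(i, j)] for j in states)  # pairs of the form i*
        for j in states:
            probs[(i, j)] = counts[(i, j)] / total if total else 0.0
    return probs

# The example sequence from the notes: A_HH = 2, pairs starting with H = 5
tm = learn_transition_probs(["HHHLLHLHLL"])
print(tm[("H", "H")])  # 2/5 = 0.4
```

Pseudocounts, as mentioned in the footnote, would be added to each A_{ij} before normalizing.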
The other transition probabilities are estimated similarly.

[1] Note that for some models, we may want to incorporate pseudocounts in the calculation of the parameters.

We estimate the initial probability \pi_i by counting the number of sequences that begin with residue i. Once we have learned the parameters, we can use the model to recognize new transmembrane sequences.

Using this model, we can classify an observed sequence O = O_1, O_2, ..., O_T by its log-odds ratio

    S(O) = \log [ P(O | TM) / P(O | EC) ]

where

    P(O | TM) = \pi_{O_1} \prod_{i=1}^{T-1} a_{O_i O_{i+1}}

is the probability of the observed sequence given the TM model; P(O | EC) is defined similarly.

The above models are useful for classifying a sequence fragment in which all residues are of the same class (i.e., all TM or all E/C), but less useful for finding boundaries in a sequence with transitions from regions of one class to regions of another class.

2 Hidden Markov Models

We now consider the following, more difficult, problem: given a transmembrane protein sequence with TM regions interspersed with E/C regions, label each residue with its class (TM or E/C).

To do this, we constructed the following Hidden Markov Model by adding transitions connecting the TM and E/C models. These new transitions indicate the boundaries between one class of region and another.

Four-state transmembrane HMM: (state diagram not reproduced in this extraction)

HMMs are defined formally as follows:

1. N states S_1, ..., S_N
2. M symbols in the alphabet
3. A transition probability matrix a_{ij}
4. Emission probabilities e_i(a), the probability that state i emits character a
5. An initial distribution vector \pi = (\pi_i)

Components 1-3 specify the structure of the model; 3-5 specify the parameters. We refer to the transition probabilities, the emission probabilities, and the initial distribution collectively as the parameters, designated \lambda = (a_{ij}, e_i(a), \pi).

Notation. The sequence of visited states: Q = q_1, q_2, q_3, ...
(the state sequence). The sequence of emitted symbols: O = O_1, O_2, O_3, ... (the observed sequence).

Note that this is a generative model: it gives the probability of generating a particular sequence (hence, the emission probabilities). This allows us to ask: what is the probability of a sequence of interest, if it were generated by a particular HMM?

HMMs differ from Markov chains in a number of ways:

• In an HMM, the sequence of states visited is hidden. Unlike Markov chains, there is no longer a one-to-one correspondence between states and output symbols. The state sequence is hidden because it is not possible to tell the state merely from the output symbol. This hidden sequence of states corresponds to what we want to know, namely the classification of each symbol.

• A state emits a symbol from a fixed alphabet each time it is visited. Emission probabilities are state-dependent, but not time-dependent.

• A symbol may be emitted by more than one state. For example, in the four-state HMM above, "H" is emitted by states H_M and H_{E/C}. Similarly, a state can emit more than one symbol. An example of this can be seen in the three-state HMM below.

In the four-state model above, we used different states to distinguish between hydrophobic and hydrophilic residues in the TM region. Another possibility is to use only one state for the transmembrane model and use the emission probabilities to distinguish between hydrophobic and hydrophilic residues. This approach is used in the following three-state HMM.
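Under the generative view, the joint probability of a state path Q and an observed sequence O factors as P(Q, O) = \pi_{q_1} e_{q_1}(O_1) \prod_{t=2}^{T} a_{q_{t-1} q_t} e_{q_t}(O_t). A minimal sketch in log space; the two-state model and all parameter values here are illustrative assumptions, not values from the notes:

```python
import math

def log_joint(obs, path, pi, a, e):
    """log P(Q, O) = log pi_{q1} + log e_{q1}(O_1)
                     + sum_t [ log a_{q_{t-1} q_t} + log e_{q_t}(O_t) ]."""
    lp = math.log(pi[path[0]]) + math.log(e[path[0]][obs[0]])
    for t in range(1, len(obs)):
        lp += math.log(a[(path[t - 1], path[t])])  # transition term
        lp += math.log(e[path[t]][obs[t]])         # emission term
    return lp

# Illustrative two-state model (values assumed for this sketch)
pi = {"TM": 0.5, "EC": 0.5}
a = {("TM", "TM"): 0.9, ("TM", "EC"): 0.1,
     ("EC", "EC"): 0.9, ("EC", "TM"): 0.1}
e = {"TM": {"H": 0.8, "L": 0.2}, "EC": {"H": 0.3, "L": 0.7}}

lp = log_joint("HHLL", ["TM", "TM", "EC", "EC"], pi, a, e)
```

Summing log probabilities instead of multiplying probabilities avoids numerical underflow on long sequences.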
Notice that we also now distinguish between extracellular sequences and cytosolic sequences by using separate E and C states.

Three-state transmembrane HMM: (state diagram not reproduced in this extraction)

Given labeled examples, we can still learn parameters for each separate model, then choose reasonable transitions between them (or learn them, given long labeled data).

    a_{ij} = A_{ij} / \sum_{j'} A_{ij'}

    e_i(x) = E_i(x) / \sum_{x'} E_i(x')

Note that we now have to learn the initial probabilities, the transition probabilities and the emission probabilities. The parameters for this model are

    i         E     M     C
    \pi_i     0     0     1
    e_i(H)    0.2   0.9   0.3
    e_i(L)    0.8   0.1   0.7

We assume that all transmembrane sequences start in the cytosol.

3 The Viterbi algorithm

In order to find the boundaries between the transmembrane, extracellular and cytosolic regions, we seek argmax_Q P(Q|O), the most probable path through the model given the observed sequence O. We could use brute force by calculating P(Q|O) for all paths. However, this becomes intractable as soon as the number of states gets larger, since the number of state sequences grows exponentially (N^T).

Instead, we calculate argmax_Q P(Q, O) using a dynamic programming algorithm called the Viterbi algorithm. Note that this still gives us the most probable path, because the path that maximizes P(Q, O) also maximizes P(Q|O).
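The Viterbi recurrence can be sketched in log space. The initial and emission probabilities below are the three-state values given in the text; the transition probabilities are illustrative assumptions, since the notes do not list them:

```python
import math

def viterbi(obs, states, pi, a, e):
    """Most probable state path argmax_Q P(Q, O), computed in log space."""
    def lg(p):
        return math.log(p) if p > 0 else float("-inf")
    # Initialization: v[i] = log pi_i + log e_i(O_1)
    v = {s: lg(pi[s]) + lg(e[s][obs[0]]) for s in states}
    back = []
    for x in obs[1:]:
        prev, v, ptr = v, {}, {}
        for j in states:
            # Recurrence: v[j] = max_i (prev[i] + log a_ij) + log e_j(x)
            best = max(states, key=lambda i: prev[i] + lg(a[(i, j)]))
            v[j] = prev[best] + lg(a[(best, j)]) + lg(e[j][x])
            ptr[j] = best
        back.append(ptr)
    # Traceback from the best final state
    q = max(v, key=v.get)
    path = [q]
    for ptr in reversed(back):
        q = ptr[q]
        path.append(q)
    return path[::-1]

states = ("E", "M", "C")
pi = {"E": 0.0, "M": 0.0, "C": 1.0}   # from the notes: start in the cytosol
e = {"E": {"H": 0.2, "L": 0.8},
     "M": {"H": 0.9, "L": 0.1},
     "C": {"H": 0.3, "L": 0.7}}
a = {("C", "C"): 0.8, ("C", "M"): 0.2, ("C", "E"): 0.0,   # illustrative
     ("M", "M"): 0.7, ("M", "C"): 0.15, ("M", "E"): 0.15,
     ("E", "E"): 0.8, ("E", "M"): 0.2, ("E", "C"): 0.0}

path = viterbi("LLHHHHLL", states, pi, a, e)
print(path)
```

Each position considers N predecessor states for each of N states, so the run time is O(N^2 T) rather than the N^T of brute-force enumeration.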

