CMU BSC 03711 - lecture

Computational Genomics and Molecular Biology, Fall 2006
HMM Lecture Notes, October 26th
Dannie Durand and Rose Hoberman

Outline

1. Review
2. Hidden Markov Models
3. The Viterbi algorithm
4. The Forward algorithm
5. The Backward algorithm

1 Review

Last Tuesday, we introduced a boundary detection problem: we are given a sequence and we wish to label each symbol in the sequence according to its class (e.g., introns and exons; alpha helices and beta sheets; genes and non-coding regions). It is difficult to identify sharp boundaries between classes by scoring windows with a fixed-size model. Instead, we use Hidden Markov Models.

As an example, we considered the problem of recognizing transmembrane regions in a transmembrane protein. Initially, we considered a simpler problem: supposing we are given a sequence fragment that is either a transmembrane (TM) region or an extracellular/cytosolic (E/C) region, can we determine which it is?

To do this we constructed two Markov models, a TM model and an E/C model. In these models, each amino acid is encoded as either hydrophobic (H) or hydrophilic (L).

Transmembrane model: (state diagram not reproduced in this extraction)

Extracellular/cytosol model: (state diagram not reproduced in this extraction)

First, given labeled sequences (transmembrane or not transmembrane), we determine the transition probabilities of the two models. We use maximum likelihood to learn the parameters from data. Let A_{ij} be the number of transitions from state i to state j in the training data [1]:

    a_{ij} = A_{ij} / \sum_{j'} A_{ij'}

In this example, we must learn four transition probabilities: a_{HH}, a_{HL}, a_{LH}, and a_{LL}. Given a sequence HHHLLHLHLL..., we count the number of HH pairs to determine A_{HH} and normalize by the number of pairs of the form H*.
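This counting scheme can be sketched in Python. This is a minimal sketch, not code from the notes; the function name is invented, and the single training sequence is the example sequence from the text:

```python
from collections import Counter

def learn_transition_probs(sequences):
    """Maximum-likelihood estimates a_ij = A_ij / sum_j' A_ij',
    where A_ij counts i -> j transitions in the training data."""
    counts = Counter()
    states = set()
    for seq in sequences:
        states.update(seq)
        for i, j in zip(seq, seq[1:]):  # consecutive symbol pairs
            counts[(i, j)] += 1
    probs = {}
    for i in states:
        total = sum(counts[(i, j)] for j in states)  # pairs of the form i*
        for j in states:
            probs[(i, j)] = counts[(i, j)] / total if total else 0.0
    return probs

# The example sequence from the notes: A_HH = 2, pairs starting with H = 5
tm = learn_transition_probs(["HHHLLHLHLL"])
print(tm[("H", "H")])  # 2/5 = 0.4
```

Pseudocounts, as mentioned in the footnote, would be added to each A_{ij} before normalizing.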
The other transition probabilities are estimated similarly.

[1] Note that for some models, we may want to incorporate pseudocounts in the calculation of the parameters.

We estimate the initial probability \pi_i by counting the number of sequences that begin with residue i. Once we have learned the parameters, we can use the model to recognize new transmembrane sequences.

Using this model, we can classify an observed sequence O = O_1, O_2, ..., O_T by its log-odds ratio

    S(O) = \log [ P(O | TM) / P(O | EC) ]

where

    P(O | TM) = \pi_{O_1} \prod_{i=1}^{T-1} a_{O_i O_{i+1}}

is the probability of the observed sequence given the TM model; P(O | EC) is defined similarly.

The above models are useful for classifying a sequence fragment in which all residues are of the same class (i.e., all TM or all E/C), but less useful for finding boundaries in a sequence with transitions from regions of one class to regions of another class.

2 Hidden Markov Models

We now consider the following, more difficult, problem: given a transmembrane protein sequence with TM regions interspersed with E/C regions, label each residue with its class (TM or E/C).

To do this, we constructed the following Hidden Markov Model by adding transitions connecting the TM and E/C models. These new transitions indicate the boundaries between one class of region and another.

Four-state transmembrane HMM: (state diagram not reproduced in this extraction)

HMMs are defined formally as follows:

1. N states S_1, ..., S_N
2. M symbols in the alphabet
3. A transition probability matrix a_{ij}
4. Emission probabilities e_i(a), the probability that state i emits character a
5. An initial distribution vector \pi = (\pi_i)

Components 1-3 specify the structure of the model; 3-5 specify the parameters. We refer to the transition probabilities, the emission probabilities, and the initial distribution collectively as the parameters, designated \lambda = (a_{ij}, e_i(a), \pi).

Notation. The sequence of visited states: Q = q_1, q_2, q_3, ...
(the state sequence). The sequence of emitted symbols: O = O_1, O_2, O_3, ... (the observed sequence).

Note that this is a generative model: it gives the probability of generating a particular sequence (hence, the emission probabilities). This allows us to ask: what is the probability of a sequence of interest, if it were generated by a particular HMM?

HMMs differ from Markov chains in a number of ways:

• In an HMM, the sequence of states visited is hidden. Unlike Markov chains, there is no longer a one-to-one correspondence between states and output symbols. The state sequence is hidden because it is not possible to tell the state merely from the output symbol. This hidden sequence of states corresponds to what we want to know, namely the classification of each symbol.

• A state emits a symbol from a fixed alphabet each time it is visited. Emission probabilities are state-dependent, but not time-dependent.

• A symbol may be emitted by more than one state. For example, in the four-state HMM above, "H" is emitted by states H_M and H_{E/C}. Similarly, a state can emit more than one symbol. An example of this can be seen in the three-state HMM below.

In the four-state model above, we used different states to distinguish between hydrophobic and hydrophilic residues in the TM region. Another possibility is to use only one state for the transmembrane model and use the emission probabilities to distinguish between hydrophobic and hydrophilic residues. This approach is used in the following three-state HMM.
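Under the generative view, the joint probability of a state path Q and an observed sequence O factors as P(Q, O) = \pi_{q_1} e_{q_1}(O_1) \prod_{t=2}^{T} a_{q_{t-1} q_t} e_{q_t}(O_t). A minimal sketch in log space; the two-state model and all parameter values here are illustrative assumptions, not values from the notes:

```python
import math

def log_joint(obs, path, pi, a, e):
    """log P(Q, O) = log pi_{q1} + log e_{q1}(O_1)
                     + sum_t [ log a_{q_{t-1} q_t} + log e_{q_t}(O_t) ]."""
    lp = math.log(pi[path[0]]) + math.log(e[path[0]][obs[0]])
    for t in range(1, len(obs)):
        lp += math.log(a[(path[t - 1], path[t])])  # transition term
        lp += math.log(e[path[t]][obs[t]])         # emission term
    return lp

# Illustrative two-state model (values assumed for this sketch)
pi = {"TM": 0.5, "EC": 0.5}
a = {("TM", "TM"): 0.9, ("TM", "EC"): 0.1,
     ("EC", "EC"): 0.9, ("EC", "TM"): 0.1}
e = {"TM": {"H": 0.8, "L": 0.2}, "EC": {"H": 0.3, "L": 0.7}}

lp = log_joint("HHLL", ["TM", "TM", "EC", "EC"], pi, a, e)
```

Summing log probabilities instead of multiplying probabilities avoids numerical underflow on long sequences.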
Notice that we also now distinguish between extracellular sequences and cytosolic sequences by using separate E and C states.

Three-state transmembrane HMM: (state diagram not reproduced in this extraction)

Given labeled examples, we can still learn parameters for each separate model, then choose reasonable transitions between them (or learn them, given long labeled data).

    a_{ij} = A_{ij} / \sum_{j'} A_{ij'}

    e_i(x) = E_i(x) / \sum_{x'} E_i(x')

Note that we now have to learn the initial probabilities, the transition probabilities and the emission probabilities. The parameters for this model are

    i         E     M     C
    \pi_i     0     0     1
    e_i(H)    0.2   0.9   0.3
    e_i(L)    0.8   0.1   0.7

We assume that all transmembrane sequences start in the cytosol.

3 The Viterbi algorithm

In order to find the boundaries between the transmembrane, extracellular and cytosolic regions, we seek argmax_Q P(Q|O), the most probable path through the model given the observed sequence O. We could use brute force by calculating P(Q|O) for all paths. However, this becomes intractable as soon as the number of states gets larger, since the number of state sequences grows exponentially (N^T).

Instead, we calculate argmax_Q P(Q, O) using a dynamic programming algorithm called the Viterbi algorithm. Note that this still gives us the most probable path, because the path that maximizes P(Q, O) also maximizes P(Q|O).
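The Viterbi recurrence can be sketched in log space. The initial and emission probabilities below are the three-state values given in the text; the transition probabilities are illustrative assumptions, since the notes do not list them:

```python
import math

def viterbi(obs, states, pi, a, e):
    """Most probable state path argmax_Q P(Q, O), computed in log space."""
    def lg(p):
        return math.log(p) if p > 0 else float("-inf")
    # Initialization: v[i] = log pi_i + log e_i(O_1)
    v = {s: lg(pi[s]) + lg(e[s][obs[0]]) for s in states}
    back = []
    for x in obs[1:]:
        prev, v, ptr = v, {}, {}
        for j in states:
            # Recurrence: v[j] = max_i (prev[i] + log a_ij) + log e_j(x)
            best = max(states, key=lambda i: prev[i] + lg(a[(i, j)]))
            v[j] = prev[best] + lg(a[(best, j)]) + lg(e[j][x])
            ptr[j] = best
        back.append(ptr)
    # Traceback from the best final state
    q = max(v, key=v.get)
    path = [q]
    for ptr in reversed(back):
        q = ptr[q]
        path.append(q)
    return path[::-1]

states = ("E", "M", "C")
pi = {"E": 0.0, "M": 0.0, "C": 1.0}   # from the notes: start in the cytosol
e = {"E": {"H": 0.2, "L": 0.8},
     "M": {"H": 0.9, "L": 0.1},
     "C": {"H": 0.3, "L": 0.7}}
a = {("C", "C"): 0.8, ("C", "M"): 0.2, ("C", "E"): 0.0,   # illustrative
     ("M", "M"): 0.7, ("M", "C"): 0.15, ("M", "E"): 0.15,
     ("E", "E"): 0.8, ("E", "M"): 0.2, ("E", "C"): 0.0}

path = viterbi("LLHHHHLL", states, pi, a, e)
print(path)
```

Each position considers N predecessor states for each of N states, so the run time is O(N^2 T) rather than the N^T of brute-force enumeration.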

