Computational Genomics and Molecular Biology, Fall 2005

HMM Lecture Notes, October 25th
Rose Hoberman and Dannie Durand

Outline

1. Local Multiple Sequence Alignment - overview
2. Limitations of Position Specific Scoring Matrices (PSSM's)
3. Markov Chains (Review)
4. Introduction to Hidden Markov Models (HMMs); Trans-Membrane Prediction

1 Local Multiple Sequence Alignment - overview

• Discovery: Given k unaligned sequences, find a shared pattern or "motif". Last Thursday we discussed a discovery method called the Gibbs Sampler. We will see in the next few lectures that HMMs can also be used for discovery.

• Modeling/Representation: Given an aligned motif, construct a probabilistic model. Formalisms for such models include PSSM's and HMMs.

• Recognition: Given a new sequence, use a probabilistic model to determine whether the sequence contains instances of the motif.

2 Limitations of Position Specific Scoring Matrices (PSSM's)

A PSSM is a |Σ| × w matrix that gives the log odds score of a sequence fragment of length w. The score corresponds to the ratio of the likelihood of the sequence given the alternate hypothesis to the likelihood of the sequence given the null hypothesis. Here, the alternate hypothesis is that the sequence is a member of the motif family; the null hypothesis is that the sequence is derived from the background distribution.

Given a new sequence of length n, we search for new instances of the motif by testing all subsequences of length w. This is equivalent to passing a window of length w over the sequence, starting at position 1 and ending at position n − w + 1, and applying the PSSM to each window.

This approach has several limitations:

• It cannot capture positional dependencies. Suppose, in a particular motif, we always see either RD or QH at positions j and j + 1, but never QD or RH.
This is an example of a pattern that a PSSM cannot represent.

• It is difficult to recognize instances of the pattern that contain insertions or deletions.

• PSSM's are not well suited to represent variable length patterns.

• This approach is not well suited for detecting sharp boundaries between two regions.

In a boundary detection problem, we are given a sequence and we wish to label each symbol in the sequence according to its class. Examples of boundary detection problems include:

• Recognition of regulatory motifs
• Recognition of protein domains
• Gene recognition
• Intron/exon boundaries
• Transmembrane regions
• α helices, β sheets, etc.

3 Markov Chains

3.1 Intro/Review

Finite state, discrete Markov Chains: A system can be in any one of a finite set of states at a given time. A Markov chain is a probabilistic model of the progression of the system through a sequence of states. For example, the Jukes-Cantor model is a Markov chain that models the evolution of a single position in a sequence over time. The transmembrane Markov chains introduced below model the appearance of amino acid residues as we progress through a peptide sequence.

Markov assumption: At each step in the sequence, the state depends only on the previous state:

P(q_t = S_j | q_{t−1} = S_i, q_{t−2} = S_k, ...) = P(q_t = S_j | q_{t−1} = S_i) = a_ij

Components:

1. States (S_1 .. S_N)
2. Initial distribution of states π(i)
3. Transition probabilities a_ij = P(q_t = S_j | q_{t−1} = S_i), where Σ_{j=1}^{N} a_ij = 1 for all i. (Note that in the Jukes-Cantor and gambling examples, we used the notation P_ij for the transition probabilities.)

3.2 Simple Examples

In previous lectures, we covered two simple Markov chains:

1. gambling (simple random walk)
2. Jukes-Cantor model of nucleic acid sequence evolution

3.3 Recognizing Trans-Membrane Regions

Consider a model where each amino acid is encoded as either hydrophobic (H) or hydrophilic (L).
Thus, any peptide sequence can be represented as a sequence of H's and L's. We can model transmembrane regions as a two-state Markov chain, with states H and L and transition probabilities P(H|H), etc.

Transmembrane model:

The transition probabilities of models like this one are typically learned from data. More about that in the next lecture.

Some questions we can ask using Markov chains:

• What's the probability of being in a particular state at a particular time t?

• What is the probability of seeing a particular sequence of states?

P(HHHHHH) = π(H) · P(q_t = H | q_{t−1} = H) · ...
P(LHLHLH) = π(L) · P(q_t = H | q_{t−1} = L) · P(q_t = L | q_{t−1} = H) · ...

Note that for a Markov chain, this is the same as the probability of seeing a particular sequence of hydrophobic and hydrophilic residues.

Running time: O(T), where T is the length of the sequence.

• What is the probability that this sequence is a transmembrane sequence?

We could use the Markov chain to determine the probability of a particular sequence and compare it to a predetermined threshold to decide whether the sequence is transmembrane or not. This raises several issues. First, we need a model of the null hypothesis. Second, how should we choose the threshold? Also, the threshold would have to be length dependent, since the longer the sequence, the smaller its probability.

To model the null hypothesis, we can build a second two-state model representing amino acid sequences from the extracellular and cytosol regions:

Extracellular/cytosol model:

Using this model, we can classify an observed sequence O = O_1, O_2, ..., O_L by its log odds ratio:

S(O) = log( P(O | model+) / P(O | model−) ) = Σ_{i=1}^{L} log( a+_{x_{i−1} x_i} / a−_{x_{i−1} x_i} )

3.4 Predicting Trans-Membrane Regions in Context

In the discussion above, we assumed that every sequence either was or was not a transmembrane sequence. Ideally, given a complete protein sequence, we would like to find the locations of the transmembrane regions.

How could we do this with a Markov chain?
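Before answering, the log odds score S(O) from the previous section can be sketched in Python. The numeric transition probabilities below are illustrative placeholders (the notes do not give values), chosen so that hydrophobic runs score as transmembrane-like:

```python
import math

# Placeholder transition probabilities (rows sum to 1); real values would
# be estimated from known transmembrane and background sequences.
TM = {("H", "H"): 0.9, ("H", "L"): 0.1,   # model+: transmembrane
      ("L", "H"): 0.7, ("L", "L"): 0.3}
BG = {("H", "H"): 0.4, ("H", "L"): 0.6,   # model-: extracellular/cytosol
      ("L", "H"): 0.4, ("L", "L"): 0.6}

def log_odds(seq):
    """S(O) = sum_i log(a+_{x_{i-1} x_i} / a-_{x_{i-1} x_i}), summed over
    consecutive residue pairs (the initial-state term is omitted here)."""
    return sum(math.log(TM[pair] / BG[pair]) for pair in zip(seq, seq[1:]))

print(log_odds("HHHHHHHH"))   # clearly positive: looks transmembrane
print(log_odds("LHLHLHLH"))   # clearly negative: looks like background
```

Note that the score is a sum of per-transition terms, so the expected score grows with sequence length; this is the length dependence that makes choosing a single threshold awkward.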
We could pass a window of fixed size over the sequence and calculate the log odds ratio for each window position, setting some significance threshold for classification. However, we would have to select the window size and decision threshold arbitrarily (and the threshold depends on the window length). This is unattractive because we expect membrane-spanning regions to be of variable length with sharp boundaries.

Solution? Combine the two models to create a Hidden Markov Model that captures the transitions between the transmembrane and extracellular/cytosol regions.

Four-state transmembrane HMM:

This model differs from a Markov chain because the state sequence is hidden: we observe only the sequence of H's and L's, not which region (transmembrane or extracellular/cytosol) generated each residue.
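The combined model can be sketched with Viterbi decoding, which recovers the most likely region label for every residue at once, with no window size or threshold to tune. This is a minimal sketch, not the lecture's construction: all probabilities are made-up placeholders, and the four states pair a hidden region label (TM or BG) with the residue class it emits.

```python
import math

# Four hidden states: (region, residue class). Each state emits its own
# residue class with probability 1, so an observed symbol constrains
# which states are possible; the region label (TM vs. BG) stays hidden.
STATES = [("TM", "H"), ("TM", "L"), ("BG", "H"), ("BG", "L")]

# Placeholder transition probabilities (each row sums to 1); in practice
# these would be learned from annotated protein sequences.
A = {
    ("TM", "H"): {("TM", "H"): 0.85, ("TM", "L"): 0.05, ("BG", "H"): 0.05, ("BG", "L"): 0.05},
    ("TM", "L"): {("TM", "H"): 0.75, ("TM", "L"): 0.05, ("BG", "H"): 0.10, ("BG", "L"): 0.10},
    ("BG", "H"): {("TM", "H"): 0.15, ("TM", "L"): 0.05, ("BG", "H"): 0.35, ("BG", "L"): 0.45},
    ("BG", "L"): {("TM", "H"): 0.15, ("TM", "L"): 0.05, ("BG", "H"): 0.30, ("BG", "L"): 0.50},
}
INIT = {s: 0.25 for s in STATES}  # uniform initial distribution

def viterbi_labels(seq):
    """Most likely region label (TM or BG) for each residue in seq."""
    # dp[state] = (log-probability of the best path ending here, that path)
    dp = {s: (math.log(INIT[s]), [s]) for s in STATES if s[1] == seq[0]}
    for sym in seq[1:]:
        nxt = {}
        for s in STATES:
            if s[1] != sym:          # state must emit the observed class
                continue
            prev = max(dp, key=lambda p: dp[p][0] + math.log(A[p][s]))
            lp, path = dp[prev]
            nxt[s] = (lp + math.log(A[prev][s]), path + [s])
        dp = nxt
    _, best = max(dp.values(), key=lambda v: v[0])
    return [region for region, _ in best]

# Under these placeholder values, the hydrophobic core is labeled TM
# and the flanking hydrophilic residues BG:
print(viterbi_labels("LLHHHHHHHHLL"))
```

The boundaries here come out of the decoding itself, and the transmembrane segment can be any length, which is exactly what the fixed-window approach could not deliver.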