Computational Genomics and Molecular Biology, Fall 2005
HMM Lecture Notes, October 25th
Rose Hoberman and Dannie Durand

Outline

1. Local Multiple Sequence Alignment - overview
2. Limitations of Position Specific Scoring Matrices (PSSMs)
3. Markov Chains (Review)
4. Introduction to Hidden Markov Models (HMMs); Trans-Membrane Prediction

1 Local Multiple Sequence Alignment - overview

• Discovery: Given k unaligned sequences, find a shared pattern or "motif". Last Thursday we discussed a discovery method called the Gibbs Sampler. We will see in the next few lectures that HMMs can also be used for discovery.

• Modeling/Representation: Given an aligned motif, construct a probabilistic model. Formalisms for such models include PSSMs and HMMs.

• Recognition: Given a new sequence, use a probabilistic model to determine whether the sequence contains instances of the motif.

2 Limitations of Position Specific Scoring Matrices (PSSMs)

A PSSM is a |Σ| × w matrix that gives the log-odds score of a sequence fragment of length w. The score corresponds to the ratio of the likelihood of the sequence given the alternate hypothesis to its likelihood given the null hypothesis. Here, the alternate hypothesis is that the sequence is a member of the motif family; the null hypothesis is that the sequence is derived from the background distribution.

Given a new sequence, we search for new instances of the motif by testing all subsequences of length w. This is equivalent to passing a window of length w over the sequence, starting at position 1 and ending at position n − w + 1, and applying the PSSM to each window.
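To make this window-scanning procedure concrete, here is a minimal Python sketch. The log-odds values, the DNA alphabet, the motif width w = 3, and the function names are toy assumptions made up for illustration; they are not taken from these notes.

```python
# Minimal sketch of scanning a sequence with a PSSM (toy values).
# pssm[a][j] = log( P(symbol a at motif position j | motif) / P(a | background) )
pssm = {
    "A": [ 1.1, -0.4,  0.9],
    "C": [-0.8,  1.3, -0.6],
    "G": [-0.5, -0.7,  0.4],
    "T": [ 0.2, -0.9, -1.2],
}
w = 3  # motif width

def score_window(window):
    """Sum the per-position log-odds scores of one length-w window."""
    return sum(pssm[a][j] for j, a in enumerate(window))

def scan(sequence):
    """Apply the PSSM to every window, from position 1 to n - w + 1 (1-based)."""
    n = len(sequence)
    return [(i + 1, score_window(sequence[i:i + w])) for i in range(n - w + 1)]

if __name__ == "__main__":
    for start, score in scan("ACGTGCA"):
        print(f"window at position {start}: log-odds score {score:.2f}")
```

A window whose score exceeds a chosen threshold would be reported as a candidate instance of the motif.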
This approach has several limitations:

• It cannot capture positional dependencies. Suppose that, in a particular motif, we always see either RD or QH at positions j and j + 1, but never QD or RH. This is an example of a pattern that a PSSM cannot represent.

• It is difficult to recognize instances of the pattern that contain insertions or deletions.

• PSSMs are not well suited to representing variable-length patterns.

• This approach is not well suited to detecting sharp boundaries between two regions.

In a boundary detection problem, we are given a sequence and we wish to label each symbol in the sequence according to its class. Examples of boundary detection problems include:

• Recognition of regulatory motifs
• Recognition of protein domains
• Gene recognition
• Intron/exon boundaries
• Transmembrane regions
• α helices, β sheets, etc.

3 Markov Chains

3.1 Intro/Review

Finite-state, discrete Markov chains: A system can be in any one of a finite set of states at a given time. A Markov chain is a probabilistic model of the progression of the system through a sequence of states. For example, the Jukes-Cantor model is a Markov chain that models the evolution of a single position in a sequence over time. The transmembrane Markov chains introduced below model the appearance of amino acid residues as we progress through a peptide sequence.

Markov assumption: At each step in the sequence, the state depends only on the previous state:

P(q_t = S_j \mid q_{t-1} = S_i, q_{t-2} = S_k, \ldots) = P(q_t = S_j \mid q_{t-1} = S_i) = a_{ij}

Components:

1. States S_1, ..., S_N
2. Initial distribution of states π(i)
3. Transition probabilities a_{ij} = P(q_t = S_j \mid q_{t-1} = S_i), with \sum_{j=1}^{N} a_{ij} = 1 for all i. (Note that in the Jukes-Cantor and gambling examples, we used the notation P_{ij} for the transition probabilities.)

3.2 Simple Examples

In previous lectures, we covered two simple Markov chains:

1. gambling (simple random walk)
2. the Jukes-Cantor model of nucleic acid sequence evolution

3.3 Recognizing Trans-Membrane Regions

Consider a model where each amino acid is encoded as either hydrophobic (H) or hydrophilic (L). Thus, any peptide sequence can be represented as a sequence of H's and L's. We can model transmembrane regions as a two-state Markov chain, with states H and L and transitions P(H|H), etc.

Transmembrane model:

The transition probabilities of models like this one are typically learned from data. More about that in the next lecture.

Some questions we can ask using Markov chains:

• What is the probability of being in a particular state at a particular time t?

• What is the probability of seeing a particular sequence of states? For example,

  P(HHHHHH) = \pi(H) \cdot P(q_t = H \mid q_{t-1} = H) \cdots
  P(LHLHLH) = \pi(L) \cdot P(q_t = H \mid q_{t-1} = L) \cdot P(q_t = L \mid q_{t-1} = H) \cdots

  Note that for a Markov chain, this is the same as the probability of seeing a particular sequence of hydrophobic and hydrophilic residues. The computation takes O(T) time, where T is the length of the sequence.

• What is the probability that this sequence is a transmembrane sequence? We could use the Markov chain to determine the probability of a particular sequence and compare it to a predetermined threshold to decide whether the sequence is transmembrane or not. This raises several issues. First, we need a model of the null hypothesis. Second, how should we choose the threshold? Also, the threshold would have to be length dependent, since the longer the sequence, the smaller its probability.

To model the null hypothesis, we can build a second two-state model representing amino acid sequences from the extracellular and cytosol regions:

Extracellular/cytosol model:

Using this model, we can classify an observed sequence O = x_1 x_2 ... x_L by its log-odds ratio (a short sketch of this computation appears at the end of these notes):

S(O) = \log \frac{P(O \mid \text{model}^{+})}{P(O \mid \text{model}^{-})} = \sum_{i=1}^{L} \log \frac{a^{+}_{x_{i-1} x_i}}{a^{-}_{x_{i-1} x_i}}

3.4 Predicting Trans-Membrane Regions in Context

In the discussion above, we assumed that every sequence either was or was not a transmembrane sequence. Ideally, given a complete protein sequence, we would like to find the locations of the transmembrane regions.

How could we do this with a Markov chain? We could pass a window of fixed size over the sequence and calculate the log-odds ratio for each window position, setting some significance threshold for classification. However, we would have to select the window size and decision threshold arbitrarily (and the threshold would again depend on length). This is unattractive because we expect membrane-spanning regions to be of variable length with sharp boundaries.

Solution? Combine the two models to create a Hidden Markov Model that captures the transitions between the transmembrane and extracellular/cytosol regions.

Four-state transmembrane HMM:

This model differs from a Markov chain because the state sequence is hidden: an observed H or L could have been generated by either the transmembrane or the extracellular/cytosol part of the model, so the observations alone do not determine the path through the states.
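Returning to the log-odds classification of Section 3.3, here is a minimal Python sketch, assuming two two-state Markov chains over the hydrophobicity symbols H and L. The transition probabilities and the names model_plus and model_minus are invented for illustration (in practice the probabilities would be learned from data), and the initial-state distribution π is omitted for simplicity.

```python
import math

# Hypothetical transition probabilities (illustration only; learned from data in practice).
# model_plus ~ transmembrane model, model_minus ~ extracellular/cytosol model.
model_plus  = {"H": {"H": 0.85, "L": 0.15}, "L": {"H": 0.50, "L": 0.50}}
model_minus = {"H": {"H": 0.55, "L": 0.45}, "L": {"H": 0.30, "L": 0.70}}

def log_odds(seq):
    """S(O) = sum over i of log( a+_{x_{i-1} x_i} / a-_{x_{i-1} x_i} ),
    ignoring the initial-state term."""
    score = 0.0
    for prev, cur in zip(seq, seq[1:]):
        score += math.log(model_plus[prev][cur] / model_minus[prev][cur])
    return score

if __name__ == "__main__":
    for seq in ["HHHHHHHH", "LHLLHLLL"]:
        s = log_odds(seq)
        label = "transmembrane" if s > 0 else "extracellular/cytosol"
        print(f"{seq}: S = {s:+.2f} -> {label}")
```

A positive score favors the transmembrane model; a negative score favors the extracellular/cytosol model.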

