Computational Genomics and Molecular Biology, Fall 2006
HMM Lecture Notes
Thursday, November 2nd
Rose Hoberman and Dannie Durand

1 Overview

The last two lectures on HMMs deal with the modeling and discovery functions of HMMs. We are given observed sequences, O^1, O^2, ..., O^k, and wish to construct an HMM with parameters, λ, to model these sequences.

If the sequences are labeled, the main problem is to design the topology. Once the states and connectivity have been chosen, the transition and emission probabilities can be estimated easily using MLE. If the sequences are unlabeled, then it is necessary both to design the topology and to learn the motif and the model parameters. The Baum-Welch algorithm learns the parameters from the data and, implicitly, also discovers the motif. To determine the motif explicitly, we run the Viterbi algorithm on the new HMM to label the states of each input sequence.

In the previous lecture, we discussed the Baum-Welch algorithm. In this lecture, we discuss how to design the topology of an HMM, and the Profile HMM model.

2 Notation

1. N states, S_1, ..., S_N
2. M symbols in the alphabet, Σ
3. Parameters, λ:
   1. initial distribution of states, π(i)
   2. transition probabilities a_{ij} = P(q_t = S_i | q_{t-1} = S_j). Note that Σ_{i=1}^{N} a_{ij} = 1 for all j.
   3. emission probabilities e_i(a), the probability that state i emits symbol a
4. Sequence of symbols: O = O_1, O_2, ..., O_T
5. Sequence of states: Q = q_1, q_2, ..., q_T

3 Topology

• Characteristics of a topology: the number of nodes, the alphabet, and which edges to include. We could simply choose a fully connected graph, but this has too many parameters to estimate.

• Instead, we can exploit domain knowledge: choose a topology that limits the number of states and edges while still being expressive enough to represent the relationships we believe to exist.

• The choice of topology can impose a probability distribution on the length of the sequences that the HMM recognizes. For example, a simple self-loop with probability p results in an exponentially decaying (geometric) distribution, P(l residues) = (1 − p) p^{l−1}. There are topologies that assume other length distributions (see Durbin, Section 3.4, for more on this subject).

A basic topology:

Suppose we wish to construct an HMM for the WEIRD motif, based on the following alignment, which has no gaps and no positional dependencies:

WEIRD
WEIRD
WEIRE
WEIQH

We can recognize the WEIRD motif using an HMM whose topology is a linear chain of states M_0, M_1, ..., M_6, where the transition probabilities are a_{i,j} = 1 if j = i + 1 and zero otherwise. The emission probabilities are e_j(α) = F[α, j], where F[α, j] is the same frequency matrix that we derived for the PSSM example, using pseudocounts. The Start and End states (M_0 and M_6) are silent. The above model is our alternative hypothesis, H_A.

To score a new sequence, we also need a background model (the null hypothesis, H_0). In this model, all transition probabilities are equal to one, and the emission probabilities are e_j(α) = p(α), where p(α) is the background frequency of residue α. We can then score a new sequence, O, by calculating log [ P(O|H_A) / P(O|H_0) ]. We obtain a score equivalent to Σ_{i=1}^{5} S[o_i, i], the score we would have obtained with the PSSM for the WEIRD motif.
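As a concrete illustration of this scoring computation, here is a minimal sketch in Python. The frequency matrix F and the background frequencies p below are illustrative stand-ins, not values from the notes; in practice F[α, j] would be estimated from the alignment with pseudocounts and would cover all 20 amino acids.

    # Sketch of log-odds scoring for the gapless WEIRD motif HMM versus the
    # background model. F and p are illustrative; real values come from the
    # alignment (with pseudocounts) and the background residue frequencies.
    import math

    # F[j][a]: emission probability of the match state for motif position j+1.
    F = [
        {"W": 0.94, "E": 0.02, "I": 0.02, "R": 0.01, "D": 0.01},  # position 1
        {"W": 0.02, "E": 0.94, "I": 0.02, "R": 0.01, "D": 0.01},  # position 2
        {"W": 0.02, "E": 0.02, "I": 0.94, "R": 0.01, "D": 0.01},  # position 3
        {"W": 0.01, "E": 0.02, "I": 0.02, "R": 0.70, "D": 0.25},  # position 4
        {"W": 0.01, "E": 0.02, "I": 0.02, "R": 0.25, "D": 0.70},  # position 5
    ]
    p = {"W": 0.2, "E": 0.2, "I": 0.2, "R": 0.2, "D": 0.2}  # background frequencies

    def log_odds_score(seq):
        """Return log P(O|H_A) / P(O|H_0) for a gapless candidate sequence O.

        Every transition probability is 1 in both models, so the likelihood
        ratio factors over positions and the score equals the PSSM score,
        i.e. the sum over i of S[o_i, i] with S[a, i] = log(F[a, i] / p(a)).
        """
        assert len(seq) == len(F)
        return sum(math.log(F[i][a] / p[a]) for i, a in enumerate(seq))

    print(log_odds_score("WEIRD"))  # strong match: large positive score
    print(log_odds_score("DRIEW"))  # reversed motif: large negative score

Because the transition structure contributes nothing to the likelihood ratio, the HMM score and the PSSM score coincide for this gapless topology; the HMM becomes more expressive only once we add the states described in the remaining sections.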
Positional dependencies:

Now suppose that our motif has a positional dependency like this one, in which we see either RD or QH in the last two positions, but never QD or RH:

WEIRD
WEIRD
WEIQH
WEIRD
WEIQH
WEIQH

A PSSM for this motif, however, would give the sequences WEIRD and WEIRH equally good scores. So would the basic HMM above. We can construct an HMM that models this pairwise dependency by splitting the last two positions into two parallel branches, M_4 → M_5 for RD and M'_4 → M'_5 for QH, where the emission probabilities are e_{M_4}(R) = 1, e_{M_5}(D) = 1, e_{M'_4}(Q) = 1 and e_{M'_5}(H) = 1.

Insertions:

We can modify the basic HMM to recognize query sequences with insertions, such as O = WECIRD, by adding insertion states, where the emission probabilities for the insertion states are the background frequencies.

Deletions:

Suppose our query sequence has a deletion, e.g., O = WERD. One approach to capturing such deletions would be to add edges that allow us to jump over any set of match states. The disadvantage of this approach is that, to infer the transitions, we would need a very large set of training data, one in which deletions of all possible sizes were represented. Instead, we can model long deletions as sequences of short ones.

4 Profile HMMs

A Profile HMM is a standard topology for modeling sequence motifs. It was proposed by Krogh and Haussler in 1994.

(Figure: a profile HMM of length 5.)

Each Insertion and Match state emits the 20 amino acids, and each Delete state emits "-". The emission and transition probabilities must be estimated from data.

Parameter estimation:

Given labeled training data (i.e., we are given the state path), we use maximum likelihood to estimate the parameters. In general,

e_i(σ) = (E_i(σ) + b) / (Σ_{σ'} E_i(σ') + 20b)

a(i, j) = A'(i, j) / Σ_l A'(i, l)

where E_i(σ) is the number of instances in the training data in which symbol σ is emitted in state i, b is a pseudocount, and A'(i, j) is the number of transitions from i to j in the training data plus a pseudocount to account for transitions that are not observed.

For our Profile HMM, the estimates of the emission probabilities might look like this:

e_{M_0}(i) = e_{M_6}(i) = 0 for all i
e_{I_k}(i) = p_i for all i and k
e_{D_k}(i) = 0 for all residues i, and e_{D_k}("-") = 1
e_{M_k}(i) = (E_k(i) + b) / (Σ_j E_k(j) + 20b)

where p_i is the background frequency of residue i.

An example: a profile HMM for a variable-length motif

Profile HMMs like the one above can be used to model variable-length motifs, such as this one:

VG--H
V---N
VE--D
IAADN

The length of the HMM should be approximately the average of the lengths of the sequences. The above sequences are of lengths 3, 2, 3 and 5, respectively, yielding an average of 3.25. Our HMM will have a silent start state M_0, match states M_1, M_2, M_3, insertion states I_0, I_1, I_2, I_3, deletion states D_1, D_2, D_3 and a silent end state M_4.

In order to estimate the parameters, we need to assign labels to the data using the multiple alignment. Positions in the alignment that have gaps in fewer than 50% of the rows correspond to match states; those with more than 50% gaps correspond to insertion states:

V  G  -  -  H
V  -  -  -  N
V  E  -  -  D
I  A  A  D  N
M1 M2 I2 I2 M3

This yields the following labeled sequences:

V  G  H
M1 M2 M3

V  H
M1 D2 M3

V  E  D
M1 M2 M3

I  A  A  D  N
M1 M2 I2 I2 M3

From these labeled sequences, we can estimate the emission and transition probabilities by counting, using the formulas above.
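The column labeling and the counting estimates above can be sketched directly in code. This is a minimal sketch, assuming the 50% gap rule and the pseudocount formula given in the notes; the pseudocount value and variable names are illustrative, and the transition counts A'(i, j) are omitted for brevity.

    # Label alignment columns as match or insert states (50% gap rule) and
    # estimate match-state emission probabilities by counting with a
    # pseudocount b, as in e_{M_k}(i) = (E_k(i) + b) / (sum_j E_k(j) + 20b).
    from collections import defaultdict

    alignment = ["VG--H", "V---N", "VE--D", "IAADN"]
    b = 1.0  # pseudocount (illustrative choice)

    n_rows = len(alignment)
    n_cols = len(alignment[0])

    # A column is a match column if fewer than 50% of its rows are gaps.
    is_match = [sum(row[c] == "-" for row in alignment) / n_rows < 0.5
                for c in range(n_cols)]

    # E_k(sigma): number of times symbol sigma is emitted in match state M_k.
    emission_counts = defaultdict(lambda: defaultdict(float))
    for row in alignment:
        k = 0
        for c, residue in enumerate(row):
            if is_match[c]:
                k += 1                        # this column is match state M_k
                if residue != "-":            # a gap here is a deletion D_k
                    emission_counts[k][residue] += 1
            # residues in non-match columns belong to insert state I_k

    def emission_prob(k, sigma):
        """e_{M_k}(sigma) with pseudocount b over a 20-letter alphabet."""
        total = sum(emission_counts[k].values())
        return (emission_counts[k][sigma] + b) / (total + 20 * b)

    print([c + 1 for c in range(n_cols) if is_match[c]])  # match columns 1, 2, 5
    print(round(emission_prob(1, "V"), 3))                # e_{M_1}(V)

For this alignment the match columns are 1, 2 and 5, reproducing the M1 M2 I2 I2 M3 labeling shown above, and with b = 1 the estimate for e_{M_1}(V) is (3 + 1) / (4 + 20).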

