CMU BSC 03711 - lecture

Computational Genomics and Molecular Biology, Fall 2005

HMM Lecture Notes, Tuesday, November 3rd
Rose Hoberman and Dannie Durand

1 Overview

In the past two weeks, we have discussed algorithms for recognizing new patterns, given an HMM for a pattern of interest. We now return to the questions of modeling and discovery.

PSSMs can capture the statistical properties of a family, but they have the following limitations:

1. They cannot capture positional dependencies.
2. They do not handle variable-length patterns well.
3. They cannot recognize indels in the query sequence.

In contrast, HMMs handle gaps and pairwise dependencies well.

2 Notation

1. N states, S_1, ..., S_N
2. M symbols in the alphabet, Σ
3. parameters, λ:
   1. initial distribution of states, π(i)
   2. transition probabilities, a_ij = P(q_t = S_i | q_{t-1} = S_j), with Σ_{i=1}^N a_ij = 1 for all j
   3. emission probabilities, e_i(a), the probability that state i emits symbol a
4. observation sequence: O = O_1, O_2, ..., O_T
5. sequence of observed states: Q = q_1, q_2, ..., q_T

3 Topology

• Characteristics: the number of nodes, the alphabet, and which edges to include. We could just choose a fully connected graph, but this has too many parameters to estimate.

• Instead, we can exploit domain knowledge: choose a topology that limits the number of states and edges while still being expressive enough to represent the relationships we believe to exist.

• The choice of topology can impose a probability distribution on the length of the sequences that the HMM recognizes. For example, a simple self-loop with probability p results in an exponentially decaying (geometric) distribution, P(l residues) = (1 - p) p^{l-1}.
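This geometric length distribution can be checked numerically. A minimal sketch (the choice p = 0.9 is arbitrary, for illustration only):

```python
# A self-loop taken with probability p emits another residue; the state
# is left with probability 1 - p, so the number of residues follows a
# geometric distribution: P(l residues) = (1 - p) * p**(l - 1).

def length_prob(l, p):
    """Probability that the self-loop emits exactly l residues."""
    return (1 - p) * p ** (l - 1)

p = 0.9  # arbitrary self-loop probability, for illustration
lengths = range(1, 2000)
probs = [length_prob(l, p) for l in lengths]

total = sum(probs)                                   # ~1.0: a valid distribution
mean = sum(l * pr for l, pr in zip(lengths, probs))  # ~1 / (1 - p) = 10
print(total, mean)
```

The expected length 1 / (1 - p) shows how the self-loop probability tunes the typical sequence length the topology prefers.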
There are topologies that assume other length distributions (see Durbin, Section 3.4, for more on this subject).

A basic topology:

Suppose we wish to construct an HMM for the WEIRD motif, based on the following alignment, which has no gaps and no positional dependencies:

W E I R D
W E I R D
W E I R E
W E I Q H

We can recognize the WEIRD motif using an HMM with this topology:

[figure omitted: a linear chain of states M_0, M_1, ..., M_6]

where the transition probabilities are a_{i,j} = 1 if j = i + 1, and zero otherwise. The emission probabilities are e_j(α) = F[α, j], where F[α, j] is the same frequency matrix that we derived for the PSSM example, using pseudocounts. The Start and End states (M_0 and M_6) are silent. The above model is our alternative hypothesis, H_A.

To score a new sequence, we also need a background model (the null hypothesis, H_0):

[figure omitted: the background model]

In this model, all transition probabilities are equal to one. The emission probabilities are e_j(α) = p(α), where p(α) is the background frequency of residue α. We can then score a new sequence, O, by calculating log [P(O | H_A) / P(O | H_0)]. In class, we saw that we obtain a score equivalent to Σ_{i=1}^5 S[o_i, i], the score we would have obtained with the PSSM for the WEIRD motif.

Positional dependencies:

Now suppose that our motif has a positional dependency like this one, in which we see either RD or QH in the last two positions, but never QD or RH:

W E I R D
W E I R D
W E I Q H
W E I R D
W E I Q H
W E I Q H

A PSSM for this motif, however, would give the sequences WEIRD and WEIRH equally good scores. So would the basic HMM above. We can construct an HMM to model this pairwise dependency like this:

[figure omitted: the chain branches after position 3 into parallel states M_4, M_5 and M'_4, M'_5]

where the emission probabilities are e_{M_4}(R) = 1, e_{M_5}(D) = 1, e_{M'_4}(Q) = 1, and e_{M'_5}(H) = 1.

Insertions:

We can modify the basic HMM to recognize query sequences with insertions, such as O = WECIRD:

[figure omitted: the basic HMM with insertion states added]

where the emission probabilities for the insertion states are the background frequencies.

Deletions:

Suppose our query sequence has a deletion, e.g., O = WERD.
One approach to capturing such deletions would be to add edges allowing us to jump over any set of match states:

[figure omitted: the basic HMM with added edges skipping over match states]

The disadvantage of this approach is that we would need a very large set of training data to infer the transitions, one in which deletions of all possible sizes were represented. Instead, we can model long deletions as sequences of short ones, as seen below.

[figure omitted: long deletions modeled as chains of single-position deletion states]

4 Profile HMMs

A Profile HMM is a standard topology for modeling sequence motifs. It was proposed by Krogh and Haussler in 1994. This is a profile HMM for a pattern of length five:

[figure omitted: a profile HMM of length 5]

Each insert and match state emits the 20 amino acids, and each delete state emits "-". The emission and transition probabilities must be estimated from data.

Parameter estimation:

Given labeled training data (i.e., we are given the state path), we use maximum likelihood to estimate the parameters. In general,

    e_k(σ) = (E_k(σ) + b) / (Σ_{σ'} E_k(σ') + 20b)

    a(i, j) = A'(i, j) / Σ_l A'(i, l)

where E_k(σ) is the number of instances in the training data where symbol σ is emitted in state k, b is a pseudocount, and A'(i, j) is the number of transitions from i to j in the training data plus a pseudocount, which takes transitions that are not observed into account. Zeros are bad for estimation because we want a distribution over all possible sequences with a peak for family members. Unless there are many family members, use smoothing (the simplest approach is to add pseudocounts).

For our Profile HMM, the estimation of the emission probabilities might look like this:

    e_{M_0}(i) = e_{M_6}(i) = 0, for all i
    e_{I_k}(i) = q_i, for all i and k
    e_{D_k}(i) = 0 for all residues i, and e_{D_k}("-") = 1
    e_{M_k}(i) = (E_k(i) + b) / (Σ_j E_k(j) + 20b)

An example: a profile HMM for a variable-length motif

Profile HMMs like the one above can be used to model variable-length motifs, such as this one:

V G - - H
V - - - N
V E - - D
I A A D N

The length of the HMM should be the average of the lengths of the sequences. The above sequences are of lengths 3, 2, 3, and 5, respectively, yielding an average of 3.25.
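The two estimators above can be written as small helpers before we apply them to this example. A minimal sketch, assuming counts are passed as plain dictionaries (the function names are my own):

```python
def emission_prob(counts, symbol, b=1, alphabet_size=20):
    """e_k(sigma) = (E_k(sigma) + b) / (sum_sigma' E_k(sigma') + alphabet_size * b).

    counts maps each observed symbol to its emission count E_k(symbol)."""
    return (counts.get(symbol, 0) + b) / (sum(counts.values()) + alphabet_size * b)

def transition_prob(counts, j, b=1):
    """a(i, j) = A'(i, j) / sum_l A'(i, l), where A'(i, l) is the observed
    count plus the pseudocount b.

    counts maps every possible successor of state i to its observed count
    (zeros included, so unobserved transitions still receive a pseudocount)."""
    return (counts.get(j, 0) + b) / sum(c + b for c in counts.values())

# Toy check: three V's and one I emitted in some match state.
print(emission_prob({"V": 3, "I": 1}, "V"))  # (3 + 1) / (4 + 20)
```

Passing zero counts explicitly in `transition_prob` matters: the pseudocount in the denominator must cover every possible transition out of the state, not just the observed ones.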
Our HMM will have a silent start state M_0, match states M_1, M_2, M_3, insertion states I_0, I_1, I_2, I_3, deletion states D_1, D_2, D_3, and a silent end state M_4.

In order to estimate the parameters, we need to assign labels to the data using the multiple alignment. Positions in the alignment that have gaps in fewer than 50% of the rows correspond to match states. Those with more than 50% gaps correspond to insertion states:

V G - - H
V - - - N
V E - - D
I A A D N

M1 M2 I2 I2 M3

This yields the following labeled sequences:

V G H        M1 M2 M3
V H          M1 D2 M3
V E D        M1 M2 M3
I A A D N    M1 M2 I2 I2 M3

From these labeled sequences, we can estimate the parameters. For example,

    e_{M_1}(V) = (3 + 1) / (4 + 20)

and

    a_{M_2, I_2} = (1 + 1) / ((2 + 1) + (1 + 1) + (0 + 1)).

The three sums in the denominator correspond to the three possible transitions out of state M_2: to M_3 (observed twice), to I_2 (observed once), and to D_3 (never observed).
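This worked example can be reproduced end to end. A minimal sketch, which derives the match columns from the 50% gap rule and recomputes both estimates (the transition counts out of M_2 are read off the labeled sequences above):

```python
from fractions import Fraction

# The four aligned sequences; '-' marks a gap. Columns with gaps in
# fewer than 50% of rows become match states, the rest insert states.
alignment = ["VG--H", "V---N", "VE--D", "IAADN"]
ncols = 5

match_col = []
for c in range(ncols):
    gaps = sum(row[c] == "-" for row in alignment)
    match_col.append(gaps < len(alignment) / 2)
# match_col == [True, True, False, False, True]  ->  M1, M2, I2, I2, M3

# Count emissions in M1, the first match column.
m1_counts = {}
c = match_col.index(True)
for row in alignment:
    m1_counts[row[c]] = m1_counts.get(row[c], 0) + 1

b = 1  # pseudocount
e_M1_V = Fraction(m1_counts.get("V", 0) + b, sum(m1_counts.values()) + 20 * b)
print(e_M1_V)  # (3 + 1) / (4 + 20) = 1/6

# Transitions out of M2: to M3 twice, to I2 once, to D3 never.
out_M2 = {"M3": 2, "I2": 1, "D3": 0}
a_M2_I2 = Fraction(out_M2["I2"] + b, sum(n + b for n in out_M2.values()))
print(a_M2_I2)  # (1 + 1) / ((2 + 1) + (1 + 1) + (0 + 1)) = 1/3
```

Using `Fraction` keeps the arithmetic exact, so the printed values match the hand-computed fractions in the notes.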

