Computational Genomics and Molecular Biology, Fall 2006
HMM Lecture Notes, Tuesday, October 31
Rose Hoberman and Dannie Durand

1 Overview

In the past two weeks, we have discussed algorithms for recognizing instances of a pattern, given an HMM for the pattern of interest. We now return to the questions of modeling and discovery.

PSSMs can capture the statistical properties of a sequence family, but they have the following limitations:

1. They cannot capture positional dependencies.
2. They do not handle variable-length patterns well.
3. They cannot recognize indels in the query sequence.

In contrast, HMMs handle gaps and pairwise dependencies well.

2 Notation

1. $N$ states, $S_1, \ldots, S_N$
2. $M$ symbols in the alphabet, $\Sigma$
3. parameters, $\lambda$:
   (a) initial distribution of states, $\pi(i)$
   (b) transition probabilities, $a_{ij} = P(q_t = S_j \mid q_{t-1} = S_i)$. Note that $\sum_{j=1}^{N} a_{ij} = 1, \forall i$
   (c) emission probabilities, $e_i(a)$, the probability that state $i$ emits symbol $a$
4. sequence of symbols: $O = O_1, O_2, \ldots, O_T$
5. sequence of states: $Q = q_1, q_2, \ldots, q_T$

3 Baum-Welch

In general, if we have labeled data (that is, we know the state sequence), we can obtain the parameters using Maximum Likelihood Estimation. The Baum-Welch algorithm is used to estimate the model parameters when the state path is unknown. Given sequences $O^1, O^2, \ldots$, we wish to determine $\lambda = \{a_{ij}, e_i(\cdot), \pi_i\}$. We generally want to choose parameters that maximize the likelihood of our data.

However, finding a global maximum is intractable. We would have to enumerate all parameter sets, $\lambda_k$, and calculate

    $\mathrm{Score}(\lambda_k) = \sum_d P(O^d \mid \lambda_k) = \sum_d \sum_Q P(O^d, Q \mid \lambda_k)$

for each $\lambda_k$. Instead, people settle for heuristics that are guaranteed to find at least a local maximum. Since these are heuristics, evaluation is usually done empirically by withholding some of the training data for testing, but we will not discuss this further.

The algorithm (BW) used for selecting the parameter values belongs to a family of algorithms called Expectation Maximization (EM) algorithms. They all work by guessing initial parameter values, then estimating the likelihood of the data under the current parameters. These likelihoods can then be used to re-estimate the parameters, iterating until a local maximum is reached.

Setup: The alphabet and the number of states, $N$, are fixed. We are given the observed sequences (denoted $O^d = O^d_1, O^d_2, \ldots$).

The intuition behind the algorithm is as follows:

1. Choose some initial values for $\lambda = (\pi, a_{ij}, e_i(\cdot))$.
2. Determine "probable paths" $Q^d = q^d_1, q^d_2, \ldots$
3. Count the expected number of transitions, $A_{ij}$, from state $i$ to state $j$, given the current estimate of $\lambda$.
4. Count the expected number of times, $E_i(\sigma)$, that character $\sigma$ is emitted from state $i$.
5. Re-estimate $\lambda = (\pi, a_{ij}, e_i(\cdot))$ from $A_{ij}$ and $E_i(\sigma)$.
6. If not converged, go to step 2.

For a given sequence, $O^d$, the probability of transiting from state $i$ to state $j$ at time $t$ is

    $P(q^d_t = i, q^d_{t+1} = j \mid O^d, \lambda) = \frac{P(q^d_t = i, q^d_{t+1} = j, O^d)}{P(O^d)} = \frac{\alpha_t(i)\, a_{ij}\, e_j(O^d_{t+1})\, \beta_{t+1}(j)}{P(O^d)}$

The term $\alpha_t(i)$ is the probability that the model has emitted symbols $O^d_1 \ldots O^d_t$ and is in state $S_i$ at time $t$. This probability can be obtained using the Forward algorithm. Similarly, the Backward algorithm yields $\beta_{t+1}(j)$, the probability of emitting the rest of the sequence if we are in state $j$ at time $t + 1$. The remaining two terms, $a_{ij}$ and $e_j(O^d_{t+1})$, give the probability of making the transition from $i$ to $j$ and emitting the $(t+1)$st character.
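Baum-Welch treats Forward and Backward as subroutines. Since the notes do not give code for them, here is a minimal Python sketch for reference; it is an illustration under assumed conventions (states indexed 0..N-1, parameters as nested lists, unscaled probabilities), not the course's own implementation. A practical version would rescale or work in log space to avoid underflow on long sequences.

    # Illustrative sketch of the Forward and Backward algorithms.
    # obs is a sequence of symbol indices; pi[i] is the initial distribution;
    # a[i][j] = P(next state = j | current state = i)  (row-stochastic);
    # e[i][s] = P(state i emits symbol s).
    # NOTE: unscaled probabilities; real code rescales or uses log space.

    def forward(obs, pi, a, e):
        """alpha[t][i] = P(O_1..O_t, q_t = S_i); returns (alpha, P(O))."""
        N, T = len(pi), len(obs)
        alpha = [[0.0] * N for _ in range(T)]
        for i in range(N):
            alpha[0][i] = pi[i] * e[i][obs[0]]
        for t in range(1, T):
            for j in range(N):
                alpha[t][j] = e[j][obs[t]] * sum(
                    alpha[t - 1][i] * a[i][j] for i in range(N))
        return alpha, sum(alpha[T - 1])

    def backward(obs, a, e):
        """beta[t][i] = P(O_{t+1}..O_T | q_t = S_i)."""
        N, T = len(a), len(obs)
        beta = [[0.0] * N for _ in range(T)]
        for i in range(N):
            beta[T - 1][i] = 1.0
        for t in range(T - 2, -1, -1):
            for i in range(N):
                beta[t][i] = sum(a[i][j] * e[j][obs[t + 1]] * beta[t + 1][j]
                                 for j in range(N))
        return beta

Given alpha, beta, and P_O = P(O^d) from forward, the transition posterior above is alpha[t][i] * a[i][j] * e[j][obs[t+1]] * beta[t+1][j] / P_O; equations (1) and (2) below simply accumulate these posteriors over $t$ and over sequences.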
From this we can estimate

    $A_{ij} = \sum_d \frac{1}{P(O^d)} \sum_t \alpha(t, i)\, a_{ij}\, e_j(O^d_{t+1})\, \beta(t+1, j)$    (1)

The probability $P(O^d)$ can be computed under the current parameter values using the Forward algorithm. Similarly,

    $E_i(\sigma) = \sum_d \frac{1}{P(O^d)} \sum_{\{t \mid O^d_t = \sigma\}} \alpha(t, i)\, \beta(t, i).$    (2)

From $A_{ij}$ and $E_i(\sigma)$ we re-estimate the parameters.

Stated formally:

Algorithm: Baum-Welch
Input:
    A set of observed sequences, $O^1, O^2, \ldots$
Initialization:
    Select arbitrary model parameters, $\lambda' = (a'_{ij}, e'_i(\cdot))$.
    $score' = \sum_d P(O^d \mid \lambda')$.
Repeat {
    $\lambda = \lambda'$, $score = score'$
    For each sequence, $O^d$, {
        /* Calculate "probable paths" $Q^d = q^d_1, q^d_2, \ldots$ */
        Calculate $\alpha(t, i)$ for $O^d$ using the Forward algorithm.
        Calculate $\beta(t, i)$ for $O^d$ using the Backward algorithm.
        Calculate the contribution of $O^d$ to $A$ using (1).
        Calculate the contribution of $O^d$ to $E$ using (2).
    }
    $a'_{ij} = A_{ij} / \sum_l A_{il}$
    $e'_i(\sigma) = E_i(\sigma) / \sum_\tau E_i(\tau)$
    $score' = \sum_d P(O^d \mid a'_{ij}, e'_i(\cdot))$.
}
Until (the change in score is less than some predefined threshold.)

The estimation of "probable paths" $Q^d = q^d_1, q^d_2, \ldots$ in the inner loop is done efficiently using dynamic programming, via the Forward and Backward algorithms; the details were not covered in class. You are not responsible for the details of how the Forward and Backward algorithms are used in Baum-Welch, but you should understand the basic idea of the algorithm and are responsible for knowing when Baum-Welch should be applied.

Convergence: It can be proven that if the current estimate is replaced by these new estimates, then the likelihood of the data will not decrease (i.e., it will increase unless we are already at a local maximum or critical point). See Durbin, Section 11.6, for a discussion of avoiding local maxima and other typical pitfalls with this algorithm.

4 Topology

• Characteristics of a topology: the number of nodes, the alphabet, and which edges to include. We could just choose a fully connected graph, but this has too many parameters to estimate.

• Instead, we can exploit domain knowledge: choose a topology that limits the number of states and edges while still being expressive enough to represent the relationships we believe to exist.

• The choice of topology can impose a probability distribution on the length of the sequences that the HMM recognizes. For example, a simple self-loop with probability $p$ results in an exponentially decaying (geometric) distribution, $P(l \text{ residues}) = (1 - p)\, p^{l-1}$. There are topologies that assume other length distributions (see Durbin, Section 3.4, for more on this subject).

A basic topology:

Suppose we wish to construct an HMM for the WEIRD motif, based on the following alignment, which has no gaps and no positional dependencies:

WEIRD
WEIRD
WEIRE
WEIQH

We can recognize the WEIRD motif using an HMM with this topology:

[Figure: a linear chain of states, $M_0 \rightarrow M_1 \rightarrow M_2 \rightarrow M_3 \rightarrow M_4 \rightarrow M_5 \rightarrow M_6$]

where the transition probabilities are $a_{i,j} = 1$ if $j = i + 1$ and zero otherwise. The emission probabilities are $e_j(\alpha) = F[\alpha, j]$, where $F[\alpha, j]$ is the same frequency matrix that we derived for the PSSM example, using pseudocounts. The Start and End states ($M_0$ and $M_6$) are silent. The above model is our alternate hypothesis, $H_A$.
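To make the motif model concrete, here is a small Python sketch that builds the emission frequencies from the alignment. It is an illustration, not the notes' own code: the notes reuse the frequency matrix $F[\alpha, j]$ from the PSSM example, whose exact pseudocount scheme is not reproduced here, so this sketch assumes simple add-one (Laplace) pseudocounts over the 20 amino acids.

    # Sketch: emission probabilities for the WEIRD motif HMM (illustrative).
    # Assumption: add-one (Laplace) pseudocounts over the 20 amino acids; the
    # PSSM example in the notes may have used a different pseudocount scheme.
    # Transition probabilities are trivial here: a[i][i+1] = 1 along the chain.

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
    ALIGNMENT = ["WEIRD", "WEIRD", "WEIRE", "WEIQH"]  # gapless alignment above

    def emission_matrix(alignment, pseudocount=1.0):
        """F[j][aa] = probability that match state M_{j+1} emits residue aa."""
        length = len(alignment[0])
        F = []
        for j in range(length):
            column = [seq[j] for seq in alignment]
            counts = {aa: pseudocount + column.count(aa) for aa in AMINO_ACIDS}
            total = sum(counts.values())
            F.append({aa: c / total for aa, c in counts.items()})
        return F

    F = emission_matrix(ALIGNMENT)
    # Example: in the last column (state M_5) the residue D occurs twice among
    # the four sequences, so F[4]['D'] = (1 + 2) / (20 + 4) = 0.125 under these
    # assumptions.

Scoring a query against $H_A$ then amounts to running the Forward algorithm sketched earlier with these emissions and the deterministic chain transitions.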

