Hidden Markov Models

[Figure: the HMM trellis. Each sequence position has a column of states 1, 2, …, K, and column i emits symbol x_i; a path through the trellis is a state path π.]

Viterbi, Forward, Backward

VITERBI
Initialization: V_0(0) = 1; V_k(0) = 0, for all k > 0
Iteration: V_l(i) = e_l(x_i) · max_k V_k(i−1) a_kl
Termination: P(x, π*) = max_k V_k(N)

FORWARD
Initialization: f_0(0) = 1; f_k(0) = 0, for all k > 0
Iteration: f_l(i) = e_l(x_i) · Σ_k f_k(i−1) a_kl
Termination: P(x) = Σ_k f_k(N)

BACKWARD
Initialization: b_k(N) = 1, for all k
Iteration: b_k(i) = Σ_l a_kl e_l(x_{i+1}) b_l(i+1)
Termination: P(x) = Σ_k a_0k e_k(x_1) b_k(1)

(A code sketch of all three recurrences is collected, with the other sketches for this part, just before the "Intuition" slide below.)

Variants of HMMs

Higher-order HMMs
• How do we model "memory" longer than one time point?
• First order: P(π_{i+1} = l | π_i = k) = a_kl
• Second order: P(π_{i+1} = l | π_i = k, π_{i−1} = j) = a_jkl
• …
• A second-order HMM with K states is equivalent to a first-order HMM with K² states (see the sketch at the end of this part).

[Figure: coin example. A two-state chain over H and T with memory-dependent transitions a_HT(prev = H), a_HT(prev = T), a_TH(prev = H), a_TH(prev = T) is redrawn as a first-order chain over the pair-states HH, HT, TH, TT with transitions a_HHT, a_HTH, a_HTT, a_THH, a_THT, a_TTH.]

Similar Algorithms to 1st Order
• For P(π_{i+1} = l | π_i = k, π_{i−1} = j):
§ V_lk(i) = max_j { V_kj(i − 1) + … }
§ Time? Space?

Modeling the Duration of States

[Figure: two states X and Y; X loops to itself with probability p and moves on with probability 1−p, Y loops with probability q and moves on with probability 1−q.]

Length distribution of region X:
P(l_X = i) = (1−p) p^{i−1}
• Geometric distribution, with mean E[l_X] = 1/(1−p)
This is a significant disadvantage of HMMs. Several solutions exist for modeling different length distributions.

Example: exon lengths in genes
[Figure: empirical length distribution of exons in genes.]

Solution 1: Chain several states
[Figure: several copies of state X chained in a row before Y.]
l_X = C + geometric with mean 1/(1−p) (a fixed minimum length, then a geometric tail)
Disadvantage: still very inflexible

Solution 2: Negative binomial distribution

[Figure: a chain X(1) → X(2) → … → X(n); each stage loops to itself with probability p and follows the "next state" arrow with probability 1−p; the last arrow leaves to Y.]

Duration in X: m turns, where
§ during the first m − 1 turns, exactly n − 1 arrows to the next state are followed, and
§ during the m-th turn, an arrow to the next state is followed.

P(l_X = m) = C(m−1, n−1) (1−p)^{(n−1)+1} p^{(m−1)−(n−1)} = C(m−1, n−1) (1−p)^n p^{m−n}

(The binomial coefficient C(m−1, n−1) counts which of the first m − 1 turns follow "next" arrows; a numerical check appears at the end of this part.)

Example: genes in prokaryotes
• EasyGene: a prokaryotic gene-finder (Larsen TS, Krogh A)
• Uses a negative binomial with n = 3

Solution 3: Duration modeling
Upon entering a state:
1. Choose a duration d, according to a probability distribution
2. Generate d letters according to the emission probabilities
3. Take a transition to the next state according to the transition probabilities

[Figure: state F chooses a duration d < D_F according to P_F and emits x_i…x_{i+d−1} in one block.]

Disadvantage: increased complexity of Viterbi.
Time: O(D) increase; Space: O(1) increase, where D = maximum duration of a state
(Warning: Rabiner's tutorial claims O(D²) & O(D) increases.)

Viterbi with duration modeling
Recall the original iteration:
V_l(i) = max_k V_k(i − 1) a_kl × e_l(x_i)
New iteration:
V_l(i) = max_k max_{d=1…D_l} V_k(i − d) × P_l(d) × a_kl × Π_{j=i−d+1…i} e_l(x_j)

[Figure: states F and L; each chooses a duration (d < D_F from P_F, d < D_L from P_L) and emits its whole block x_i…x_{i+d−1}, resp. x_j…x_{j+d−1}, before transitioning.]

• Precompute cumulative emission values, so the product over j costs O(1) per candidate duration (a code sketch appears at the end of this part).

Learning
Re-estimate the parameters of the model based on training data.

Two learning scenarios
1. Estimation when the "right answer" is known
Examples:
GIVEN: a genomic region x = x_1…x_1,000,000 where we have good (experimental) annotations of the CpG islands
GIVEN: the casino player lets us observe him one evening as he changes dice, producing 10,000 rolls
2. Estimation when the "right answer" is unknown
Examples:
GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor their composition
GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice
QUESTION: Update the parameters θ of the model to maximize P(x | θ)

1. When the states are known
Given x = x_1…x_N for which the true path π = π_1…π_N is known, define:
A_kl = # times a k→l transition occurs in π
E_k(b) = # times state k in π emits b in x
We can show that the maximum-likelihood parameters θ (the ones maximizing P(x | θ)) are:
a_kl = A_kl / Σ_{l′} A_{kl′}        e_k(b) = E_k(b) / Σ_c E_k(c)
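Below is a minimal Python sketch of the three recurrences from the Viterbi, Forward, Backward summary above. The names (viterbi, forward, backward, a0, a, e) are mine, not from the slides; the vector a0 plays the role of the silent start state 0, symbols are encoded as column indices into e, and everything runs in plain probability space for readability.

```python
import numpy as np

def viterbi(x, a0, a, e):
    """V_l(i) = e_l(x_i) * max_k V_k(i-1) a_kl; returns (best path, P(x, pi*))."""
    N, K = len(x), a.shape[0]
    V = np.zeros((N, K))
    ptr = np.zeros((N, K), dtype=int)            # back-pointers for traceback
    V[0] = a0 * e[:, x[0]]                       # initialization via start state
    for i in range(1, N):
        scores = V[i - 1][:, None] * a           # scores[k, l] = V_k(i-1) * a_kl
        ptr[i] = scores.argmax(axis=0)
        V[i] = e[:, x[i]] * scores.max(axis=0)
    path = [int(V[-1].argmax())]                 # termination: max_k V_k(N)
    for i in range(N - 1, 0, -1):
        path.append(int(ptr[i][path[-1]]))
    return path[::-1], V[-1].max()

def forward(x, a0, a, e):
    """f_l(i) = e_l(x_i) * sum_k f_k(i-1) a_kl; returns (f, P(x))."""
    N, K = len(x), a.shape[0]
    f = np.zeros((N, K))
    f[0] = a0 * e[:, x[0]]
    for i in range(1, N):
        f[i] = e[:, x[i]] * (f[i - 1] @ a)
    return f, f[-1].sum()                        # termination: sum_k f_k(N)

def backward(x, a0, a, e):
    """b_k(i) = sum_l a_kl e_l(x_{i+1}) b_l(i+1); returns (b, P(x))."""
    N, K = len(x), a.shape[0]
    b = np.ones((N, K))                          # initialization: b_k(N) = 1
    for i in range(N - 2, -1, -1):
        b[i] = a @ (e[:, x[i + 1]] * b[i + 1])
    return b, (a0 * e[:, x[0]] * b[0]).sum()     # sum_k a_0k e_k(x_1) b_k(1)

# Dishonest-casino-style toy check: Fair vs. Loaded die, rolls encoded 0..5.
a0 = np.array([0.5, 0.5])
a  = np.array([[0.95, 0.05], [0.10, 0.90]])
e  = np.vstack([np.full(6, 1 / 6), [0.1] * 5 + [0.5]])
x  = [0, 2, 5, 5, 5, 1, 5, 5]
path, _ = viterbi(x, a0, a, e)
_, px_f = forward(x, a0, a, e)
_, px_b = backward(x, a0, a, e)
assert np.isclose(px_f, px_b)                    # both directions give P(x)
```

On long sequences these products underflow; the standard remedies are running Viterbi on log-probabilities and rescaling each column of f and b.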
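As a small illustration of the K² expansion from the higher-order slide, here is the coin example as data: a second-order table P(next | prev, cur) rewritten as first-order transitions between pair-states. The probabilities are made-up placeholders.

```python
# Second-order coin chain over {H, T} -> first-order chain over pair-states.
p_next = {('H', 'H'): {'H': 0.6, 'T': 0.4},   # a_HHT = 0.4, etc. (assumed values)
          ('H', 'T'): {'H': 0.3, 'T': 0.7},
          ('T', 'H'): {'H': 0.5, 'T': 0.5},
          ('T', 'T'): {'H': 0.2, 'T': 0.8}}
# a_{(j,k) -> (k,l)} = a_jkl; any pair transition (j,k) -> (k',l) with k' != k gets 0.
a1 = {(jk, (jk[1], l)): p for jk, row in p_next.items() for l, p in row.items()}
print(a1[(('H', 'H'), ('H', 'T'))])           # a_HHT = 0.4
```

Standard first-order Viterbi/Forward/Backward then run unchanged on the K² = 4 pair-states, which answers the Time? Space? question: per position the work grows from O(K²) to O(K³), and the table has K² cells instead of K.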
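A quick numerical check of the duration formula from Solution 2. The helper name duration_pmf is mine; n = 3 follows the EasyGene slide, and p = 0.9 is an arbitrary choice.

```python
from math import comb

def duration_pmf(m, n, p):
    """P(l_X = m) = C(m-1, n-1) * (1-p)^n * p^(m-n) for an n-stage chain."""
    return comb(m - 1, n - 1) * (1 - p) ** n * p ** (m - n) if m >= n else 0.0

n, p = 3, 0.9                                    # n = 3 as in EasyGene; p assumed
probs = [duration_pmf(m, n, p) for m in range(1, 2001)]
print(sum(probs))                                # ~1.0: a valid distribution
print(sum(m * q for m, q in enumerate(probs, start=1)))  # mean ~ n/(1-p) = 30
```

For n ≥ 2 the distribution has an interior mode rather than decaying from m = 1, which is exactly the extra flexibility the chained construction buys over a single geometric state.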
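A sketch of the duration-modeling Viterbi from Solution 3, in log-space and with the cumulative-emission precomputation recommended above. The names (viterbi_duration, p_dur) are mine; the sketch assumes one common maximum duration D for all states and returns only the score, not the traceback.

```python
import numpy as np

def viterbi_duration(x, a0, a, e, p_dur):
    """V[i, l] = best log-score of a parse whose current state-l segment ends
    at position i (1-based); p_dur[l, d-1] = P_l(d)."""
    N, K = len(x), a.shape[0]
    D = p_dur.shape[1]
    with np.errstate(divide="ignore"):           # log(0) -> -inf is fine here
        # cum[l, i] - cum[l, i-d]  ==  sum of log e_l(x_j) over the segment,
        # so each emission product costs O(1) (the "precompute" trick above).
        cum = np.concatenate([np.zeros((K, 1)),
                              np.cumsum(np.log(e[:, x]), axis=1)], axis=1)
        loga, loga0, logp = np.log(a), np.log(a0), np.log(p_dur)
    V = np.full((N + 1, K), -np.inf)
    for i in range(1, N + 1):
        for l in range(K):
            for d in range(1, min(D, i) + 1):    # the extra factor-D loop
                emit = cum[l, i] - cum[l, i - d]
                prev = loga0[l] if i == d else (V[i - d] + loga[:, l]).max()
                V[i, l] = max(V[i, l], prev + logp[l, d - 1] + emit)
    return V[N].max()                            # log-analogue of max_k V_k(N)
```

Each cell now takes a factor-D more work than plain Viterbi while the cumulative array adds only O(K) extra storage, matching the O(D) time / O(1) space increase claimed on the slide.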
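Finally, a minimal sketch of the counting estimator for the known-path case. ml_estimate, r_trans and r_emit are my names; the two optional pseudocount arguments anticipate the Pseudocounts slide below, and the usage lines reproduce the 10-roll casino example from the next slide.

```python
def ml_estimate(x, pi, states, alphabet, r_trans=0.0, r_emit=0.0):
    """a_kl = A_kl / sum_l' A_kl' and e_k(b) = E_k(b) / sum_c E_k(c),
    with an optional pseudocount r added to every count."""
    A = {k: {l: r_trans for l in states} for k in states}
    E = {k: {b: r_emit for b in alphabet} for k in states}
    for k, l in zip(pi, pi[1:]):                 # transition counts A_kl
        A[k][l] += 1
    for k, b in zip(pi, x):                      # emission counts E_k(b)
        E[k][b] += 1
    a, e = {}, {}
    for k in states:
        ta, te = sum(A[k].values()), sum(E[k].values())
        a[k] = {l: A[k][l] / ta if ta else 0.0 for l in states}
        e[k] = {b: E[k][b] / te if te else 0.0 for b in alphabet}
    return a, e

# The 10-roll example from the next slide: every roll made with the Fair die.
x  = [2, 1, 5, 6, 1, 2, 3, 6, 2, 3]
pi = ['F'] * 10
a, e = ml_estimate(x, pi, states=['F', 'L'], alphabet=range(1, 7))
assert a['F']['F'] == 1.0 and e['F'][4] == 0.0   # overfitting: 0 probabilities
```

Calling it with, say, r_emit = 1 removes the zero at e_F(4); that is precisely the pseudocount fix described two slides below.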
1. When the states are known
Intuition: when we know the underlying states, the best estimate is the normalized frequency of the transitions & emissions that occur in the training data.
Drawback: given little data, there may be overfitting. P(x | θ) is maximized, but θ is unreasonable: 0 probabilities are BAD.
Example: given 10 casino rolls, we observe
x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3
π = F, F, F, F, F, F, F, F, F, F
Then:
a_FF = 1; a_FL = 0
e_F(1) = e_F(3) = e_F(6) = .2; e_F(2) = .3; e_F(4) = 0; e_F(5) = .1

Pseudocounts
Solution for small training sets: add pseudocounts.
A_kl = # times a k→l transition occurs in π + r_kl
E_k(b) = # times state k in π emits b in x + r_k(b)
r_kl, r_k(b) are pseudocounts representing our prior belief.
Larger pseudocounts ⇒ strong prior belief
Small pseudocounts (ε < 1): just to avoid 0 probabilities

Pseudocounts
Example: dishonest casino. We will observe the player for one day, 600 rolls.
Reasonable pseudocounts:
r_0F = r_0L = r_F0 = r_L0 = 1; r_FL = r_LF = r_FF = r_LL = 1
r_F(1) = r_F(2) = … = r_F(6) = 20 (strong belief that fair is fair)
r_L(1) = r_L(2) = … = r_L(6) = 5 (wait and see for loaded)
The numbers above are arbitrary: assigning priors is an art.

2. When the states are hidden
We don't know the true A_kl, E_k(b).
Idea:
• Estimate our "best guess" for what A_kl and E_k(b) are
§ or start with random / uniform values
• Update the parameters of the model based on our guess
• Repeat

2. When the states are hidden
Starting with our best guess of a model M with parameters θ:
given x = x_1…x_N for which the true π = π_1…π_N is unknown, we can get to a provably more likely parameter set θ, i.e., a θ that increases the probability P(x | θ).
Principle: EXPECTATION MAXIMIZATION
1. Estimate A_kl, E_k(b) in the training data
2. Update θ according to A_kl, E_k(b)
3. Repeat 1 & 2, until convergence

Estimating new parameters
To estimate A_kl (assume conditioning on θ_CURRENT in all formulas below):
at each position i of sequence x, find the probability that transition k→l is used:
P(π_i = k, π_{i+1} = l | x) = [1/P(x)] × P(π_i = k, π_{i+1} = l, x_1…x_N) = Q/P(x), where
Q = P(x_1…x_i, π_i = k, π_{i+1} = l, x_{i+1}…x_N)
  = P(π_{i+1} = l, x_{i+1}…x_N | π_i = k) P(x_1…x_i, π_i = k)
  = P(π_{i+1} = l, x_{i+1} x_{i+2}…x_N | π_i = k) f_k(i)
  = P(x_{i+2}…x_N | π_{i+1} = l) P(x_{i+1} | π_{i+1} = l) P(π_{i+1} = l | π_i = k) f_k(i)
  = b_l(i+1) e_l(x_{i+1}) a_kl f_k(i)
So:
P(π_i = k, π_{i+1} = l | x) = f_k(i) a_kl e_l(x_{i+1}) b_l(i+1) / P(x)
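Putting the derivation above to work, here is one full EM iteration as a sketch, reusing the forward() and backward() helpers sketched earlier in these notes. The transition part implements the formula just derived; the emission part uses the analogous single-state posterior P(π_i = k | x) = f_k(i) b_k(i) / P(x), which is standard but not derived in this excerpt. Plain probability space again, so long sequences need scaling, and re-estimation of the start probabilities is omitted.

```python
import numpy as np

def baum_welch_step(x, a0, a, e):
    """One EM iteration: expected counts A_kl and E_k(b), then renormalize."""
    f, Px = forward(x, a0, a, e)                 # from the earlier sketch
    b, _  = backward(x, a0, a, e)
    A, E = np.zeros_like(a), np.zeros_like(e)
    for i in range(len(x)):
        E[:, x[i]] += f[i] * b[i] / Px           # P(pi_i = k | x)
        if i + 1 < len(x):
            # P(pi_i=k, pi_{i+1}=l | x) = f_k(i) a_kl e_l(x_{i+1}) b_l(i+1) / P(x)
            A += np.outer(f[i], e[:, x[i + 1]] * b[i + 1]) * a / Px
    # M-step: a_kl = A_kl / sum_l' A_kl',  e_k(b) = E_k(b) / sum_c E_k(c)
    return A / A.sum(axis=1, keepdims=True), E / E.sum(axis=1, keepdims=True)

# Toy run on made-up rolls (encoded 0..5), starting from a rough guess.
rolls = [0, 5, 5, 2, 5, 1, 5, 5, 3, 5]
a0    = np.array([0.5, 0.5])
a_cur = np.array([[0.9, 0.1], [0.2, 0.8]])
e_cur = np.vstack([np.full(6, 1 / 6), [0.15] * 5 + [0.25]])
for _ in range(50):                              # each step cannot decrease P(x | theta)
    a_cur, e_cur = baum_welch_step(rolls, a0, a_cur, e_cur)
```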