A modeling ExampleMethylation & SilencingExample: CpG IslandsSlide 4A model of CpG Islands – (1) ArchitectureA model of CpG Islands – (2) TransitionsLog Likehoods— Telling “CpG Island” from “Non-CpG Island”Slide 8What if a new genome comes?LearningTwo learning scenarios1. When the right answer is knownSlide 14PseudocountsSlide 162. When the right answer is unknownSlide 18Estimating new parametersSlide 20The Baum-Welch AlgorithmSlide 22Alternative: Viterbi TrainingA+ C+ G+ T+A- C- G- T-A modeling ExampleCpG islands in DNA sequencesMethylation & Silencing•One way cells differentiate is methylationAddition of CH3 in C-nucleotidesSilences genes in region•CG (denoted CpG) often mutates to TG, when methylated•In each cell, one copy of X is silenced, methylation plays role•Methylation is inherited during cell divisionExample: CpG IslandsCpG nucleotides in the genome are frequently methylated(Write CpG not to confuse with CG base pair)C methyl-C TMethylation often suppressed around genes, promoters CpG islandsExample: CpG Islands•In CpG islands, CG is more frequentOther pairs (AA, AG, AT…) have different frequenciesQuestion: Detect CpG islands computationallyA model of CpG Islands – (1) ArchitectureA+ C+ G+ T+A- C- G- T-CpG IslandNot CpG IslandA model of CpG Islands – (2) TransitionsHow do we estimate parameters of the model?Emission probabilities: 1/01. Transition probabilities within CpG islandsEstablished from known CpG islands (Training Set)2. Transition probabilities within other regionsEstablished from known non-CpG islands(Training Set)Note: these transitions out of each state add up to one—no room for transitions between (+) and (-) states+A C G TA.180 .274 .426 .120C.171 .368 .274 .188G.161 .339 .375 .125T.079 .355 .384 .182-A C G TA.300 .205 .285 .210C.233 .298 .078 .302G.248 .246 .298 .208T.177 .239 .292 .292= 1= 1= 1= 1= 1= 1= 1= 1Log Likehoods—Telling “CpG Island” from “Non-CpG Island”A C G TA-0.740 +0.419 +0.580 -0.803C-0.913 +0.302 +1.812 -0.685G-0.624 +0.461 +0.331 -0.730T-1.169 +0.573 +0.393 -0.679Another way to see effects of transitions:Log likelihoodsL(u, v) = log[ P(uv | + ) / P(uv | -) ]Given a region x = x1…xNA quick-&-dirty way to decide whether entire x is CpGP(x is CpG) > P(x is not CpG) i L(xi, xi+1) > 0A model of CpG Islands – (2) Transitions•What about transitions between (+) and (-) states?•They affect Avg. length of CpG islandAvg. separation between two CpG islandsX Y1-p1-qp qLength distribution of region X:P[lX = 1] = 1-pP[lX = 2] = p(1-p)…P[lX= k] = pk-1(1-p)E[lX] = 1/(1-p)Geometric distribution, with mean 1/(1-p)What if a new genome comes?•We just sequenced the porcupine genome•We know CpG islands play the same role in this genome•However, we have no known CpG islands for porcupines•We suspect the frequency and characteristics of CpG islands are quite different in porcupinesHow do we adjust the parameters in our model?LEARNINGLearningRe-estimate the parameters of the model based on training dataTwo learning scenarios1. Estimation when the “right answer” is knownExamples: GIVEN: a genomic region x = x1…x1,000,000 where we have good (experimental) annotations of the CpG islandsGIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls2. Estimation when the “right answer” is unknownExamples:GIVEN: the porcupine genome; we don’t know how frequent are the CpG islands there, neither do we know their compositionGIVEN: 10,000 rolls of the casino player, but we don’t see when he changes diceQUESTION: Update the parameters of the model to maximize P(x|)1. When the right answer is knownGiven x = x1…xNfor which the true = 1…N is known,Define:Akl = # times k l transition occurs in Ek(b) = # times state k in emits b in xWe can show that the maximum likelihood parameters (maximize P(x|)) are: Akl Ek(b)akl = ––––– ek(b) = ––––––– i Aki c Ek(c)1. When the right answer is knownIntuition: When we know the underlying states, Best estimate is the normalized frequency of transitions & emissions that occur in the training dataDrawback: Given little data, there may be overfitting:P(x|) is maximized, but is unreasonable0 probabilities – BADExample:Given 10 casino rolls, we observe x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3 = F, F, F, F, F, F, F, F, F, FThen:aFF = 1; aFL = 0eF(1) = eF(3) = .2; eF(2) = .3; eF(4) = 0; eF(5) = eF(6) = .1PseudocountsSolution for small training sets:Add pseudocountsAkl = # times kl transition occurs in + rklEk(b) = # times state k in emits b in x + rk(b)rkl, rk(b) are pseudocounts representing our prior beliefLarger pseudocounts Strong priof beliefSmall pseudocounts ( < 1): just to avoid 0 probabilitiesPseudocountsExample: dishonest casinoWe will observe player for one day, 600 rolls Reasonable pseudocounts: r0F = r0L = rF0 = rL0 = 1;rFL = rLF = rFF = rLL = 1;rF(1) = rF(2) = … = rF(6) = 20 (strong belief fair is fair)rL(1) = rL(2) = … = rL(6) = 5 (wait and see for loaded)Above #s are arbitrary – assigning priors is an art2. When the right answer is unknownWe don’t know the true Akl, Ek(b)Idea:•We estimate our “best guess” on what Akl, Ek(b) areOr, we start with random / uniform values•We update the parameters of the model, based on our guess•We repeat2. When the right answer is unknownStarting with our best guess of a model M, parameters :Given x = x1…xNfor which the true = 1…N is unknown,We can get to a provably more likely parameter set i.e., that increases the probability P(x | )Principle: EXPECTATION MAXIMIZATION1. Estimate Akl, Ek(b) in the training data2. Update according to Akl, Ek(b)3. Repeat 1 & 2, until convergenceEstimating new parametersTo estimate Akl: (assume “| CURRENT”, in all formulas below)At each position i of sequence x, find probability transition kl is used:P(i = k, i+1 = l | x) = [1/P(x)] P(i = k, i+1 = l, x1…xN) = Q/P(x)where Q = P(x1…xi, i = k, i+1 = l, xi+1…xN) = = P(i+1 = l, xi+1…xN | i = k) P(x1…xi, i = k) = = P(i+1 = l, xi+1xi+2…xN | i = k) fk(i) = = P(xi+2…xN | i+1 = l) P(xi+1 | i+1 = l) P(i+1 = l | i = k) fk(i) = = bl(i+1) el(xi+1) akl fk(i) fk(i) akl el(xi+1) bl(i+1)So: P(i = k, i+1 = l
View Full Document