A modeling ExampleMethylation & SilencingExample: CpG IslandsSlide 4A model of CpG Islands – (1) ArchitectureA model of CpG Islands – (2) TransitionsLog Likehoods— Telling “CpG Island” from “Non-CpG Island”Slide 8What if a new genome comes?Problem 3: LearningTwo learning scenarios1. When the right answer is knownSlide 15Pseudocounts2. When the right answer is unknownSlide 19Estimating new parametersSlide 21The Baum-Welch AlgorithmSlide 23Alternative: Viterbi TrainingConditional Random FieldsLet’s look at an HMM againSlide 27Slide 28“Features” that depend on many pos. in xSlide 30Slide 31Slide 32How many parameters are there, in general?Conditional TrainingSlide 35Slide 36Slide 37Conditional Random Fields—SummaryCS262 Lecture 6, Win07, BatzoglouA+ C+ G+ T+A- C- G- T-A modeling ExampleCpG islands in DNA sequencesCS262 Lecture 6, Win07, BatzoglouMethylation & Silencing•One way cells differentiate is methylationAddition of CH3 in C-nucleotidesSilences genes in region•CG (denoted CpG) often mutates to TG, when methylated•In each cell, one copy of X is silenced, methylation plays role•Methylation is inherited during cell divisionCS262 Lecture 6, Win07, BatzoglouExample: CpG IslandsCpG nucleotides in the genome are frequently methylated(Write CpG not to confuse with CG base pair)C methyl-C TMethylation often suppressed around genes, promoters CpG islandsCS262 Lecture 6, Win07, BatzoglouExample: CpG Islands•In CpG islands, CG is more frequentOther pairs (AA, AG, AT…) have different frequenciesQuestion: Detect CpG islands computationallyCS262 Lecture 6, Win07, BatzoglouA model of CpG Islands – (1) ArchitectureA+ C+ G+ T+A- C- G- T-CpG IslandNot CpG IslandCS262 Lecture 6, Win07, BatzoglouA model of CpG Islands – (2) TransitionsHow do we estimate parameters of the model?Emission probabilities: 1/01. Transition probabilities within CpG islandsEstablished from known CpG islands (Training Set)2. Transition probabilities within other regionsEstablished from known non-CpG islands(Training Set)Note: these transitions out of each state add up to one—no room for transitions between (+) and (-) states+A C G TA.180 .274 .426 .120C.171 .368 .274 .188G.161 .339 .375 .125T.079 .355 .384 .182-A C G TA.300 .205 .285 .210C.233 .298 .078 .302G.248 .246 .298 .208T.177 .239 .292 .292= 1= 1= 1= 1= 1= 1= 1= 1CS262 Lecture 6, Win07, BatzoglouLog Likehoods—Telling “CpG Island” from “Non-CpG Island”A C G TA-0.740 +0.419 +0.580 -0.803C-0.913 +0.302 +1.812 -0.685G-0.624 +0.461 +0.331 -0.730T-1.169 +0.573 +0.393 -0.679Another way to see effects of transitions:Log likelihoodsL(u, v) = log[ P(uv | + ) / P(uv | -) ]Given a region x = x1…xNA quick-&-dirty way to decide whether entire x is CpGP(x is CpG) > P(x is not CpG) i L(xi, xi+1) > 0CS262 Lecture 6, Win07, BatzoglouA model of CpG Islands – (2) Transitions•What about transitions between (+) and (-) states?•They affect Avg. length of CpG islandAvg. separation between two CpG islandsX Y1-p1-qp qLength distribution of region X:P[lX = 1] = 1-pP[lX = 2] = p(1-p)…P[lX= k] = pk-1(1-p)E[lX] = 1/(1-p)Geometric distribution, with mean 1/(1-p)CS262 Lecture 6, Win07, BatzoglouWhat if a new genome comes?•We just sequenced the porcupine genome•We know CpG islands play the same role in this genome•However, we have no known CpG islands for porcupines•We suspect the frequency and characteristics of CpG islands are quite different in porcupinesHow do we adjust the parameters in our model?LEARNINGCS262 Lecture 6, Win07, BatzoglouProblem 3: LearningRe-estimate the parameters of the model based on training dataCS262 Lecture 6, Win07, BatzoglouTwo learning scenarios1. Estimation when the “right answer” is knownExamples: GIVEN: a genomic region x = x1…x1,000,000 where we have good (experimental) annotations of the CpG islandsGIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls2. Estimation when the “right answer” is unknownExamples:GIVEN: the porcupine genome; we don’t know how frequent are the CpG islands there, neither do we know their compositionGIVEN: 10,000 rolls of the casino player, but we don’t see when he changes diceQUESTION: Update the parameters of the model to maximize P(x|)CS262 Lecture 6, Win07, Batzoglou1. When the right answer is knownGiven x = x1…xNfor which the true = 1…N is known,Define:Akl = # times k l transition occurs in Ek(b) = # times state k in emits b in xWe can show that the maximum likelihood parameters (maximize P(x|)) are: Akl Ek(b)akl = ––––– ek(b) = ––––––– i Aki c Ek(c)CS262 Lecture 6, Win07, Batzoglou1. When the right answer is knownIntuition: When we know the underlying states, Best estimate is the normalized frequency of transitions & emissions that occur in the training dataDrawback: Given little data, there may be overfitting:P(x|) is maximized, but is unreasonable0 probabilities – BADExample:Given 10 casino rolls, we observe x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3 = F, F, F, F, F, F, F, F, F, FThen:aFF = 1; aFL = 0eF(1) = eF(3) = .2; eF(2) = .3; eF(4) = 0; eF(5) = eF(6) = .1CS262 Lecture 6, Win07, BatzoglouPseudocountsSolution for small training sets:Add pseudocountsAkl = # times kl transition occurs in + rklEk(b) = # times state k in emits b in x + rk(b)rkl, rk(b) are pseudocounts representing our prior beliefLarger pseudocounts Strong priof beliefSmall pseudocounts ( < 1): just to avoid 0 probabilitiesCS262 Lecture 6, Win07, Batzoglou2. When the right answer is unknownWe don’t know the true Akl, Ek(b)Idea:•We estimate our “best guess” on what Akl, Ek(b) areOr, we start with random / uniform values•We update the parameters of the model, based on our guess•We repeatCS262 Lecture 6, Win07, Batzoglou2. When the right answer is unknownStarting with our best guess of a model M, parameters :Given x = x1…xNfor which the true = 1…N is unknown,We can get to a provably more likely parameter set i.e., that increases the probability P(x | )Principle: EXPECTATION MAXIMIZATION1. Estimate Akl, Ek(b) in the training data2. Update according to Akl, Ek(b)3. Repeat 1 & 2, until convergenceCS262 Lecture 6, Win07, BatzoglouEstimating new parametersTo estimate Akl: (assume “| CURRENT”, in all formulas
View Full Document