Stanford CS 262 - CpG islands in DNA sequences - D708547

Home> Schools> Stanford University> Computer Science (CS) > CS 262> CpG islands in DNA sequences

DOC PREVIEW

Stanford CS 262 - CpG islands in DNA sequences

School name Stanford University

Course Cs 262- Computational Genomics

Pages 35

This preview shows page 1-2-16-17-18-34-35 out of 35 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 35 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 35 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 35 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 35 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 35 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 35 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 35 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 35 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

A modeling ExampleMethylation & SilencingExample: CpG IslandsSlide 4A model of CpG Islands – (1) ArchitectureA model of CpG Islands – (2) TransitionsLog Likehoods— Telling “CpG Island” from “Non-CpG Island”Slide 8What if a new genome comes?Problem 3: LearningTwo learning scenarios1. When the right answer is knownSlide 15Pseudocounts2. When the right answer is unknownSlide 19Estimating new parametersSlide 21The Baum-Welch AlgorithmSlide 23Alternative: Viterbi TrainingConditional Random FieldsLet’s look at an HMM againSlide 27Slide 28“Features” that depend on many pos. in xSlide 30Slide 31Slide 32How many parameters are there, in general?Conditional TrainingSlide 35Slide 36Slide 37Conditional Random Fields—SummaryCS262 Lecture 6, Win07, BatzoglouA+ C+ G+ T+A- C- G- T-A modeling ExampleCpG islands in DNA sequencesCS262 Lecture 6, Win07, BatzoglouMethylation & Silencing•One way cells differentiate is methylationAddition of CH3 in C-nucleotidesSilences genes in region•CG (denoted CpG) often mutates to TG, when methylated•In each cell, one copy of X is silenced, methylation plays role•Methylation is inherited during cell divisionCS262 Lecture 6, Win07, BatzoglouExample: CpG IslandsCpG nucleotides in the genome are frequently methylated(Write CpG not to confuse with CG base pair)C  methyl-C  TMethylation often suppressed around genes, promoters CpG islandsCS262 Lecture 6, Win07, BatzoglouExample: CpG Islands•In CpG islands, CG is more frequentOther pairs (AA, AG, AT…) have different frequenciesQuestion: Detect CpG islands computationallyCS262 Lecture 6, Win07, BatzoglouA model of CpG Islands – (1) ArchitectureA+ C+ G+ T+A- C- G- T-CpG IslandNot CpG IslandCS262 Lecture 6, Win07, BatzoglouA model of CpG Islands – (2) TransitionsHow do we estimate parameters of the model?Emission probabilities: 1/01. Transition probabilities within CpG islandsEstablished from known CpG islands (Training Set)2. Transition probabilities within other regionsEstablished from known non-CpG islands(Training Set)Note: these transitions out of each state add up to one—no room for transitions between (+) and (-) states+A C G TA.180 .274 .426 .120C.171 .368 .274 .188G.161 .339 .375 .125T.079 .355 .384 .182-A C G TA.300 .205 .285 .210C.233 .298 .078 .302G.248 .246 .298 .208T.177 .239 .292 .292= 1= 1= 1= 1= 1= 1= 1= 1CS262 Lecture 6, Win07, BatzoglouLog Likehoods—Telling “CpG Island” from “Non-CpG Island”A C G TA-0.740 +0.419 +0.580 -0.803C-0.913 +0.302 +1.812 -0.685G-0.624 +0.461 +0.331 -0.730T-1.169 +0.573 +0.393 -0.679Another way to see effects of transitions:Log likelihoodsL(u, v) = log[ P(uv | + ) / P(uv | -) ]Given a region x = x1…xNA quick-&-dirty way to decide whether entire x is CpGP(x is CpG) > P(x is not CpG) i L(xi, xi+1) > 0CS262 Lecture 6, Win07, BatzoglouA model of CpG Islands – (2) Transitions•What about transitions between (+) and (-) states?•They affect Avg. length of CpG islandAvg. separation between two CpG islandsX Y1-p1-qp qLength distribution of region X:P[lX = 1] = 1-pP[lX = 2] = p(1-p)…P[lX= k] = pk-1(1-p)E[lX] = 1/(1-p)Geometric distribution, with mean 1/(1-p)CS262 Lecture 6, Win07, BatzoglouWhat if a new genome comes?•We just sequenced the porcupine genome•We know CpG islands play the same role in this genome•However, we have no known CpG islands for porcupines•We suspect the frequency and characteristics of CpG islands are quite different in porcupinesHow do we adjust the parameters in our model?LEARNINGCS262 Lecture 6, Win07, BatzoglouProblem 3: LearningRe-estimate the parameters of the model based on training dataCS262 Lecture 6, Win07, BatzoglouTwo learning scenarios1. Estimation when the “right answer” is knownExamples: GIVEN: a genomic region x = x1…x1,000,000 where we have good (experimental) annotations of the CpG islandsGIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls2. Estimation when the “right answer” is unknownExamples:GIVEN: the porcupine genome; we don’t know how frequent are the CpG islands there, neither do we know their compositionGIVEN: 10,000 rolls of the casino player, but we don’t see when he changes diceQUESTION: Update the parameters  of the model to maximize P(x|)CS262 Lecture 6, Win07, Batzoglou1. When the right answer is knownGiven x = x1…xNfor which the true  = 1…N is known,Define:Akl = # times k l transition occurs in Ek(b) = # times state k in  emits b in xWe can show that the maximum likelihood parameters  (maximize P(x|)) are: Akl Ek(b)akl = ––––– ek(b) = ––––––– i Aki c Ek(c)CS262 Lecture 6, Win07, Batzoglou1. When the right answer is knownIntuition: When we know the underlying states, Best estimate is the normalized frequency of transitions & emissions that occur in the training dataDrawback: Given little data, there may be overfitting:P(x|) is maximized, but  is unreasonable0 probabilities – BADExample:Given 10 casino rolls, we observe x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3 = F, F, F, F, F, F, F, F, F, FThen:aFF = 1; aFL = 0eF(1) = eF(3) = .2; eF(2) = .3; eF(4) = 0; eF(5) = eF(6) = .1CS262 Lecture 6, Win07, BatzoglouPseudocountsSolution for small training sets:Add pseudocountsAkl = # times kl transition occurs in  + rklEk(b) = # times state k in  emits b in x + rk(b)rkl, rk(b) are pseudocounts representing our prior beliefLarger pseudocounts  Strong priof beliefSmall pseudocounts ( < 1): just to avoid 0 probabilitiesCS262 Lecture 6, Win07, Batzoglou2. When the right answer is unknownWe don’t know the true Akl, Ek(b)Idea:•We estimate our “best guess” on what Akl, Ek(b) areOr, we start with random / uniform values•We update the parameters of the model, based on our guess•We repeatCS262 Lecture 6, Win07, Batzoglou2. When the right answer is unknownStarting with our best guess of a model M, parameters :Given x = x1…xNfor which the true  = 1…N is unknown,We can get to a provably more likely parameter set i.e.,  that increases the probability P(x | )Principle: EXPECTATION MAXIMIZATION1. Estimate Akl, Ek(b) in the training data2. Update  according to Akl, Ek(b)3. Repeat 1 & 2, until convergenceCS262 Lecture 6, Win07, BatzoglouEstimating new parametersTo estimate Akl: (assume “| CURRENT”, in all formulas

View Full Document