GraduateGraduate ComputationalComputationalGenomicsGenomics02-710 / 10-81002-710 / 10-810 & MSCBIO2070& MSCBIO2070Computational motif discoveryComputational motif discoveryTakis BenosTakis BenosLecture #11, February 20, 2007Lecture #11, February 20, 2007Reading: Reading: handouts & papershandouts & papersBenos 02-710/MSCBIO2070 20-FEB-2007 2OutlineOutline The problem Pattern representation Pattern matching• Pattern discovery• A greedy algorithm• Expectation-Maximization• Gibbs sampling• Self-organizing mapsBenos 02-710/MSCBIO2070 20-FEB-2007 3Time-course microarray dataTime-course microarray dataSlide courtesy: Jason ErnstBenos 02-710/MSCBIO2070 20-FEB-2007 4De novoDe novo motif finding motif findingGIVEN• A set of unaligned sequences that contain instances of a motif• The expected motif length• A set of background sequences• An objective functionFIND• The motif that maximizes the objective functionBenos 02-710/MSCBIO2070 20-FEB-2007 5Motif finding in unalignedMotif finding in unalignedsequencessequencesBenos 02-710/MSCBIO2070 20-FEB-2007 6• Objective function• Relative entropy• Optimization method• Greedy algorithm (brute force)CONSENSUS, CONSENSUS, WConsensusWConsensus! RHPSSM= fb(i) " lnfb(i)Pref(b)b= AT#i=1N#Benos 02-710/MSCBIO2070 20-FEB-2007 7CONSENSUS, CONSENSUS, WConsensusWConsensus• No need for positive set (background estimated from the genome).• In each round only a number of matrices is retained.• For CONSENSUS, the user specifies the length of the pattern; forWConsensus the user -alternatively- determines the number of SD.• Programs can predict zero or more target sites in the examined promoters.• WConsensus permits p-value calculations of a PSSM.• Testing• 18 CRP-regulated genes• 105 bp long promoters• 24 putative CRP binding sites• CRP binds as dimer• WConsensus: predicted 19 of the 24 (Sn=79%) plus 3 additional sites(Sp=89%).Benos 02-710/MSCBIO2070 20-FEB-2007 8Product multinomial (PM)Product multinomial (PM)modelmodel [Lawrence [Lawrence et al.et al., , ScienceScience,, 1993]1993] Position-specific multinomial distribution: θi = [θiA, θiC, θiG, θiT]Tθ1Α1θ2Α2… θLΑLT TTGATCGT TTGCACGT GTGAGCAT GCAAAGG Position-Specific Scoring Matrix (PSSM): θ Assumes nucleotide distribution in one position is independent ofdistributions in other positionsBenos 02-710/MSCBIO2070 20-FEB-2007 9Alternating approachAlternating approach1. Guess an initial weight matrix2. Use weight matrix to predict instances in the input sequences3. Use instances to predict a weight matrix4. Repeat 2 & 3 until no change.Examples: MEME (expectation maximization / Bailey, Elkan)Gibbs sampler (Lawrence et al.)ANN-Spec (neural net / Workman, Stormo)10Expectation - MaximizationExpectation - MaximizationBenos 02-710/MSCBIO2070 20-FEB-2007 11Expectation-maximizationExpectation-maximizationforeach subsequence of width Wconvert subsequence to a matrixdo { re-estimate motif occurrences from matrixre-estimate matrix model from motif occurrences} until (matrix model stops changing)endselect matrix with highest scoreEMBenos 02-710/MSCBIO2070 20-FEB-2007 12Expectation-maximizationExpectation-maximizationDNA “signal”TGACCTCTTGATCTTAGGACCCTATGATCCGTTGACCCTTGGACCCTTTGACCTCTTGACCTTAPSSMA -1.1 -1.1 +1.1 -1.1 -1.1 -1.1 -1.1 +.29C -1.1 -1.1 -1.1 + .85 +1.1 +.51 0.0 -1.1G 0.0 +1.1 -1.1 -1.1 -1.1 -1.1 -.40 -1.1T +.85 -1.1 -1.1 0.0 -1.1 +.51 +.69 +.69Benos 02-710/MSCBIO2070 20-FEB-2007 13Expectation-maximizationExpectation-maximizationDNA “signal”TGACCTTTTGATCTTATGACCCTATGATCCGTTGACTCTTGGACCCTTTGACCTCTTGACCTTAPSSM (2)A -1.1 -1.1 +1.1 -1.1 -1.1 -1.1 -1.1 +.29C -1.1 -1.1 -1.1 + .85 +1.1 +.51 -.40 -1.1G -.40 +1.1 -1.1 -1.1 -1.1 -1.1 -.40 -1.1T +.98 -1.1 -1.1 +0.0 -1.1 +.51 +.85 +.69Benos 02-710/MSCBIO2070 20-FEB-2007 14Scoring each subsequenceScoring each subsequenceSubsequences ScoreTGTGACCT -0.42 GTGACCTG -2.34 TGCTGGTT +5.93 GCTGGTTT -2.42 ...Sequence: TGTGACCTTTTTTGTGGCATCGGGCGAGAATASelect from each sequence the subsequence with maximal score.Benos 02-710/MSCBIO2070 20-FEB-2007 15Re-estimating motif matrixRe-estimating motif matrixOccurrencesTTTGATCGTTTGCACGTGTGAGCATGCAAAGGCountsA 00013201C 00101030G 02030113T 42300100+ pseudoA 11124312C 11212141G 13141224T 53411211A 0.13 0.13 0.13 0.25 0.50 ...C 0.13 0.13 0.25 0.13 0.25G 0.13 0.38 0.13 0.50 0.13T 0.63 0.38 0.50 0.13 0.13Convert to frequenciesBenos 02-710/MSCBIO2070 20-FEB-2007 16•Problem: How do we estimate counts accurately when we haveonly a few examples?•Solution: Use Dirichlet mixture priors.•Problem: Too many possible starting points.•Solution: Save time by running only one iteration of EM.•Problem: Too many possible widths.•Solution: Consider widths that vary by √2 and adjust motifs afterwards.•Problem: The EM algorithm finds only one motif.•Solution: Probabilistically erase the motif from the data set, and repeat.Possible problemsPossible problemsBenos 02-710/MSCBIO2070 20-FEB-2007 17• Problem: The motif model is too simplistic.• Solution: Use a two-component mixture model that captures the backgrounddistribution. Allow the background model to be more complex.• Problem: The EM algorithm does not tell you how many motifsthere are.• Solution: Compute statistical significance of motifs and stop when they areno longer significant.• Problem:• This procedure doesn't allow the motifs to move around very much. Takingthe max is too brittle.• Solution:• Associate with each start site a probability of motif occurrence.Possible problems (Possible problems (cntdcntd))Benos 02-710/MSCBIO2070 20-FEB-2007 18WhatWhat’’s the idea behind it?s the idea behind it? Assume we have a set of observed quantities x (e.g., bindingsites), determined from a set of missing or hidden data y (e.g.,PSSM model). Assume we have set of estimated parameters θt. We want to calculate a new θt+1 with: We define/calculate:! log P(x |"t +1) > logP(x |"t)! Q("|"t) = P(y | x,"t) # log P(x, y |")y$E-stepBenos 02-710/MSCBIO2070 20-FEB-2007 19WhatWhat’’s the idea behind it?s the idea behind it?So, choose:! log P(x |") = log P(x, y |") # log P(y | x,")! P(x, y |") = P(y | x,") # P(x |") $! = P(y | x,"t) # log P(x, y |")y$% P(y | x,"t) # log P(y | x,")y$= Q("|"t) % P(y | x,"t) # log P(y | x,")y$&! = Q("|"t) # Q("t|"t) + P(y | x,"t) $ logP(y
View Full Document