CMU CS 10810 - Computational motif discovery - D2605418

Home> Schools> Carnegie Mellon University> Computer Science (CS) > CS 10810> Computational motif discovery

DOC PREVIEW

CMU CS 10810 - Computational motif discovery

School name Carnegie Mellon University

Course Cs 10810- Computational Genomics

Pages 60

This preview shows page 1-2-3-4-28-29-30-31-57-58-59-60 out of 60 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 60 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 60 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 60 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 60 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 60 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 60 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 60 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 60 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 60 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 60 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 60 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 60 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 60 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

GraduateGraduate ComputationalComputationalGenomicsGenomics02-710 / 10-81002-710 / 10-810 & MSCBIO2070& MSCBIO2070Computational motif discoveryComputational motif discoveryTakis BenosTakis BenosLecture #11, February 20, 2007Lecture #11, February 20, 2007Reading: Reading: handouts & papershandouts & papersBenos 02-710/MSCBIO2070 20-FEB-2007 2OutlineOutline The problem Pattern representation Pattern matching• Pattern discovery• A greedy algorithm• Expectation-Maximization• Gibbs sampling• Self-organizing mapsBenos 02-710/MSCBIO2070 20-FEB-2007 3Time-course microarray dataTime-course microarray dataSlide courtesy: Jason ErnstBenos 02-710/MSCBIO2070 20-FEB-2007 4De novoDe novo motif finding motif findingGIVEN• A set of unaligned sequences that contain instances of a motif• The expected motif length• A set of background sequences• An objective functionFIND• The motif that maximizes the objective functionBenos 02-710/MSCBIO2070 20-FEB-2007 5Motif finding in unalignedMotif finding in unalignedsequencessequencesBenos 02-710/MSCBIO2070 20-FEB-2007 6• Objective function• Relative entropy• Optimization method• Greedy algorithm (brute force)CONSENSUS, CONSENSUS, WConsensusWConsensus! RHPSSM= fb(i) " lnfb(i)Pref(b)b= AT#i=1N#Benos 02-710/MSCBIO2070 20-FEB-2007 7CONSENSUS, CONSENSUS, WConsensusWConsensus• No need for positive set (background estimated from the genome).• In each round only a number of matrices is retained.• For CONSENSUS, the user specifies the length of the pattern; forWConsensus the user -alternatively- determines the number of SD.• Programs can predict zero or more target sites in the examined promoters.• WConsensus permits p-value calculations of a PSSM.• Testing• 18 CRP-regulated genes• 105 bp long promoters• 24 putative CRP binding sites• CRP binds as dimer• WConsensus: predicted 19 of the 24 (Sn=79%) plus 3 additional sites(Sp=89%).Benos 02-710/MSCBIO2070 20-FEB-2007 8Product multinomial (PM)Product multinomial (PM)modelmodel [Lawrence [Lawrence et al.et al., , ScienceScience,, 1993]1993] Position-specific multinomial distribution: θi = [θiA, θiC, θiG, θiT]Tθ1Α1θ2Α2… θLΑLT TTGATCGT TTGCACGT GTGAGCAT GCAAAGG Position-Specific Scoring Matrix (PSSM): θ Assumes nucleotide distribution in one position is independent ofdistributions in other positionsBenos 02-710/MSCBIO2070 20-FEB-2007 9Alternating approachAlternating approach1. Guess an initial weight matrix2. Use weight matrix to predict instances in the input sequences3. Use instances to predict a weight matrix4. Repeat 2 & 3 until no change.Examples: MEME (expectation maximization / Bailey, Elkan)Gibbs sampler (Lawrence et al.)ANN-Spec (neural net / Workman, Stormo)10Expectation - MaximizationExpectation - MaximizationBenos 02-710/MSCBIO2070 20-FEB-2007 11Expectation-maximizationExpectation-maximizationforeach subsequence of width Wconvert subsequence to a matrixdo { re-estimate motif occurrences from matrixre-estimate matrix model from motif occurrences} until (matrix model stops changing)endselect matrix with highest scoreEMBenos 02-710/MSCBIO2070 20-FEB-2007 12Expectation-maximizationExpectation-maximizationDNA “signal”TGACCTCTTGATCTTAGGACCCTATGATCCGTTGACCCTTGGACCCTTTGACCTCTTGACCTTAPSSMA -1.1 -1.1 +1.1 -1.1 -1.1 -1.1 -1.1 +.29C -1.1 -1.1 -1.1 + .85 +1.1 +.51 0.0 -1.1G 0.0 +1.1 -1.1 -1.1 -1.1 -1.1 -.40 -1.1T +.85 -1.1 -1.1 0.0 -1.1 +.51 +.69 +.69Benos 02-710/MSCBIO2070 20-FEB-2007 13Expectation-maximizationExpectation-maximizationDNA “signal”TGACCTTTTGATCTTATGACCCTATGATCCGTTGACTCTTGGACCCTTTGACCTCTTGACCTTAPSSM (2)A -1.1 -1.1 +1.1 -1.1 -1.1 -1.1 -1.1 +.29C -1.1 -1.1 -1.1 + .85 +1.1 +.51 -.40 -1.1G -.40 +1.1 -1.1 -1.1 -1.1 -1.1 -.40 -1.1T +.98 -1.1 -1.1 +0.0 -1.1 +.51 +.85 +.69Benos 02-710/MSCBIO2070 20-FEB-2007 14Scoring each subsequenceScoring each subsequenceSubsequences ScoreTGTGACCT -0.42 GTGACCTG -2.34 TGCTGGTT +5.93 GCTGGTTT -2.42 ...Sequence: TGTGACCTTTTTTGTGGCATCGGGCGAGAATASelect from each sequence the subsequence with maximal score.Benos 02-710/MSCBIO2070 20-FEB-2007 15Re-estimating motif matrixRe-estimating motif matrixOccurrencesTTTGATCGTTTGCACGTGTGAGCATGCAAAGGCountsA 00013201C 00101030G 02030113T 42300100+ pseudoA 11124312C 11212141G 13141224T 53411211A 0.13 0.13 0.13 0.25 0.50 ...C 0.13 0.13 0.25 0.13 0.25G 0.13 0.38 0.13 0.50 0.13T 0.63 0.38 0.50 0.13 0.13Convert to frequenciesBenos 02-710/MSCBIO2070 20-FEB-2007 16•Problem: How do we estimate counts accurately when we haveonly a few examples?•Solution: Use Dirichlet mixture priors.•Problem: Too many possible starting points.•Solution: Save time by running only one iteration of EM.•Problem: Too many possible widths.•Solution: Consider widths that vary by √2 and adjust motifs afterwards.•Problem: The EM algorithm finds only one motif.•Solution: Probabilistically erase the motif from the data set, and repeat.Possible problemsPossible problemsBenos 02-710/MSCBIO2070 20-FEB-2007 17• Problem: The motif model is too simplistic.• Solution: Use a two-component mixture model that captures the backgrounddistribution. Allow the background model to be more complex.• Problem: The EM algorithm does not tell you how many motifsthere are.• Solution: Compute statistical significance of motifs and stop when they areno longer significant.• Problem:• This procedure doesn't allow the motifs to move around very much. Takingthe max is too brittle.• Solution:• Associate with each start site a probability of motif occurrence.Possible problems (Possible problems (cntdcntd))Benos 02-710/MSCBIO2070 20-FEB-2007 18WhatWhat’’s the idea behind it?s the idea behind it? Assume we have a set of observed quantities x (e.g., bindingsites), determined from a set of missing or hidden data y (e.g.,PSSM model). Assume we have set of estimated parameters θt. We want to calculate a new θt+1 with: We define/calculate:! log P(x |"t +1) > logP(x |"t)! Q("|"t) = P(y | x,"t) # log P(x, y |")y$E-stepBenos 02-710/MSCBIO2070 20-FEB-2007 19WhatWhat’’s the idea behind it?s the idea behind it?So, choose:! log P(x |") = log P(x, y |") # log P(y | x,")! P(x, y |") = P(y | x,") # P(x |") $! = P(y | x,"t) # log P(x, y |")y$% P(y | x,"t) # log P(y | x,")y$= Q("|"t) % P(y | x,"t) # log P(y | x,")y$&! = Q("|"t) # Q("t|"t) + P(y | x,"t) $ logP(y

View Full Document