(Regulatory-) Motif FindingClustering of GenesChromatin ImmunoprecipitationFinding Regulatory MotifsCharacteristics of Regulatory MotifsEssentially a Multiple Local AlignmentAlgorithmsCombinatorial Approaches to Motif FindingDiscrete FormulationsApproachesExhaustive SearchesSlide 12MULTIPROFILERSlide 14Slide 15CONSENSUSSlide 17Slide 18Expectation Maximization in Motif FindingExpectation MaximizationSlide 21Slide 22Slide 23Slide 24Expectation Maximization: E-stepExpectation Maximization: M-stepSlide 27Initial Parameters Matter!Overview of EM AlgorithmSlide 30(Regulatory-) Motif Finding (Regulatory-) Motif FindingClustering of GenesFind binding sites responsible for common expression patternsChromatin ImmunoprecipitationMicroarray Chip“ChIP-chip”Finding Regulatory MotifsGiven a collection of genes with common expression,Find the TF-binding motif in common...Characteristics of Regulatory Motifs•Tiny•Highly Variable•~Constant SizeBecause a constant-size transcription factor binds•Often repeated•Low-complexity-ishEssentially a Multiple Local Alignment•Find “best” multiple local alignment•Alignment score defined differently in probabilistic/combinatorial cases...Algorithms•CombinatorialCONSENSUS, TEIRESIAS, SP-STAR, others•Probabilistic1. Expectation Maximization:MEME2. Gibbs Sampling: AlignACE, BioProspectorCombinatorial Approaches to Motif FindingDiscrete FormulationsGiven sequences S = {x1, …, xn}•A motif W is a consensus string w1…wK•Find motif W* with “best” match to x1, …, xnDefinition of “best”:d(W, xi) = min hamming dist. between W and any word in xid(W, S) = i d(W, xi)Approaches•Exhaustive Searches•CONSENSUS•MULTIPROFILER•TEIRESIAS, SP-STAR, WINNOWERExhaustive Searches1. Pattern-driven algorithm:For W = AA…A to TT…T (4K possibilities)Find d( W, S )Report W* = argmin( d(W, S) )Running time: O( K N 4K )(where N = i |xi|)Advantage: Finds provably “best” motif WDisadvantage: TimeExhaustive Searches2. Sample-driven algorithm:For W = any K-long word occurring in some xiFind d( W, S )Report W* = argmin( d( W, S ) )or, Report a local improvement of W*Running time: O( K N2 )Advantage: TimeDisadvantage: If the true motif is weak and does not occur in datathen a random motif may score better than any instance of true motifMULTIPROFILER•Extended sample-driven approachGiven a K-long word W, define: Nα(W) = words W’ in S s.t. d(W,W’) αIdea: Assume W is occurrence of true motif W*Will use Nα(W) to correct “errors” in WMULTIPROFILERAssume W differs from true motif W* in at most L positionsDefine: A wordlet G of W is a L-long pattern with blanks, differing from WL is smaller than the word length KExample: K = 7; L = 3W = ACGTTGAG = --A--CGMULTIPROFILERAlgorithm:For each W in S:For L = 1 to Lmax1. Find the α-neighbors of W in S Nα(W)2. Find all “strong” L-long wordlets G in Na(W)3. For each wordlet G,1. Modify W by the wordlet G W’2. Compute d(W’, S)Report W* = argmin d(W’, S)Step 1 above: Smaller motif-finding problem; Use exhaustive searchCONSENSUSAlgorithm:Cycle 1:For each word W in S (of fixed length!)For each word W’ in SCreate alignment (gap free) of W, W’Keep the C1 best alignments, A1, …, AC1ACGGTTG , CGAACTT , GGGCTCT …ACGCCTG , AGAACTA , GGGGTGT …CONSENSUSAlgorithm:Cycle t:For each word W in SFor each alignment Aj from cycle t-1Create alignment (gap free) of W, AjKeep the Cl best alignments A1, …, ACtACGGTTG , CGAACTT , GGGCTCT …ACGCCTG , AGAACTA , GGGGTGT …… … …ACGGCTC , AGATCTT , GGCGTCT …CONSENSUS•C1, …, Cn are user-defined heuristic constantsN is sum of sequence lengthsn is the number of sequencesRunning time: O(N2) + O(N C1) + O(N C2) + … + O(N Cn)= O( N2 + NCtotal)Where Ctotal = i Ci, typically O(nC), where C is a big constantExpectation Maximization in Motif FindingExpectation Maximization in Motif FindingAll K-long wordsmotifbackgroundExpectation MaximizationAlgorithm (sketch):1. Given genomic sequences find all K-long words2. Assume each word is motif or background3. Find likeliest Motif ModelBackground Modelclassification of words into either Motif or BackgroundExpectation Maximization•Given sequences x1, …, xN,•Find all k-long words X1,…, Xn•Define motif model: M = (M1,…, MK)Mi = (Mi1,…, Mi4) (assume {A, C, G, T})where Mij = Prob[ letter j occurs in motif position i ]•Define background model:B = B1, …, B4Bi = Prob[ letter j in background sequence ]motifbackgroundACGTM1MKM1BExpectation Maximization•Define Zi1 = { 1, if Xi is motif; 0, otherwise }Zi2 = { 0, if Xi is motif; 1, otherwise }•Given a word Xi = x[1]…x[k],P[ Xi, Zi1=1 ] = M1x[1]…Mkx[k] P[ Xi, Zi2=1 ] = (1 - ) Bx[1]…Bx[K]Let 1 = ; 2 = (1- )motifbackgroundACGTM1MKM1B 1 – Expectation MaximizationDefine:Parameter space = (M,B) 1: Motif; 2: BackgroundObjective:Maximize log likelihood of model: 2121111211log)|(log))|(log(),|,...(logj jjijnijiijnini jjijijnZZZXPXPZXXPACGTM1MKM1B 1 – Expectation Maximization•Maximize expected likelihood, in iteration of two steps:Expectation:Find expected value of log likelihood:Maximization:Maximize expected value over , )],|,...([log1ZXXPEnExpectation:Find expected value of log likelihood: 2121111log][)|(log][)],|,...([logj jjijnijiijninZZEXPEZXXPEwhere expected values of Z can be computed as follows:ijiijijijZXPXPXPZE *)|()1()|()|(][21Expectation Maximization: E-stepExpectation Maximization: M-stepMaximization:Maximize expected value over and independentlyFor , this is easy: niniiiiNEWnZxamZZ1 1121*))1log(log(arg**•For = (M, B), definecjk = E[ # times letter k appears in motif position j]c0k = E[ # times letter k appears in background]•cij values are calculated easily from Z* valuesIt easily follows:41kjkjkNEWjkccM4100kkkNEWkccBto not allow any 0’s, add pseudocountsExpectation Maximization: M-stepInitial Parameters Matter!Consider the following “artificial” example:x1, …, xN contain:212 patterns on {A, T}: A…A, A…AT,……, T… T212 patterns on {C, G}: C…C , C…CG,…… , G…GD << 212 occurrences of 12-mer ACTGACTGACTGSome local maxima: ½; B = ½C, ½G; Mi = ½A, ½T, i
View Full Document