1Motif FindingMotif FindingRegulation of GenesGeneRegulatory ElementRNA polymerase(Protein)Transcription Factor(Protein)DNARegulation of GenesGeneRNA polymeraseTranscription Factor(Protein)Regulatory ElementDNARegulation of GenesGeneRNA polymeraseTranscription FactorRegulatory ElementDNANew proteinMicroarrays• A 2D array of DNA sequences from thousands of genes• Each spot has many copies of same gene• Allow mRNAs from a sample to hybridize• Measure number of hybridizations per spotFinding Regulatory MotifsFinding Regulatory MotifsTiny Multiple Local Alignments of Many Sequences2Finding Regulatory MotifsGivena collection of genes with common expression,Findthe TF-binding motif in common...Characteristics of Regulatory Motifs• Tiny• Highly Variable• ~Constant Size– Because a constant-size transcription factor binds• Often repeated• Low-complexity-ishProblem DefinitionProbabilisticMotif: Mij; 1 ≤ i ≤ W1 ≤ j ≤ 4Mij= Prob[ letter j, pos i ]Find best M, and positions p1,…, pNin sequencesCombinatorialMotif M: m1…mWSome of the mi’s blankFind M that occurs in all siwith ≤ k differencesGiven a collection of promoter sequences s1,…, sNof genes with common expressionEssentially a Multiple Local Alignment• Find“best” multiple local alignmentAlignment score defined differently in probabilistic/combinatorial cases...Algorithms• Probabilistic1. Expectation Maximization:MEME2. Gibbs Sampling: AlignACE, BioProspector• CombinatorialCONSENSUS, TEIRESIAS, SP-STAR, othersDiscrete Approaches to Motif Finding3Discrete FormulationsGiven sequences S = {x1, …, xn}• A motif W is a consensus string w1…wK• Find motif W*with “best” match to x1, …, xnDefinition of “best”:d(W, xi) = min hamming dist. between W and a word in xid(W, S) = Σid(W, xi)Approaches• Exhaustive Searches• CONSENSUS• MULTIPROFILER, TEIRESIAS, SP-STAR, WINNOWERExhaustive Searches1. Pattern-driven algorithm:For W = AA…A to TT…T (4Kpossibilities)Find d( W, S )Report W* = argmin( d(W, S) )Running time: O( K N 4K)(where N = Σi|xi|)Advantage: Finds provably best motif WDisadvantage: TimeExhaustive Searches2. Sample-driven algorithm:For W = a K-long word in some xiFind d( W, S )Report W* = argmin( d( W, S ) )OR Report a local improvement of W*Running time: O( K N2)Advantage: TimeDisadvantage: If: True motif does not occur in data, andTrue motif is “weak”Then, random motif may score better than any instance of true motifCONSENSUS (1)Algorithm:Cycle 1:For each word W in SFor each word W’ in SCreate alignment (gap free) of W, W’Keep the C1best alignments, A1, …, AC1ACGGTTG , CGAACTT , GGGCTCT …ACGCCTG , AGAACTA , GGGGTGT …CONSENSUS (2)Algorithm (cont’d):Cycle t:For each word W in SFor each alignment Ajfrom cycle t-1Create alignment (gap free) of W, AjKeep the Clbest alignments A1, …, ActACGGTTG , CGAACTT , GGGCTCT …ACGCCTG , AGAACTA , GGGGTGT …… … …ACGGCTC , AGATCTT , GGCGTCT …4CONSENSUS (3)C1, …, Cnare user-defined heuristic constantsRunning time:O(N2) + O(N C1) + O(N C2) + … + O(N Cn)= O( N2+ NCtotal)Where Ctotal= ΣiCi, typically O(nC), where C is a big constantMULTIPROFILER• Extended sample-driven approachGiven a K-long word W, define: Na(W) = words W’ in S s.t. d(W,W’) ≤ aIdea:Assume W is occurrence of true motif W*Will use Na(W) to correct “errors” in WMULTIPROFILER (2)Assume W differs from true motif W*in at most L positionsDefine: A wordletG of W is a L-long pattern with blanks, differing from WExample: K = 7; L = 3W = ACGTTGAG = --A--CGMULTIPROFILER (2)Algorithm:For each W in S:For L = 1 to Lmax1. Find all “strong” L-long wordlets G in Na(W)2. Modify W by the wordlet G -> W’3. Compute d(W’, S)Report W*= argmin d(W’, S)Step 1 above: Smaller motif-finding problem; Use exhaustive searchExpectation Maximization in Motif FindingExpectation Maximization in Motif FindingExpectation Maximization (1)• The MM algorithm, part of MEME package uses Expectation MaximizationAlgorithm (sketch):1. Given genomic sequences find all K-long words2. Assume each word is motif or background3. Find likeliestMotif ModelBackground Modelclassification of words into either Motif or Background5Expectation Maximization (2)• Given sequences x1, …, xN,• Find all k-long words X1,…, Xn• Define motif model: M = (M1,…, MK)Mi= (Mi1,…, Mi4) (assume {A, C, G, T})where Mij= Prob[ motif position i is letter j ]• Define background model:B = B1, …, B4Bi= Prob[ letter j in background sequence ]Expectation Maximization (3)• Define Zi1= { 1, if Xiis motif;0, otherwise }Zi2= { 0, if Xiis motif;1, otherwise }• Given a word Xi= x[1]…x[k],P[ Xi, Zi1=1 ] = λ M1x[1]…Mkx[k] P[ Xi, Zi2=1 ] = (1 - λ) Bx[1]…Bx[K]Let λ1= λ; λ2= (1- λ)Expectation Maximization (4)Define:Parameter space θ = (M,B) Objective:Maximize log likelihood of model:∑ ∑∑∑∑∑= ==== =+==2121111211log)|(log))|(log(),|,...(logj jjijnijiijnini jjijijnZZZXPXPZXXPλθθλλθExpectation Maximization (5)• Maximize expected likelihood, in iteration of two steps:Expectation:Find expected value of log likelihood:Maximization:Maximize expected value over θ, λ)],|,...([log1λθZXXPEnExpectation Maximization (6): EE--stepstepExpectation:Find expected value of log likelihood:∑ ∑∑∑= ===+=2121111log][)|(log][)],|,...([logj jjijnijiijninZZEXPEZXXPEλθλθwhere expected values of Z can be computed as follows:∑==21)|()|(kkikjijijXPXPZθλθλExpectation Maximization (7): MM--stepstepMaximization:Maximize expected value over θ and λ independentlyFor λ, this is easy:∑ ∑= ===niniijjijNEWjnZExamZj1 1log][arg λλλ6Expectation Maximization (8): MM--stepstep• For θ = (M, B), definecjk= E[ # times letter k appears in motif position j]c0k= E[ # times letter k appears in background]It easily follows:∑==41kjkjkNEWjkccM∑==4100kkkNEWkccBto not allow any 0’s, add pseudocountsInitial Parameters Matter!Consider the following “artificial” example:x1, …, xNcontain:– 2Kpatterns A…A, A…AT,……, T…T– 2Kpatterns C…C , C…CG,…… , G…G– D << 2Koccurrences of K-mer ACTG…ACTGSome local maxima:λ ≈ ½; B = ½C, ½G; Mi= ½A, ½T, i = 1,…, Kλ ≈ D/2k+1; B = ¼A,¼C,¼G,¼T; M1= 100% A, M2= 100% C, M3= 100% T, etc.Overview of EM Algorithm1. Initialize parameters θ = (M, B), λ:– Try different values of λ from N-1/2upto 1/(2K)2. Repeat:a. Expectationb. Maximization3. Until change in θ = (M,
View Full Document