Stanford CS 262 - Motif Finding - D334859

Home> Schools> Stanford University> Computer Science (CS) > CS 262> Motif Finding

DOC PREVIEW

Stanford CS 262 - Motif Finding

School name Stanford University

Course Cs 262- Computational Genomics

Pages 6

This preview shows page 1-2 out of 6 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 6 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 6 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 6 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

1Motif FindingMotif FindingRegulation of GenesGeneRegulatory ElementRNA polymerase(Protein)Transcription Factor(Protein)DNARegulation of GenesGeneRNA polymeraseTranscription Factor(Protein)Regulatory ElementDNARegulation of GenesGeneRNA polymeraseTranscription FactorRegulatory ElementDNANew proteinMicroarrays• A 2D array of DNA sequences from thousands of genes• Each spot has many copies of same gene• Allow mRNAs from a sample to hybridize• Measure number of hybridizations per spotFinding Regulatory MotifsFinding Regulatory MotifsTiny Multiple Local Alignments of Many Sequences2Finding Regulatory MotifsGivena collection of genes with common expression,Findthe TF-binding motif in common...Characteristics of Regulatory Motifs• Tiny• Highly Variable• ~Constant Size– Because a constant-size transcription factor binds• Often repeated• Low-complexity-ishProblem DefinitionProbabilisticMotif: Mij; 1 ≤ i ≤ W1 ≤ j ≤ 4Mij= Prob[ letter j, pos i ]Find best M, and positions p1,…, pNin sequencesCombinatorialMotif M: m1…mWSome of the mi’s blankFind M that occurs in all siwith ≤ k differencesGiven a collection of promoter sequences s1,…, sNof genes with common expressionEssentially a Multiple Local Alignment• Find“best” multiple local alignmentAlignment score defined differently in probabilistic/combinatorial cases...Algorithms• Probabilistic1. Expectation Maximization:MEME2. Gibbs Sampling: AlignACE, BioProspector• CombinatorialCONSENSUS, TEIRESIAS, SP-STAR, othersDiscrete Approaches to Motif Finding3Discrete FormulationsGiven sequences S = {x1, …, xn}• A motif W is a consensus string w1…wK• Find motif W*with “best” match to x1, …, xnDefinition of “best”:d(W, xi) = min hamming dist. between W and a word in xid(W, S) = Σid(W, xi)Approaches• Exhaustive Searches• CONSENSUS• MULTIPROFILER, TEIRESIAS, SP-STAR, WINNOWERExhaustive Searches1. Pattern-driven algorithm:For W = AA…A to TT…T (4Kpossibilities)Find d( W, S )Report W* = argmin( d(W, S) )Running time: O( K N 4K)(where N = Σi|xi|)Advantage: Finds provably best motif WDisadvantage: TimeExhaustive Searches2. Sample-driven algorithm:For W = a K-long word in some xiFind d( W, S )Report W* = argmin( d( W, S ) )OR Report a local improvement of W*Running time: O( K N2)Advantage: TimeDisadvantage: If: True motif does not occur in data, andTrue motif is “weak”Then, random motif may score better than any instance of true motifCONSENSUS (1)Algorithm:Cycle 1:For each word W in SFor each word W’ in SCreate alignment (gap free) of W, W’Keep the C1best alignments, A1, …, AC1ACGGTTG , CGAACTT , GGGCTCT …ACGCCTG , AGAACTA , GGGGTGT …CONSENSUS (2)Algorithm (cont’d):Cycle t:For each word W in SFor each alignment Ajfrom cycle t-1Create alignment (gap free) of W, AjKeep the Clbest alignments A1, …, ActACGGTTG , CGAACTT , GGGCTCT …ACGCCTG , AGAACTA , GGGGTGT …… … …ACGGCTC , AGATCTT , GGCGTCT …4CONSENSUS (3)C1, …, Cnare user-defined heuristic constantsRunning time:O(N2) + O(N C1) + O(N C2) + … + O(N Cn)= O( N2+ NCtotal)Where Ctotal= ΣiCi, typically O(nC), where C is a big constantMULTIPROFILER• Extended sample-driven approachGiven a K-long word W, define: Na(W) = words W’ in S s.t. d(W,W’) ≤ aIdea:Assume W is occurrence of true motif W*Will use Na(W) to correct “errors” in WMULTIPROFILER (2)Assume W differs from true motif W*in at most L positionsDefine: A wordletG of W is a L-long pattern with blanks, differing from WExample: K = 7; L = 3W = ACGTTGAG = --A--CGMULTIPROFILER (2)Algorithm:For each W in S:For L = 1 to Lmax1. Find all “strong” L-long wordlets G in Na(W)2. Modify W by the wordlet G -> W’3. Compute d(W’, S)Report W*= argmin d(W’, S)Step 1 above: Smaller motif-finding problem; Use exhaustive searchExpectation Maximization in Motif FindingExpectation Maximization in Motif FindingExpectation Maximization (1)• The MM algorithm, part of MEME package uses Expectation MaximizationAlgorithm (sketch):1. Given genomic sequences find all K-long words2. Assume each word is motif or background3. Find likeliestMotif ModelBackground Modelclassification of words into either Motif or Background5Expectation Maximization (2)• Given sequences x1, …, xN,• Find all k-long words X1,…, Xn• Define motif model: M = (M1,…, MK)Mi= (Mi1,…, Mi4) (assume {A, C, G, T})where Mij= Prob[ motif position i is letter j ]• Define background model:B = B1, …, B4Bi= Prob[ letter j in background sequence ]Expectation Maximization (3)• Define Zi1= { 1, if Xiis motif;0, otherwise }Zi2= { 0, if Xiis motif;1, otherwise }• Given a word Xi= x[1]…x[k],P[ Xi, Zi1=1 ] = λ M1x[1]…Mkx[k] P[ Xi, Zi2=1 ] = (1 - λ) Bx[1]…Bx[K]Let λ1= λ; λ2= (1- λ)Expectation Maximization (4)Define:Parameter space θ = (M,B) Objective:Maximize log likelihood of model:∑ ∑∑∑∑∑= ==== =+==2121111211log)|(log))|(log(),|,...(logj jjijnijiijnini jjijijnZZZXPXPZXXPλθθλλθExpectation Maximization (5)• Maximize expected likelihood, in iteration of two steps:Expectation:Find expected value of log likelihood:Maximization:Maximize expected value over θ, λ)],|,...([log1λθZXXPEnExpectation Maximization (6): EE--stepstepExpectation:Find expected value of log likelihood:∑ ∑∑∑= ===+=2121111log][)|(log][)],|,...([logj jjijnijiijninZZEXPEZXXPEλθλθwhere expected values of Z can be computed as follows:∑==21)|()|(kkikjijijXPXPZθλθλExpectation Maximization (7): MM--stepstepMaximization:Maximize expected value over θ and λ independentlyFor λ, this is easy:∑ ∑= ===niniijjijNEWjnZExamZj1 1log][arg λλλ6Expectation Maximization (8): MM--stepstep• For θ = (M, B), definecjk= E[ # times letter k appears in motif position j]c0k= E[ # times letter k appears in background]It easily follows:∑==41kjkjkNEWjkccM∑==4100kkkNEWkccBto not allow any 0’s, add pseudocountsInitial Parameters Matter!Consider the following “artificial” example:x1, …, xNcontain:– 2Kpatterns A…A, A…AT,……, T…T– 2Kpatterns C…C , C…CG,…… , G…G– D << 2Koccurrences of K-mer ACTG…ACTGSome local maxima:λ ≈ ½; B = ½C, ½G; Mi= ½A, ½T, i = 1,…, Kλ ≈ D/2k+1; B = ¼A,¼C,¼G,¼T; M1= 100% A, M2= 100% C, M3= 100% T, etc.Overview of EM Algorithm1. Initialize parameters θ = (M, B), λ:– Try different values of λ from N-1/2upto 1/(2K)2. Repeat:a. Expectationb. Maximization3. Until change in θ = (M,

View Full Document


School:
Email:
New Password:
Confirm Password:

This preview shows page 1-2 out of 6 pages.

Stanford CS 262 - Motif Finding

Sign up for free to view:

Please select your school