Stanford CS 262 - Motif Finding - D3056376

Home> Schools> Stanford University> Computer Science (CS) > CS 262> Motif Finding

DOC PREVIEW

Stanford CS 262 - Motif Finding

School name Stanford University

Course Cs 262- Computational Genomics

Pages 30

This preview shows page 1-2-14-15-29-30 out of 30 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 30 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 30 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 30 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 30 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 30 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 30 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 30 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

(Regulatory-) Motif FindingClustering of GenesChromatin ImmunoprecipitationFinding Regulatory MotifsCharacteristics of Regulatory MotifsEssentially a Multiple Local AlignmentAlgorithmsCombinatorial Approaches to Motif FindingDiscrete FormulationsApproachesExhaustive SearchesSlide 12MULTIPROFILERSlide 14Slide 15CONSENSUSSlide 17Slide 18Expectation Maximization in Motif FindingExpectation MaximizationSlide 21Slide 22Slide 23Slide 24Expectation Maximization: E-stepExpectation Maximization: M-stepSlide 27Initial Parameters Matter!Overview of EM AlgorithmSlide 30(Regulatory-) Motif Finding (Regulatory-) Motif FindingClustering of GenesFind binding sites responsible for common expression patternsChromatin ImmunoprecipitationMicroarray Chip“ChIP-chip”Finding Regulatory MotifsGiven a collection of genes with common expression,Find the TF-binding motif in common...Characteristics of Regulatory Motifs•Tiny•Highly Variable•~Constant SizeBecause a constant-size transcription factor binds•Often repeated•Low-complexity-ishEssentially a Multiple Local Alignment•Find “best” multiple local alignment•Alignment score defined differently in probabilistic/combinatorial cases...Algorithms•CombinatorialCONSENSUS, TEIRESIAS, SP-STAR, others•Probabilistic1. Expectation Maximization:MEME2. Gibbs Sampling: AlignACE, BioProspectorCombinatorial Approaches to Motif FindingDiscrete FormulationsGiven sequences S = {x1, …, xn}•A motif W is a consensus string w1…wK•Find motif W* with “best” match to x1, …, xnDefinition of “best”:d(W, xi) = min hamming dist. between W and any word in xid(W, S) = i d(W, xi)Approaches•Exhaustive Searches•CONSENSUS•MULTIPROFILER•TEIRESIAS, SP-STAR, WINNOWERExhaustive Searches1. Pattern-driven algorithm:For W = AA…A to TT…T (4K possibilities)Find d( W, S )Report W* = argmin( d(W, S) )Running time: O( K N 4K )(where N = i |xi|)Advantage: Finds provably “best” motif WDisadvantage: TimeExhaustive Searches2. Sample-driven algorithm:For W = any K-long word occurring in some xiFind d( W, S )Report W* = argmin( d( W, S ) )or, Report a local improvement of W*Running time: O( K N2 )Advantage: TimeDisadvantage: If the true motif is weak and does not occur in datathen a random motif may score better than any instance of true motifMULTIPROFILER•Extended sample-driven approachGiven a K-long word W, define: Nα(W) = words W’ in S s.t. d(W,W’)  αIdea: Assume W is occurrence of true motif W*Will use Nα(W) to correct “errors” in WMULTIPROFILERAssume W differs from true motif W* in at most L positionsDefine: A wordlet G of W is a L-long pattern with blanks, differing from WL is smaller than the word length KExample: K = 7; L = 3W = ACGTTGAG = --A--CGMULTIPROFILERAlgorithm:For each W in S:For L = 1 to Lmax1. Find the α-neighbors of W in S  Nα(W)2. Find all “strong” L-long wordlets G in Na(W)3. For each wordlet G,1. Modify W by the wordlet G  W’2. Compute d(W’, S)Report W* = argmin d(W’, S)Step 1 above: Smaller motif-finding problem; Use exhaustive searchCONSENSUSAlgorithm:Cycle 1:For each word W in S (of fixed length!)For each word W’ in SCreate alignment (gap free) of W, W’Keep the C1 best alignments, A1, …, AC1ACGGTTG , CGAACTT , GGGCTCT …ACGCCTG , AGAACTA , GGGGTGT …CONSENSUSAlgorithm:Cycle t:For each word W in SFor each alignment Aj from cycle t-1Create alignment (gap free) of W, AjKeep the Cl best alignments A1, …, ACtACGGTTG , CGAACTT , GGGCTCT …ACGCCTG , AGAACTA , GGGGTGT …… … …ACGGCTC , AGATCTT , GGCGTCT …CONSENSUS•C1, …, Cn are user-defined heuristic constantsN is sum of sequence lengthsn is the number of sequencesRunning time: O(N2) + O(N C1) + O(N C2) + … + O(N Cn)= O( N2 + NCtotal)Where Ctotal = i Ci, typically O(nC), where C is a big constantExpectation Maximization in Motif FindingExpectation Maximization in Motif FindingAll K-long wordsmotifbackgroundExpectation MaximizationAlgorithm (sketch):1. Given genomic sequences find all K-long words2. Assume each word is motif or background3. Find likeliest Motif ModelBackground Modelclassification of words into either Motif or BackgroundExpectation Maximization•Given sequences x1, …, xN,•Find all k-long words X1,…, Xn•Define motif model: M = (M1,…, MK)Mi = (Mi1,…, Mi4) (assume {A, C, G, T})where Mij = Prob[ letter j occurs in motif position i ]•Define background model:B = B1, …, B4Bi = Prob[ letter j in background sequence ]motifbackgroundACGTM1MKM1BExpectation Maximization•Define Zi1 = { 1, if Xi is motif; 0, otherwise }Zi2 = { 0, if Xi is motif; 1, otherwise }•Given a word Xi = x[1]…x[k],P[ Xi, Zi1=1 ] =  M1x[1]…Mkx[k] P[ Xi, Zi2=1 ] = (1 - ) Bx[1]…Bx[K]Let 1 = ; 2 = (1- )motifbackgroundACGTM1MKM1B 1 – Expectation MaximizationDefine:Parameter space  = (M,B) 1: Motif; 2: BackgroundObjective:Maximize log likelihood of model:    2121111211log)|(log))|(log(),|,...(logj jjijnijiijnini jjijijnZZZXPXPZXXPACGTM1MKM1B 1 – Expectation Maximization•Maximize expected likelihood, in iteration of two steps:Expectation:Find expected value of log likelihood:Maximization:Maximize expected value over , )],|,...([log1ZXXPEnExpectation:Find expected value of log likelihood:  2121111log][)|(log][)],|,...([logj jjijnijiijninZZEXPEZXXPEwhere expected values of Z can be computed as follows:ijiijijijZXPXPXPZE *)|()1()|()|(][21Expectation Maximization: E-stepExpectation Maximization: M-stepMaximization:Maximize expected value over  and  independentlyFor , this is easy:  niniiiiNEWnZxamZZ1 1121*))1log(log(arg**•For  = (M, B), definecjk = E[ # times letter k appears in motif position j]c0k = E[ # times letter k appears in background]•cij values are calculated easily from Z* valuesIt easily follows:41kjkjkNEWjkccM4100kkkNEWkccBto not allow any 0’s, add pseudocountsExpectation Maximization: M-stepInitial Parameters Matter!Consider the following “artificial” example:x1, …, xN contain:212 patterns on {A, T}: A…A, A…AT,……, T… T212 patterns on {C, G}: C…C , C…CG,…… , G…GD << 212 occurrences of 12-mer ACTGACTGACTGSome local maxima: ½; B = ½C, ½G; Mi = ½A, ½T, i

View Full Document


School:
Email:
New Password:
Confirm Password:

This preview shows page 1-2-14-15-29-30 out of 30 pages.

Stanford CS 262 - Motif Finding

Sign up for free to view:

Please select your school