DOC PREVIEW
Stanford CS 262 - Motif Finding

This preview shows page 1-2-14-15-29-30 out of 30 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 30 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 30 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 30 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 30 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 30 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 30 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 30 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

(Regulatory-) Motif FindingClustering of GenesChromatin ImmunoprecipitationFinding Regulatory MotifsCharacteristics of Regulatory MotifsEssentially a Multiple Local AlignmentAlgorithmsCombinatorial Approaches to Motif FindingDiscrete FormulationsApproachesExhaustive SearchesSlide 12MULTIPROFILERSlide 14Slide 15CONSENSUSSlide 17Slide 18Expectation Maximization in Motif FindingExpectation MaximizationSlide 21Slide 22Slide 23Slide 24Expectation Maximization: E-stepExpectation Maximization: M-stepSlide 27Initial Parameters Matter!Overview of EM AlgorithmSlide 30(Regulatory-) Motif Finding (Regulatory-) Motif FindingClustering of GenesFind binding sites responsible for common expression patternsChromatin ImmunoprecipitationMicroarray Chip“ChIP-chip”Finding Regulatory MotifsGiven a collection of genes with common expression,Find the TF-binding motif in common...Characteristics of Regulatory Motifs•Tiny•Highly Variable•~Constant SizeBecause a constant-size transcription factor binds•Often repeated•Low-complexity-ishEssentially a Multiple Local Alignment•Find “best” multiple local alignment•Alignment score defined differently in probabilistic/combinatorial cases...Algorithms•CombinatorialCONSENSUS, TEIRESIAS, SP-STAR, others•Probabilistic1. Expectation Maximization:MEME2. Gibbs Sampling: AlignACE, BioProspectorCombinatorial Approaches to Motif FindingDiscrete FormulationsGiven sequences S = {x1, …, xn}•A motif W is a consensus string w1…wK•Find motif W* with “best” match to x1, …, xnDefinition of “best”:d(W, xi) = min hamming dist. between W and any word in xid(W, S) = i d(W, xi)Approaches•Exhaustive Searches•CONSENSUS•MULTIPROFILER•TEIRESIAS, SP-STAR, WINNOWERExhaustive Searches1. Pattern-driven algorithm:For W = AA…A to TT…T (4K possibilities)Find d( W, S )Report W* = argmin( d(W, S) )Running time: O( K N 4K )(where N = i |xi|)Advantage: Finds provably “best” motif WDisadvantage: TimeExhaustive Searches2. Sample-driven algorithm:For W = any K-long word occurring in some xiFind d( W, S )Report W* = argmin( d( W, S ) )or, Report a local improvement of W*Running time: O( K N2 )Advantage: TimeDisadvantage: If the true motif is weak and does not occur in datathen a random motif may score better than any instance of true motifMULTIPROFILER•Extended sample-driven approachGiven a K-long word W, define: Nα(W) = words W’ in S s.t. d(W,W’)  αIdea: Assume W is occurrence of true motif W*Will use Nα(W) to correct “errors” in WMULTIPROFILERAssume W differs from true motif W* in at most L positionsDefine: A wordlet G of W is a L-long pattern with blanks, differing from WL is smaller than the word length KExample: K = 7; L = 3W = ACGTTGAG = --A--CGMULTIPROFILERAlgorithm:For each W in S:For L = 1 to Lmax1. Find the α-neighbors of W in S  Nα(W)2. Find all “strong” L-long wordlets G in Na(W)3. For each wordlet G,1. Modify W by the wordlet G  W’2. Compute d(W’, S)Report W* = argmin d(W’, S)Step 1 above: Smaller motif-finding problem; Use exhaustive searchCONSENSUSAlgorithm:Cycle 1:For each word W in S (of fixed length!)For each word W’ in SCreate alignment (gap free) of W, W’Keep the C1 best alignments, A1, …, AC1ACGGTTG , CGAACTT , GGGCTCT …ACGCCTG , AGAACTA , GGGGTGT …CONSENSUSAlgorithm:Cycle t:For each word W in SFor each alignment Aj from cycle t-1Create alignment (gap free) of W, AjKeep the Cl best alignments A1, …, ACtACGGTTG , CGAACTT , GGGCTCT …ACGCCTG , AGAACTA , GGGGTGT …… … …ACGGCTC , AGATCTT , GGCGTCT …CONSENSUS•C1, …, Cn are user-defined heuristic constantsN is sum of sequence lengthsn is the number of sequencesRunning time: O(N2) + O(N C1) + O(N C2) + … + O(N Cn)= O( N2 + NCtotal)Where Ctotal = i Ci, typically O(nC), where C is a big constantExpectation Maximization in Motif FindingExpectation Maximization in Motif FindingAll K-long wordsmotifbackgroundExpectation MaximizationAlgorithm (sketch):1. Given genomic sequences find all K-long words2. Assume each word is motif or background3. Find likeliest Motif ModelBackground Modelclassification of words into either Motif or BackgroundExpectation Maximization•Given sequences x1, …, xN,•Find all k-long words X1,…, Xn•Define motif model: M = (M1,…, MK)Mi = (Mi1,…, Mi4) (assume {A, C, G, T})where Mij = Prob[ letter j occurs in motif position i ]•Define background model:B = B1, …, B4Bi = Prob[ letter j in background sequence ]motifbackgroundACGTM1MKM1BExpectation Maximization•Define Zi1 = { 1, if Xi is motif; 0, otherwise }Zi2 = { 0, if Xi is motif; 1, otherwise }•Given a word Xi = x[1]…x[k],P[ Xi, Zi1=1 ] =  M1x[1]…Mkx[k] P[ Xi, Zi2=1 ] = (1 - ) Bx[1]…Bx[K]Let 1 = ; 2 = (1- )motifbackgroundACGTM1MKM1B 1 – Expectation MaximizationDefine:Parameter space  = (M,B) 1: Motif; 2: BackgroundObjective:Maximize log likelihood of model:    2121111211log)|(log))|(log(),|,...(logj jjijnijiijnini jjijijnZZZXPXPZXXPACGTM1MKM1B 1 – Expectation Maximization•Maximize expected likelihood, in iteration of two steps:Expectation:Find expected value of log likelihood:Maximization:Maximize expected value over , )],|,...([log1ZXXPEnExpectation:Find expected value of log likelihood:  2121111log][)|(log][)],|,...([logj jjijnijiijninZZEXPEZXXPEwhere expected values of Z can be computed as follows:ijiijijijZXPXPXPZE *)|()1()|()|(][21Expectation Maximization: E-stepExpectation Maximization: M-stepMaximization:Maximize expected value over  and  independentlyFor , this is easy:  niniiiiNEWnZxamZZ1 1121*))1log(log(arg**•For  = (M, B), definecjk = E[ # times letter k appears in motif position j]c0k = E[ # times letter k appears in background]•cij values are calculated easily from Z* valuesIt easily follows:41kjkjkNEWjkccM4100kkkNEWkccBto not allow any 0’s, add pseudocountsExpectation Maximization: M-stepInitial Parameters Matter!Consider the following “artificial” example:x1, …, xN contain:212 patterns on {A, T}: A…A, A…AT,……, T… T212 patterns on {C, G}: C…C , C…CG,…… , G…GD << 212 occurrences of 12-mer ACTGACTGACTGSome local maxima: ½; B = ½C, ½G; Mi = ½A, ½T, i


View Full Document

Stanford CS 262 - Motif Finding

Documents in this Course
Lecture 8

Lecture 8

38 pages

Lecture 7

Lecture 7

27 pages

Lecture 4

Lecture 4

12 pages

Lecture 1

Lecture 1

11 pages

Biology

Biology

54 pages

Lecture 7

Lecture 7

45 pages

Load more
Download Motif Finding
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Motif Finding and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Motif Finding 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?