DOC PREVIEW
Stanford CS 262 - Motif Finding

This preview shows page 1-2 out of 6 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 6 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 6 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 6 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

1Motif FindingMotif FindingRegulation of GenesGeneRegulatory ElementRNA polymerase(Protein)Transcription Factor(Protein)DNARegulation of GenesGeneRNA polymeraseTranscription Factor(Protein)Regulatory ElementDNARegulation of GenesGeneRNA polymeraseTranscription FactorRegulatory ElementDNANew proteinMicroarrays• A 2D array of DNA sequences from thousands of genes• Each spot has many copies of same gene• Allow mRNAs from a sample to hybridize• Measure number of hybridizations per spotFinding Regulatory MotifsFinding Regulatory MotifsTiny Multiple Local Alignments of Many Sequences2Finding Regulatory MotifsGivena collection of genes with common expression,Findthe TF-binding motif in common...Characteristics of Regulatory Motifs• Tiny• Highly Variable• ~Constant Size– Because a constant-size transcription factor binds• Often repeated• Low-complexity-ishProblem DefinitionProbabilisticMotif: Mij; 1 ≤ i ≤ W1 ≤ j ≤ 4Mij= Prob[ letter j, pos i ]Find best M, and positions p1,…, pNin sequencesCombinatorialMotif M: m1…mWSome of the mi’s blankFind M that occurs in all siwith ≤ k differencesGiven a collection of promoter sequences s1,…, sNof genes with common expressionEssentially a Multiple Local Alignment• Find“best” multiple local alignmentAlignment score defined differently in probabilistic/combinatorial cases...Algorithms• Probabilistic1. Expectation Maximization:MEME2. Gibbs Sampling: AlignACE, BioProspector• CombinatorialCONSENSUS, TEIRESIAS, SP-STAR, othersDiscrete Approaches to Motif Finding3Discrete FormulationsGiven sequences S = {x1, …, xn}• A motif W is a consensus string w1…wK• Find motif W*with “best” match to x1, …, xnDefinition of “best”:d(W, xi) = min hamming dist. between W and a word in xid(W, S) = Σid(W, xi)Approaches• Exhaustive Searches• CONSENSUS• MULTIPROFILER, TEIRESIAS, SP-STAR, WINNOWERExhaustive Searches1. Pattern-driven algorithm:For W = AA…A to TT…T (4Kpossibilities)Find d( W, S )Report W* = argmin( d(W, S) )Running time: O( K N 4K)(where N = Σi|xi|)Advantage: Finds provably best motif WDisadvantage: TimeExhaustive Searches2. Sample-driven algorithm:For W = a K-long word in some xiFind d( W, S )Report W* = argmin( d( W, S ) )OR Report a local improvement of W*Running time: O( K N2)Advantage: TimeDisadvantage: If: True motif does not occur in data, andTrue motif is “weak”Then, random motif may score better than any instance of true motifCONSENSUS (1)Algorithm:Cycle 1:For each word W in SFor each word W’ in SCreate alignment (gap free) of W, W’Keep the C1best alignments, A1, …, AC1ACGGTTG , CGAACTT , GGGCTCT …ACGCCTG , AGAACTA , GGGGTGT …CONSENSUS (2)Algorithm (cont’d):Cycle t:For each word W in SFor each alignment Ajfrom cycle t-1Create alignment (gap free) of W, AjKeep the Clbest alignments A1, …, ActACGGTTG , CGAACTT , GGGCTCT …ACGCCTG , AGAACTA , GGGGTGT …… … …ACGGCTC , AGATCTT , GGCGTCT …4CONSENSUS (3)C1, …, Cnare user-defined heuristic constantsRunning time:O(N2) + O(N C1) + O(N C2) + … + O(N Cn)= O( N2+ NCtotal)Where Ctotal= ΣiCi, typically O(nC), where C is a big constantMULTIPROFILER• Extended sample-driven approachGiven a K-long word W, define: Na(W) = words W’ in S s.t. d(W,W’) ≤ aIdea:Assume W is occurrence of true motif W*Will use Na(W) to correct “errors” in WMULTIPROFILER (2)Assume W differs from true motif W*in at most L positionsDefine: A wordletG of W is a L-long pattern with blanks, differing from WExample: K = 7; L = 3W = ACGTTGAG = --A--CGMULTIPROFILER (2)Algorithm:For each W in S:For L = 1 to Lmax1. Find all “strong” L-long wordlets G in Na(W)2. Modify W by the wordlet G -> W’3. Compute d(W’, S)Report W*= argmin d(W’, S)Step 1 above: Smaller motif-finding problem; Use exhaustive searchExpectation Maximization in Motif FindingExpectation Maximization in Motif FindingExpectation Maximization (1)• The MM algorithm, part of MEME package uses Expectation MaximizationAlgorithm (sketch):1. Given genomic sequences find all K-long words2. Assume each word is motif or background3. Find likeliestMotif ModelBackground Modelclassification of words into either Motif or Background5Expectation Maximization (2)• Given sequences x1, …, xN,• Find all k-long words X1,…, Xn• Define motif model: M = (M1,…, MK)Mi= (Mi1,…, Mi4) (assume {A, C, G, T})where Mij= Prob[ motif position i is letter j ]• Define background model:B = B1, …, B4Bi= Prob[ letter j in background sequence ]Expectation Maximization (3)• Define Zi1= { 1, if Xiis motif;0, otherwise }Zi2= { 0, if Xiis motif;1, otherwise }• Given a word Xi= x[1]…x[k],P[ Xi, Zi1=1 ] = λ M1x[1]…Mkx[k] P[ Xi, Zi2=1 ] = (1 - λ) Bx[1]…Bx[K]Let λ1= λ; λ2= (1- λ)Expectation Maximization (4)Define:Parameter space θ = (M,B) Objective:Maximize log likelihood of model:∑ ∑∑∑∑∑= ==== =+==2121111211log)|(log))|(log(),|,...(logj jjijnijiijnini jjijijnZZZXPXPZXXPλθθλλθExpectation Maximization (5)• Maximize expected likelihood, in iteration of two steps:Expectation:Find expected value of log likelihood:Maximization:Maximize expected value over θ, λ)],|,...([log1λθZXXPEnExpectation Maximization (6): EE--stepstepExpectation:Find expected value of log likelihood:∑ ∑∑∑= ===+=2121111log][)|(log][)],|,...([logj jjijnijiijninZZEXPEZXXPEλθλθwhere expected values of Z can be computed as follows:∑==21)|()|(kkikjijijXPXPZθλθλExpectation Maximization (7): MM--stepstepMaximization:Maximize expected value over θ and λ independentlyFor λ, this is easy:∑ ∑= ===niniijjijNEWjnZExamZj1 1log][arg λλλ6Expectation Maximization (8): MM--stepstep• For θ = (M, B), definecjk= E[ # times letter k appears in motif position j]c0k= E[ # times letter k appears in background]It easily follows:∑==41kjkjkNEWjkccM∑==4100kkkNEWkccBto not allow any 0’s, add pseudocountsInitial Parameters Matter!Consider the following “artificial” example:x1, …, xNcontain:– 2Kpatterns A…A, A…AT,……, T…T– 2Kpatterns C…C , C…CG,…… , G…G– D << 2Koccurrences of K-mer ACTG…ACTGSome local maxima:λ ≈ ½; B = ½C, ½G; Mi= ½A, ½T, i = 1,…, Kλ ≈ D/2k+1; B = ¼A,¼C,¼G,¼T; M1= 100% A, M2= 100% C, M3= 100% T, etc.Overview of EM Algorithm1. Initialize parameters θ = (M, B), λ:– Try different values of λ from N-1/2upto 1/(2K)2. Repeat:a. Expectationb. Maximization3. Until change in θ = (M,


View Full Document

Stanford CS 262 - Motif Finding

Documents in this Course
Lecture 8

Lecture 8

38 pages

Lecture 7

Lecture 7

27 pages

Lecture 4

Lecture 4

12 pages

Lecture 1

Lecture 1

11 pages

Biology

Biology

54 pages

Lecture 7

Lecture 7

45 pages

Load more
Download Motif Finding
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Motif Finding and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Motif Finding 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?