Gibbs Sampling in Motif FindingGibbs SamplingSlide 3Slide 4Slide 5Slide 6Slide 7Advantages / DisadvantagesRepeats, and a Better Background ModelExample Application: Motifs in YeastProcessing of DataMotifs in Periodic ClustersMotifs in Non-periodic ClustersPhylogenetic Footprinting (Slides by Martin Tompa)Phylogenetic Footprinting (Tagle et al. 1988)Slide 16Slide 17Slide 18Slide 19An Exact Algorithm (generalizing Sankoff and Rousseau 1975)Slide 21Slide 22Slide 23Limits of Motif FindersSlide 25RNA Secondary StructureRNA and TranslationRNA and SplicingSlide 29Slide 30Slide 31Slide 32Modeling RNA Secondary Structure: Context-Free GrammarsA Context Free GrammarExample: modeling a stem loopSlide 36A parse tree: alignment of CFG to sequenceAlignment scores for parses!The Nussinov AlgorithmSlide 40Slide 41The Nussinov Algorithm and CFGsSlide 43Stochastic Context Free GrammarsExampleGibbs Sampling in Motif FindingGibbs Sampling in Motif FindingGibbs Sampling•Given: x1, …, xN, motif length K,background B,•Find:Model MLocations a1,…, aN in x1, …, xNMaximizing log-odds likelihood ratio: NiKkikaikaiixBxkM1 1)(),(logGibbs Sampling•AlignACE: first statistical motif finder•BioProspector: improved version of AlignACEAlgorithm (sketch):1. Initialization:a. Select random locations in sequences x1, …, xNb. Compute an initial model M from these locations2. Sampling Iterations:a. Remove one sequence xib. Recalculate modelc. Pick a new location of motif in xi according to probability the location is a motif occurrenceGibbs SamplingInitialization:•Select random locations a1,…, aN in x1, …, xN•For these locations, compute M:NikakjjxNMi1)(1•That is, Mkj is the number of occurrences of letter j in motif position k, over the totalGibbs SamplingPredictive Update:•Select a sequence x = xi•Remove xi, recompute model:))(()1(1,1NisskajkjjxBNMswhere j are pseudocounts to avoid 0s,and B = j j MGibbs SamplingSampling:For every K-long word xj,…,xj+k-1 in x:Qj = Prob[ word | motif ] = M(1,xj)…M(k,xj+k-1)Pi = Prob[ word | background ] B(xj)…B(xj+k-1)Let Sample a random new position ai according to the probabilities A1,…, A|x|-k+1.1||1//kxjjjjjjPQPQA0 |x|ProbGibbs SamplingRunning Gibbs Sampling:1. Initialize2. Run until convergence3. Repeat 1,2 several times, report common motifsAdvantages / Disadvantages•Very similar to EMAdvantages:•Easier to implement•Less dependent on initial parameters•More versatile, easier to enhance with heuristicsDisadvantages:•More dependent on all sequences to exhibit the motif•Less systematic search of initial parameter spaceRepeats, and a Better Background Model•Repeat DNA can be confused as motifEspecially low-complexity CACACA… AAAAA, etc.Solution: more elaborate background model0th order: B = { pA, pC, pG, pT }1st order: B = { P(A|A), P(A|C), …, P(T|T) }…Kth order: B = { P(X | b1…bK); X, bi{A,C,G,T} }Has been applied to EM and Gibbs (up to 3rd order)Example Application: Motifs in YeastGroup:Tavazoie et al. 1999, G. Church’s lab, HarvardData:•Microarrays on 6,220 mRNAs from yeast Affymetrix chips (Cho et al.)•15 time points across two cell cyclesProcessing of Data1. Selection of 3,000 genesGenes with most variable expression were selected2. Clustering according to common expression•K-means clustering•30 clusters, 50-190 genes/cluster•Clusters correlate well with known function3. AlignACE motif finding •600-long upstream regions•50 regions/trialMotifs in Periodic ClustersMotifs in Non-periodic ClustersPhylogenetic FootprintingPhylogenetic Footprinting(Slides by Martin Tompa)(Slides by Martin Tompa)Phylogenetic Footprinting(Tagle et al. 1988)•Functional sequences evolve slower than nonfunctional ones•Consider a set of orthologous sequences from different species•Identify unusually well conserved regionsSubstring Parsimony ProblemGiven:•phylogenetic tree T,•set of orthologous sequences at leaves of T,•length k of motif•threshold dProblem:•Find each set S of k-mers, one k-mer from each leaf, such that the “parsimony” score of S in T is at most d.This problem is NP-hard.Small ExampleAGTCGTACGTGAC... (Human)AGTAGACGTGCCG... (Chimp)ACGTGAGATACGT... (Rabbit)GAACGGAGTACGT... (Mouse)TCGTGACGGTGAT... (Rat)Size of motif sought: k = 4SolutionParsimony score: 1 mutationAGTCGTACGTGAC...AGTAGACGTGCCG...ACGTGAGATACGT...GAACGGAGTACGT...TCGTGACGGTGAT...ACGGACGTACGTACGTCLUSTALW multiple sequence alignment (rbcS gene)Cotton ACGGTT-TCCATTGGATGA---AATGAGATAAGAT---CACTGTGC---TTCTTCCACGTG--GCAGGTTGCCAAAGATA-------AGGCTTTACCATTPea GTTTTT-TCAGTTAGCTTA---GTGGGCATCTTA----CACGTGGC---ATTATTATCCTA--TT-GGTGGCTAATGATA-------AGG--TTAGCACATobacco TAGGAT-GAGATAAGATTA---CTGAGGTGCTTTA---CACGTGGC---ACCTCCATTGTG--GT-GACTTAAATGAAGA-------ATGGCTTAGCACCIce-plant TCCCAT-ACATTGACATAT---ATGGCCCGCCTGCGGCAACAAAAA---AACTAAAGGATA--GCTAGTTGCTACTACAATTC--CCATAACTCACCACCTurnip ATTCAT-ATAAATAGAAGG---TCCGCGAACATTG--AAATGTAGATCATGCGTCAGAATT--GTCCTCTCTTAATAGGA-------A-------GGAGCWheat TATGAT-AAAATGAAATAT---TTTGCCCAGCCA-----ACTCAGTCGCATCCTCGGACAA--TTTGTTATCAAGGAACTCAC--CCAAAAACAAGCAAADuckweed TCGGAT-GGGGGGGCATGAACACTTGCAATCATT-----TCATGACTCATTTCTGAACATGT-GCCCTTGGCAACGTGTAGACTGCCAACATTAATTAAALarch TAACAT-ATGATATAACAC---CGGGCACACATTCCTAAACAAAGAGTGATTTCAAATATATCGTTAATTACGACTAACAAAA--TGAAAGTACAAGACCCotton CAAGAAAAGTTTCCACCCTC------TTTGTGGTCATAATG-GTT-GTAATGTC-ATCTGATTT----AGGATCCAACGTCACCCTTTCTCCCA-----APea C---AAAACTTTTCAATCT-------TGTGTGGTTAATATG-ACT-GCAAAGTTTATCATTTTC----ACAATCCAACAA-ACTGGTTCT---------ATobacco AAAAATAATTTTCCAACCTTT---CATGTGTGGATATTAAG-ATTTGTATAATGTATCAAGAACC-ACATAATCCAATGGTTAGCTTTATTCCAAGATGAIce-plant ATCACACATTCTTCCATTTCATCCCCTTTTTCTTGGATGAG-ATAAGATATGGGTTCCTGCCAC----GTGGCACCATACCATGGTTTGTTA-ACGATAATurnip CAAAAGCATTGGCTCAAGTTG-----AGACGAGTAACCATACACATTCATACGTTTTCTTACAAG-ATAAGATAAGATAATGTTATTTCT---------AWheat GCTAGAAAAAGGTTGTGTGGCAGCCACCTAATGACATGAAGGACT-GAAATTTCCAGCACACACA-A-TGTATCCGACGGCAATGCTTCTTC--------Duckweed ATATAATATTAGAAAAAAATC-----TCCCATAGTATTTAGTATTTACCAAAAGTCACACGACCA-CTAGACTCCAATTTACCCAAATCACTAACCAATTLarch TTCTCGTATAAGGCCACCA-------TTGGTAGACACGTAGTATGCTAAATATGCACCACACACA-CTATCAGATATGGTAGTGGGATCTG--ACGGTCACotton ACCAATCTCT---AAATGTT----GTGAGCT---TAG-GCCAAATTT-TATGACTATA--TAT----AGGGGATTGCACC----AAGGCAGTG-ACACTAPea
View Full Document