Discovery of Regulatory Elements by a Phylogenetic Footprinting Algorithm

Mathieu Blanchette, Martin Tompa
Computer Science & Engineering, University of Washington

Outline
- How are genes regulated?
- What is phylogenetic footprinting?
- First solution
- Improvements and extensions
- Application to regulation of several important genes

Regulation of Genes
- What turns genes on and off?
- When is a gene turned on or off?
- Where (in which cells) is a gene turned on?
- How many copies of the gene product are produced?

Regulation of Genes
[Figure labels: Transcription Factor, RNA polymerase, DNA, Regulatory Element, Coding region]

Goal
Identify regulatory elements in DNA sequences. These are:
- Binding sites for proteins
- Short substrings (5 to 25 nucleotides)
- Up to 1000 nucleotides (or farther) from the gene
- Inexactly repeating patterns ("motifs")

Phylogenetic Footprinting (Tagle et al. 1988)
Functional sequences evolve slower than nonfunctional ones.
- Consider a set of orthologous sequences from different species
- Identify unusually well conserved regions

Substring Parsimony Problem
Given:
- phylogenetic tree T
- set of orthologous sequences at leaves of T
- length k of motif
- threshold d
Problem: Find each set S of k-mers, one k-mer from each leaf, such that the parsimony score of S in T is at most d.
This problem is NP-hard.

Small Example
AGTCGTACGTGAC  Human
AGTAGACGTGCCG  Chimp
ACGTGAGATACGT  Rabbit
GAACGGAGTACGT  Mouse
TCGTGACGGTGAT  Rat
Size of motif sought: k = 4

Solution
[Tree figure: leaves AGTCGTACGTGAC, AGTAGACGTGCCG, ACGTGAGATACGT, GAACGGAGTACGT, TCGTGACGGTGAT; internal node labels ACGT, ACGT, ACGT, ACGG]
Parsimony score: 1 mutation

CLUSTALW multiple sequence alignment (rbcS gene)
Species: Cotton, Pea, Tobacco, Ice plant, Turnip, Wheat, Duckweed, Larch
[Figure: multiple sequence alignment of the sequences from these eight species]

An Exact Algorithm (generalizing Sankoff and Rousseau 1975)
W_u[s] = best parsimony score for the subtree rooted at node u, if u is labeled with string s.
Each node's table has 4^k entries.
[Tree figure: leaves AGTCGTACGTG, ACGGGACGTGC, ACGTGAGATAC, GAACGGAGTAC, TCGTGACGGTG; each node carries a table of W values, shown for the k-mers ACGG and ACGT]

Recurrence
$W_u[s] = \sum_{v:\ \text{child of } u} \min_t \left( W_v[t] + d(s, t) \right)$

Running Time
Evaluating the recurrence $W_u[s] = \sum_{v:\ \text{child of } u} \min_t \left( W_v[t] + d(s, t) \right)$ takes O(k 4^{2k}) time per node.
- n: number of species
- l: average sequence length
- k: motif length
Total time: O(n k (4^{2k} + l))

Improvements
- A better algorithm reduces the time from O(n k (4^{2k} + l)) to O(n k (4^k + l)).
- Restricting to motifs with parsimony score at most d greatly reduces the number of table entries computed (exponential in d, polynomial in k).
- Amenable to many useful extensions (e.g., allowing insertions and deletions).

Application to actin Gene
- Gilthead sea bream (678 bp)
- Medaka fish (1016 bp)
- Common carp (696 bp)
- Grass carp (917 bp)
- Chicken (871 bp)
- Human (646 bp)
- Rabbit (636 bp)
- Rat (966 bp)
- Mouse (684 bp)
- Hamster (1107 bp)

Common carp
TTGGCATGGCTTTTGTTATTTTTGGCGCTTG ACGGACTGTTACCACTTCACGCCGACTCAACTGCGCAGAGAAAAACTTCAAACGACAACA ACTCAGGATCTAAAAACTGGAACGGCGAAGGTGACGGCAATGTTTTGGCAAATAAGCATCCCCGAAGTTCTACAATGCATCTGAGG ACTCAATGTTTTTTTTTTTTTTTTTTCTTTAGTCATTCCAAATGTTTGTTAAATGCATTGTTCCGAAACTTATTTGCCTCTATGAAGGCTGCCCAGTAATTG
GGAGCATACTTAACATTGTAGTATTGTATGTAAATTATGTAACAAAACAATGACTGGGTTTTTGTACTTTCAGCCTTAATCTTGGGTTTTTTTTTTTTTTTG GTTCCAAAAAACTAAGCTTTACCATTCAAGATGTAAAGGTTTCATTCCCCCTGGCATATTGAAAAAGCTGTGTGGAACGTGGCGGTGCAGACATTTGGTG CCTGTACACTGAC GGGCCAA TAATTCAAATAAAAGTGCACATGTAAGACATCCTACTCTGTGTGATTTTTCTGTTTGTGCTGAGTGAACTT GCTATGAAGTCTTTTAGTGCACTCTTTAATAAAAGTAGTCTTCCCTTAAAGTGTCCCTTCCCTTATGGCCTTCACATTTCTCAACTAGCGCTTCAACTAGAAA GCACTTTAGGGACTGGGATGC

Chicken
ACCGGACTGTTACCAACACCCACACCCCTGTGATGAAACAAAACCCATAAATGCGCATAAAACAAGACGAGA TTGGCATGGCTTTATTTGTTT TTGACTCAGGATTAAAAAACTGGAATGGTGAAGGTGTCAGCAGCAGTCTTAAAATGAAACATGTTGGAGCGA
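
Worked sketch of the exact algorithm

The recurrence from the "An Exact Algorithm" and "Recurrence" slides can be tried out on the five sequences of the Small Example. The Python sketch below is only an illustration, not the authors' implementation. It uses the straightforward version that costs about k 4^{2k} per node, not the improved one. The tree topology is not recoverable from the slides, so the grouping in `TREE` is an assumption, as are the helper names `hamming`, `kmers_of` and `parsimony_table`. The distance d(s, t) is taken to be the Hamming distance, and every sequence is assumed to have length at least k.

```python
from itertools import product

INF = float("inf")


def hamming(s, t):
    """d(s, t): number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))


def kmers_of(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parsimony_table(node, leaves, k):
    """Return {s: W_node[s]} for every k-mer s over the alphabet ACGT.

    A node is either a leaf name (str) or a tuple of child nodes.
    Assumes every leaf sequence has length >= k.
    """
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    if isinstance(node, str):
        # Leaf: a label costs 0 if it occurs in the sequence, otherwise it is disallowed.
        present = kmers_of(leaves[node], k)
        return {s: (0 if s in present else INF) for s in all_kmers}
    table = dict.fromkeys(all_kmers, 0)
    for child in node:
        W_v = parsimony_table(child, leaves, k)
        # Only finite entries of the child table can achieve the minimum.
        finite = [(t, w) for t, w in W_v.items() if w < INF]
        for s in all_kmers:
            # W_u[s] = sum over children v of min_t (W_v[t] + d(s, t))
            table[s] += min(w + hamming(s, t) for t, w in finite)
    return table


# Sequences from the Small Example slide.
LEAVES = {
    "Human": "AGTCGTACGTGAC",
    "Chimp": "AGTAGACGTGCCG",
    "Rabbit": "ACGTGAGATACGT",
    "Mouse": "GAACGGAGTACGT",
    "Rat": "TCGTGACGGTGAT",
}

# Hypothetical topology: the slides' tree figure did not survive extraction.
TREE = ((("Human", "Chimp"), "Rabbit"), ("Mouse", "Rat"))

W_root = parsimony_table(TREE, LEAVES, 4)
best_score = min(W_root.values())
best_labels = sorted(s for s, w in W_root.items() if w == best_score)
print("best parsimony score:", best_score)
print("root labels achieving it:", best_labels)
```

For these sequences the minimum over the root table is 1, in agreement with the Solution slide: ACGT occurs in four of the sequences and the rat sequence contains ACGG, one mutation away. The sketch reports only the best score and the root labels that achieve it. Recovering each set S of k-mers with score at most d would additionally require tracing back through the tables, which is omitted here.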