Methods for the discovery of cis-regulatory modules 3: Comparative genomics
Statistics 246, Week 15, Spring 2006, Lecture 1

Introduction
In this lecture I'll discuss three papers which seek to find TFBSs using evolutionary conservation. When we look for these in clusters, it is another way to find cis-regulatory modules.

The first paper uses an algorithm that its authors call the Bayes Block Aligner (BBA). I will not describe this algorithm in detail, for two reasons. First, it is a relatively straightforward variant on the Gibbs sampler that we discussed last week, and details can be found in Zhu et al. (1998), referred to there. Second, it is no longer a method anyone would use for identifying evolutionarily conserved genomic sequence (cf. PhastCons, later in this lecture). However, this first paper is of historical interest as being typical of several from around 2000.

Basic Idea
Motifs are sensitive but not very specific, i.e. false positive rates are high when we try to predict them. We attempt to increase specificity by requiring conservation across species.

A caveat: TF binding sites are now thought to evolve much faster than coding sequences; variation in these sites is currently thought to explain most differences between, e.g., humans and chimpanzees, or rats and mice. Of course this is just a thought, not an established truth.

Result of using the Bayes Block Aligner 1
Here we show the probability of alignment for the sequence flanking the 5' end of the first exon of natriuretic propeptide (NPPA). The two-dimensional histogram output of the BBA indicates the probability that any given base j in the human sequence aligns to any given base k in the rat sequence. In the next slide we sum over k.

Result of using the Bayes Block Aligner 2
Probability that a nucleotide in the human NPPA 5' flanking sequence aligns to any nucleotide in the rat sequence.

A broader comparison 1
The authors next compared 28 orthologous gene pairs that are specifically upregulated in skeletal muscle and for which there was, in 2000,
considerable genomic sequence available. For the gene names, see the paper. A set of 99 experimentally defined binding sites exists in their data, including 24 Sp1 sites (G/C-rich sites) and 75 sequence-specific sites, characterized as follows:
(i) Myogenin family (Myf) sites (E-boxes);
(ii) Mef2 sites (A/T-rich sequences);
(iii) SRF sites (CArG boxes);
(iv) Tef sites (MCAT boxes); and
(v) other experimentally defined but incompletely characterized sites.
In the next 3 slides we summarize their results with this data set.

A broader comparison 2
Comment: data like this now exist on a genome-wide basis.

A broader comparison 3
Conservation of genomic sequence between human and rodents, for alignments with a length_rodent/length_human ratio threshold of 0.5. Fraction of human nucleotides identified as conserved in the comparison of human and rodent sequences.

A broader comparison 4
74/75 of skeletal-muscle-specific TF binding sites are confined to the 19% of human sequences that are most conserved in mouse.

Finding TFBSs in unaligned regulatory sequence from coregulated genes
Last week we saw how the Gibbs sampler might help in identifying motifs from unaligned sequence from sets of co-expressed genes arising from microarray experiments. We did not show this, but it works for yeast, where the search space is typically 200-500 bp from the TSS. It does not work for multicellular organisms, where the relevant regions are typically a few kbp and can be 10s or 100s of kbp from the TSS. A reduction of the search space by a factor of 5 (to the 19% most conserved sequence) through human-mouse phylogenetic footprinting is therefore helpful. Greater reductions are possible using conservation across more species, but of course this comes at a price.

Illustration using the skeletal muscle genes
1. Apply the Gibbs sampler to contextual non-coding sequence of many kbp around the human genes: find meaningless patterns.
2. Apply it to just the 19% of sequence found by phylogenetic footprinting: find many of the sites we know are there, and get their motifs.

Comparative analysis of the genome
sequence of four yeast spp.
[Figure: tree based on genic sequence; tree based on intergenic sequence. Branch lengths: average number of substitutions per site.]

The data: 12,000 multiple alignments
There are many uses of these data:
- Studying large-scale genome evolution, e.g. at telomeres
- Studying genome evolution at the nucleotide level, e.g. different rates within and between ORFs (cf. the previous page) and in different branches of the tree
- Gene identification
- Mapping gene structure: start, stop, introns, etc.
- Identifying rapidly and slowly evolving genes

Genome-wide identification of regulatory elements
Gal4 binding sites
Known motifs are used to calibrate a Motif Conservation Score (MCS).

Method for genome-wide motif discovery
1. Find seeds: mini-motifs of the form UVW n(0-21) XYZ, i.e. two fixed triplets separated by a gap of 0 to 21 unspecified bases. These can be enumerated, coming to 45,760 if reverse complements are grouped together.
2. A mini-motif is called conserved if (details omitted)
   (i) the conservation score is significantly high in intergenic regions;
   (ii) the conservation score is higher in intergenic regions than in genic regions;
   (iii) the conservation rate is different upstream and downstream of genes.
3. Conserved mini-motifs are then extended to full motifs by looking for correlated bases, clustering and merging. Again, the details are omitted.
Motifs are then scored in various ways, most importantly via the previously defined motif conservation score (MCS), using the previously calibrated threshold.

The paper lists 72 discovered motifs
As it would take up too much space, I won't list them here. They were obtained by applying a threshold to the MCS; removing overlaps with known motifs leaves 42 new motifs. Attempts were then made to assign candidate functions to these new motifs, using the categories for genes containing them in regulatory regions, based on a collection of experimental data. Again, details are omitted. For us, perhaps the most interesting next step was their brief discussion of combinatorial control, and we now turn to that.

Combinatorial control
The authors searched
for motifs that occur together in the same intergenic region much more frequently than would be expected by chance. This is an interesting and important statistical problem, one far from completely solved. They state that with a single genome few significant correlations are found. The basis for this statement isn't clear, though of course it sounds reasonable. They assert that the Ste12 and Tec1 motifs are found to be significantly correlated, with about 20 of the regions having a conserved occurrence of one also
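
To make "more frequently than would be expected by chance" concrete, here is a minimal sketch of one way to score co-occurrence of two motifs across intergenic regions. The slides do not say which test the authors used; the one-sided hypergeometric (Fisher exact) tail below is a common choice and is an assumption here, as are the function names and the toy numbers, which are made up purely for illustration.

```python
from math import comb


def cooccurrence_pvalue(n_regions, n_a, n_b, n_both):
    """Upper-tail hypergeometric probability of at least n_both shared regions.

    Null model (an assumption, not taken from the paper): the n_b regions
    carrying motif B are a uniformly random subset of the n_regions
    intergenic regions, chosen independently of the n_a regions with motif A.
    """
    total = comb(n_regions, n_b)
    tail = sum(
        comb(n_a, k) * comb(n_regions - n_a, n_b - k)
        for k in range(n_both, min(n_a, n_b) + 1)
    )
    return tail / total


def cooccurrence_from_sets(regions_a, regions_b, n_regions):
    """Return (number of shared regions, p-value) for two sets of region IDs."""
    n_both = len(regions_a & regions_b)
    p = cooccurrence_pvalue(n_regions, len(regions_a), len(regions_b), n_both)
    return n_both, p


# Toy, made-up data: 1000 regions; motif A in regions 0-49, motif B in 40-99.
regions_a = set(range(0, 50))
regions_b = set(range(40, 100))
n_both, p = cooccurrence_from_sets(regions_a, regions_b, 1000)
print(n_both, p)
```

This null model treats all regions as exchangeable. It ignores differences in region length and base composition, and a scan over all motif pairs would need a multiple-testing correction. These are some of the reasons the problem is far from completely solved.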
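
Returning to step 1 of the motif discovery method: the seed count of 45,760 can be checked by brute force. The sketch below assumes that each of the six fixed positions ranges over A, C, G, T and that the gap takes every length from 0 to 21 inclusive; this reading is an interpretation of the slide's notation, and the function names are hypothetical. A seed and its reverse complement are counted once.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def count_mini_motifs(max_gap=21):
    """Count triplet-gap-triplet seeds, grouping reverse complements together."""
    triplets = ["".join(p) for p in product("ACGT", repeat=3)]
    total = 0
    for _gap in range(max_gap + 1):
        canonical = set()
        for left in triplets:
            for right in triplets:
                # The reverse complement of left-gap-right is
                # rc(right)-gap-rc(left); keep the smaller of the two as
                # the representative of the pair.
                partner = (reverse_complement(right), reverse_complement(left))
                canonical.add(min((left, right), partner))
        total += len(canonical)
    return total


print(count_mini_motifs())  # 45760
```

For each gap length there are 4^6 = 4096 triplet pairs, of which 64 are their own reverse complement, giving (4096 + 64) / 2 = 2080 classes. Multiplying by the 22 gap lengths gives 45,760, in agreement with the figure quoted in step 1.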