MIT OpenCourseWare
http://ocw.mit.edu

6.047 / 6.878 Computational Biology: Genomes, Networks, Evolution
Fall 2008

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

Lecture 15 - Comparative Genomics I: Genome annotation
11/23/08

1 Introduction

This lecture and the next discuss recent and current research in comparative genomics being performed in Professor Kellis’ lab. Comparative genomics allows one to infer an understanding of genomes from the study of the evolution of closely related species, and vice versa. This lecture discusses the use of evolution to understand genomes; lecture 16 deals with using genomes to better understand evolution. By understanding genomes, we mean primarily annotating their various parts: protein-coding regions, regulatory motifs, etc. We’ll see later that comparative genomics also allows us to uncover completely new ways in which various elements are processed, which we would not recognize using other methods.

In Dr. Kellis’ lab, mammals, flies and fungi are studied. Slide 6 shows the many species that are part of the data sets that are analyzed. We want to study a wide variety of organisms in order to observe elements that are at different evolutionary distances from humans. This allows the study of processes at different ranges of evolution (different snapshots in time, based on the point of divergence).

There are several reasons why it is important to have closely related species as well as more distantly related species. More closely related species should have very similar functional elements and random differences in the non-functional elements. This is because selection weeds out disruptive mutations in functional regions, while mutations accumulate in the non-functional regions. More distantly related species will likely have significant differences in both their functional and non-functional elements.
Comparing closely related species within a phylogeny allows us to observe individual events that may be difficult to resolve in species that are more widely separated. However, our signal relies on the ability to observe evolutionary events; if we look only at species that are close, there won’t be enough changes to discriminate between functional and non-functional regions. More distantly related species allow us to better identify neutral substitutions.

2 Preliminary steps in comparative genomics

Once we have our sequence data (or if we have a new sequence that we wish to annotate), we begin with multiple alignments of the sequences. We BLAST regions of the genome against other genomes, and then apply sequence alignment techniques to align individual regions to a reference genome for a pair of species. We then perform a series of pairwise alignments, walking up the phylogeny until we have an alignment for all sequences. Because we can align every gene and every intergenic region, we don’t have to rely only on conserved regions: we can align every single region, regardless of whether the conservation in that region is sufficient to allow genome-wide placement across the species. This is because we have ‘anchors’ spanning entire regions, so we can infer that the entire region is conserved as a block and then apply global alignment to the block.

3 Evolutionary signatures

Slide 10 shows results for nucleotide conservation in the DBH gene across several species (note that other species that are not shown on the slide were also used to calculate conservation). To calculate the degree of conservation, a hidden Markov model (HMM) with two states, high conservation and low conservation, was used. The Y-axis shows the score calculated using posterior decoding with this model. There are several interesting features we can observe in these data. We see that there are blocks of conservation separated by regions that are not conserved.
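To make the idea of posterior decoding with a two-state HMM concrete, here is a minimal sketch in Python. It is not the model used to produce the slide: the binary observation encoding (1 if an alignment column is identical across species, 0 otherwise) and all transition, emission and initial probabilities are assumptions chosen purely for illustration. The function name posterior_high is likewise hypothetical.

```python
# Minimal sketch of posterior decoding in a two-state HMM (pure Python).
# ASSUMPTIONS (not from the lecture): observations are 0/1 per alignment
# column (1 = column identical across species), and every probability
# below is an illustrative value, not a fitted parameter.

LOW, HIGH = 0, 1  # state indices: low conservation, high conservation

TRANS = [[0.9, 0.1],   # P(next state | current = LOW)
         [0.1, 0.9]]   # P(next state | current = HIGH)
EMIT = [[0.7, 0.3],    # P(obs = 0), P(obs = 1) in the LOW state
        [0.2, 0.8]]    # P(obs = 0), P(obs = 1) in the HIGH state
INIT = [0.5, 0.5]      # P(first state)


def posterior_high(obs, trans=TRANS, emit=EMIT, init=INIT):
    """Return P(state = HIGH | all observations) for each position,
    computed with the scaled forward-backward algorithm."""
    n = len(obs)
    k = len(init)
    if n == 0:
        return []

    # Forward pass; each row is rescaled to sum to 1 to avoid underflow.
    fwd, scale = [], []
    for t, o in enumerate(obs):
        if t == 0:
            row = [init[s] * emit[s][o] for s in range(k)]
        else:
            prev = fwd[-1]
            row = [emit[s][o] * sum(prev[r] * trans[r][s] for r in range(k))
                   for s in range(k)]
        c = sum(row)
        fwd.append([x / c for x in row])
        scale.append(c)

    # Backward pass, using the same scaling factors.
    bwd = [[1.0] * k for _ in range(n)]
    for t in range(n - 2, -1, -1):
        o_next = obs[t + 1]
        bwd[t] = [
            sum(trans[s][r] * emit[r][o_next] * bwd[t + 1][r]
                for r in range(k)) / scale[t + 1]
            for s in range(k)
        ]

    # Posterior: normalized product of forward and backward values.
    post = []
    for t in range(n):
        gamma = [fwd[t][s] * bwd[t][s] for s in range(k)]
        post.append(gamma[HIGH] / sum(gamma))
    return post
```

Plotting the returned values along the sequence gives a curve of the same kind as the Y-axis score described above: close to 1 inside runs of conserved columns and close to 0 elsewhere, under the assumed parameters.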
The 12 exons are mostly conserved across the species, but certain exons are missing (e.g. zebrafish is missing exon 9). Certain intronic areas have stretches of high conservation as well. We also note the existence of lineage-specific conserved elements. If a region that is thought to be intronic nevertheless appears to be highly conserved, this is evidence that the region is functional.

We want to develop evolutionary signatures for each of the functional types in the genome. The specific function of a region results in selective pressures that give it a characteristic signature of insertions, deletions and mutations. Protein-coding genes exhibit particular frequencies of codon substitution as well as reading-frame conservation. RNA structures undergo compensatory changes that maintain their secondary structure. µRNAs (microRNAs) look different from other RNA genes: here the paired regions do not undergo the compensatory changes described above but are instead very highly conserved, while the intermediate regions are able to diverge. Regulatory motifs are not conserved at exactly the same position; they can move around, since they only need to recruit a factor to a particular region. They show increased conservation across the phylogenetic tree, with small changes that preserve the consensus of the motif even while the primary sequence changes. This lecture will discuss further how to determine protein-coding signatures.

4 Protein-coding signatures

Slide 12 shows a region of a gene near a splice site. The same level of conservation exists on both sides of the splice site, but we notice significant differences between the sequences to the left and to the right of the splice site. Recognizing these differences allows us to construct our signature. To the right of the splice site, gaps occur in multiples of three (thus conserving the frame), whereas to the left of the splice site frame-shifting occurs.
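The gap-length observation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: gaps are written as '-' in already-aligned sequences, each gap run is counted once per sequence it appears in, and the function names gap_lengths and frame_preserving_fraction are hypothetical. It is not the scoring scheme used in the lecture.

```python
# Minimal sketch: what fraction of alignment gaps preserve the reading frame?
# ASSUMPTIONS (not from the lecture): sequences are already aligned, gaps
# are written as '-', and each gap run is counted once per sequence.
import re


def gap_lengths(aligned_seq):
    """Lengths of the maximal runs of '-' in one aligned sequence."""
    return [len(m.group()) for m in re.finditer(r"-+", aligned_seq)]


def frame_preserving_fraction(aligned_seqs):
    """Fraction of gap runs whose length is a multiple of three.

    Returns None when the alignment contains no gaps, since the
    fraction is undefined in that case.
    """
    lengths = [n for seq in aligned_seqs for n in gap_lengths(seq)]
    if not lengths:
        return None
    return sum(1 for n in lengths if n % 3 == 0) / len(lengths)


# Toy alignments, invented for illustration only.
coding_like = ["ATGGCTAAAGGT",
               "ATG---AAAGGT",
               "ATGGCT------"]
noncoding_like = ["ATGGCTAAAGGT",
                  "ATG-CTAAAGGT",
                  "ATGGC--AAGGT"]
```

Calling frame_preserving_fraction on coding_like and on noncoding_like contrasts the two sides of the splice site: gaps of length 3 and 6 keep the frame, while gaps of length 1 and 2 shift it.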
There is also a distinct pattern to the mutations on the right side: the mutations are largely 3-periodic, and certain triplets are exchanged more frequently than others. As a bonus, because we can recognize the change between regions, the splice site becomes immediately obvious as well. By testing for (i) reading-frame conservation and (ii) codon-substitution patterns, we can identify protein-coding regions very accurately.

4.1 Reading-Frame Conservation

By scoring the pressure to stay in the same reading frame (i.e. no gaps whose lengths are not multiples of three), we can easily quantify how likely a region is to be protein-coding. Staying in the same frame

