Stanford STATS 345 - Examining the Current Problems of Whole Genome Comparison - A Review - D555679

School name Stanford University

Course Stats 345- Statistical And Machine Learning Methods For Genomics (bio 268, Biomedin 245, Cs 373, Gene 245)

Pages 26


Examining the Current Problems of Whole Genome Comparison: A Review

Biochemistry 218 Project
Patrick Chain
[email protected]

Contents:
Sequencing: Too fast?
Annotation: Not so fast?
Comparative Genomics: To the Rescue
Alignments: Problems and Progress
Visualizing Data: Not as Easy as it Looks
The Progression of Genome Sequence Alignment Programs
    ASSIRC – Accelerated Search for SImilarity Regions in Chromosomes
    DIALIGN – DIagonal ALIGNment
    DBA – DNA Block Aligner
    PipMaker – Percent Identity Plot MAKER
    GLASS – GLobal Alignment SyStem
    WABA – Wobble Aware Bulk Aligner
    LSH-ALL-PAIRS – Locality-Sensitive Hashing in ALL PAIRS
The Progression of Visualization Tools for Displaying Genomic Comparisons
Conclusion
ACT

With the continuing improvements in high-throughput genomic sequencing and the ever-expanding sequence databases, the scientific community is demanding new software for post-sequencing functional analysis. Whole genome comparisons have been heralded as the next logical step toward solving genomic puzzles, such as determining coding regions, discovering regulatory signals, and deducing the mechanisms and history of genome evolution. However, before any such detailed analyses can be undertaken, methods are required for comparing (alignment programs) and displaying (visualization tools) such large sequences. These two topics are reviewed herein.

Sequencing: Too fast?

The output of sequence data from sequencing centers worldwide, whose capacities are constantly increasing, has been rising at an exponential rate for the past decade or two (see http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html). The first two reports of microbial whole genome sequencing projects were published in 1995.
Only six and a half years later, there are almost 60 completed, annotated genomes available (most of them eubacteria and archaebacteria, but also a yeast), along with draft analyses of several multicellular eukaryotes such as nematode, fly, human, weed, mouse, rat and fish. Still more are under way. The increase in sequencing efficiency means that the bottleneck is no longer the accumulation of raw data but the annotation and analysis of sequences and genomes.

Annotation: Not so fast?

One of the primary goals in analyzing complete genomes is to identify all the functional regions in the sequence, including genes and regulatory regions. Gene finding is relatively straightforward for compact microbial genomes because their intergenic regions are very small, whereas the low "signal-to-noise" ratio of more complex eukaryotic genomes makes gene prediction extremely difficult. Bacterial genomes consist mostly (85-95%) of coding sequence, whereas only ~3% of the human genome is coding; a vast array of eukaryotic organisms have coding fractions that lie between these two extremes.

There are two computational strategies for identifying genes: 1) extrinsic methods, which take advantage of the repository of known or proposed genes and proteins through database similarity searches (for most bacterial genomes, roughly 70% of the annotated genes of any one genome have homologues in other species), and 2) intrinsic (ab initio or de novo) methods, which use probabilistic Hidden Markov Models to predict protein coding regions (these models incorporate into their decision-making the statistical patterns of nucleotide ordering within coding regions, that is, genome features such as relative amino acid usage, codon usage, and dicodon frequencies). Gene-finding programs include CRITICA (Badger and Olsen 1999), GLIMMER (Salzberg et al. 1998, Delcher et al. 1999a), GENMARK (Borodovsky et al. 1993), GRAIL (Uberbacher and Mural 1991, Xu et al. 1994), and GENSCAN (Burge and Karlin 1997).
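
To make the idea behind intrinsic methods concrete, the following is a minimal sketch of a codon-usage log-odds score. It is not the algorithm of any of the programs named above, which use considerably richer models (Hidden Markov Models, dicodon statistics); the function names, the add-one smoothing, and the uniform 1/64 background are assumptions chosen purely for illustration.

```python
import math
from collections import Counter

BASES = "ACGT"
ALL_CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]

def codon_counts(seq):
    """Count codons in reading frame 0; trailing 1-2 bases are ignored."""
    seq = seq.upper()
    return Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))

def train_codon_model(coding_seqs, pseudocount=1.0):
    """Estimate P(codon | coding) from known coding sequences.

    A pseudocount is added to every codon (an illustrative assumption)
    so that unseen codons do not receive zero probability.
    """
    counts = Counter()
    for s in coding_seqs:
        counts.update(codon_counts(s))
    total = sum(counts[c] for c in ALL_CODONS) + pseudocount * len(ALL_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}

def coding_log_odds(seq, model):
    """Sum of log2(P(codon | coding) / (1/64)) over the codons of seq.

    Positive scores mean the codon usage resembles the training set more
    than a uniform background; codons with ambiguous bases are skipped.
    """
    score = 0.0
    for codon, n in codon_counts(seq).items():
        if codon in model:
            score += n * math.log2(model[codon] * 64)
    return score

# Toy usage with made-up training sequences (hypothetical data).
model = train_codon_model(["ATGGCTGCTGCTAAA", "ATGGCTGCTAAAGCT"])
print(coding_log_odds("GCTGCTGCT", model) > 0)   # codons frequent in training
print(coding_log_odds("CCCCCCCCC", model) < 0)   # codons absent from training
```

In practice such a score would be computed for every open reading frame in all six frames and combined with other signals; the sketch only shows how codon statistics can separate coding-like from non-coding-like sequence.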
Automated gene and gene function predictions, although an indispensable requirement for genome sequencing projects, have been the subject of great controversy (Devos and Valencia 2001, Kyrpides and Ouzounis 1999, Galperin and Koonin 1998, Brenner 1999, Dandekar et al. 2000). For example, only one month after the release of the Haemophilus influenzae genome (Fleischmann et al. 1995), 148 amendments to the annotation were published by separate authors (Casari et al. 1995). Since these types of false predictions are misleading and tend to be propagated to other genomes, the importance of appropriate and accurate annotation techniques cannot be overstated.

Comparative Genomics: To the Rescue

The potential for cross-species comparison to help reveal conserved coding regions, as well as other regions of potential biological function, has only recently become clear. The use of alignment-based comparisons to uncover conserved functional elements has been termed "phylogenetic footprinting" (Tagle et al. 1988). Of importance to annotation, this approach obviates the need for a priori knowledge of a sequence motif and complements algorithmic analyses. It is generally believed that homologous genes are relatively well preserved, while non-coding regions show varying degrees of conservation. Non-coding regions that do show conservation are thought to be important for regulating gene expression and maintaining the structural organization of the genome, and may have other, as yet unknown, functions. Several comparative sequence analysis approaches using alignments have recently been used to analyze corresponding coding and non-coding regions from different species, although mainly between human and mouse (Hardison et al. 1997, Lund et al. 2000, Batzoglou et al. 2000, Kent and Zahler 2000a, Dubchak et al. 2000, Jareborg et al. 1999, Stojanovic et al. 1999, Gelfand et al. 2000).
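
As a rough illustration of how an alignment can expose conserved regions, the sketch below scans an existing gapped pairwise alignment with a sliding window and reports percent identity per window. It is a simplified stand-in, not the method of any of the cited studies; the function names, the window and step sizes, and the 70% threshold are assumptions made for the example.

```python
def window_identity(aln_a, aln_b, window=50, step=10):
    """Return (start_column, percent_identity) for windows along an alignment.

    aln_a and aln_b are two rows of a pairwise alignment of equal length,
    with '-' marking gaps. A column counts as a match only when both rows
    carry the same non-gap character.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    aln_a, aln_b = aln_a.upper(), aln_b.upper()
    results = []
    last_start = max(len(aln_a) - window, 0)
    for start in range(0, last_start + 1, step):
        cols = list(zip(aln_a[start:start + window], aln_b[start:start + window]))
        matches = sum(1 for x, y in cols if x == y and x != "-")
        pct = 100.0 * matches / len(cols) if cols else 0.0
        results.append((start, pct))
    return results

def conserved_windows(aln_a, aln_b, threshold=70.0, window=50, step=10):
    """Keep only windows whose identity meets the (assumed) threshold."""
    return [(s, p) for s, p in window_identity(aln_a, aln_b, window, step)
            if p >= threshold]

# Toy usage on a made-up alignment: first half identical, second half not.
row_a = "ACGTACGTAC" + "AAAAAAAAAA"
row_b = "ACGTACGTAC" + "CCCCCCCCCC"
print(conserved_windows(row_a, row_b, threshold=70.0, window=10, step=10))
```

Windows that pass the threshold are candidates for conserved coding or regulatory elements; an appropriate threshold depends on the evolutionary distance between the two species being compared.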
Of course, the utility of cross-species comparative genomics in identifying such regions is greatly influenced by the evolutionary distance between the species in question. Comparative analysis of a number of phylogenetically diverse genomes may provide clues about the selective pressures governing gene/operon clustering, and may offer insights into mechanisms of evolution or reveal patterns in the acquisition of foreign material via horizontal gene transfer. Genome comparisons of more closely related species may also help determine the genetic basis for phenotypic variation and may reveal species-specific regions (signatures) that can be targeted for identification. Detection techniques based on knowledge of such regions have recently proven fruitful for forensic analysis in the recent
