Recent advances in gene structure prediction
Michael R Brent¹ and Roderic Guigó²

De novo gene predictors are programs that predict the exon-intron structures of genes using the sequences of one or more genomes as their only input. In the past two years, dual-genome de novo predictors, which exploit local rates and patterns of mutation inferred from alignments between two genomes, have led to significant improvements in accuracy. Systems that exploit more than two genomes simultaneously have only recently begun to appear and are not yet competitive on practical tasks, but offer the greatest hope for near-term improvements. Dual-genome de novo prediction for compact eukaryotic genomes, such as those of Arabidopsis thaliana and Caenorhabditis elegans, is already quite accurate. Although mammalian gene prediction lags behind in accuracy, it is yielding ever more useful results. Coupled with significant improvements in pseudogene detection methods, which have eliminated many false positives, we have reached the point where de novo gene predictions are being used as hypotheses to drive experimental annotation via systematic RT-PCR and sequencing.

Addresses
¹ Laboratory for Computational Genomics, Campus Box 1045, Washington University, One Brookings Drive, St Louis, Missouri 63130, USA
e-mail: brent@cse.wustl.edu
² Research Group in Biomedical Informatics, Institut Municipal d'Investigació Mèdica, Universitat Pompeu Fabra, Centre de Regulació Genòmica, Barcelona, Catalonia, Spain

Current Opinion in Structural Biology 2004, 14:264-272

This review comes from a themed issue on Sequences and topology
Edited by Peer Bork and Christine A Orengo

0959-440X; see front matter
© 2004 Elsevier Ltd. All rights reserved.

DOI 10.1016/j.sbi.2004.05.007

Abbreviations
EHMM, evolutionary HMM; EST, expressed sequence tag; HMM, hidden Markov model; indels, insertions and deletions; ORF, open reading frame; PPT, poly-pyrimidine tract; RT-PCR, reverse transcription-polymerase chain reaction; TSS, transcription start sites; UTR, untranslated
region

Introduction
The past two years have seen the flowering of the genomic era, a period during which metazoan genome sequencing has been transformed from a major international event to a common undertaking that barely makes the covers of scientific journals, much less popular newspapers. The wealth of raw data generated by this technological triumph has greatly accelerated scientific progress, even while it remains far from fully analyzed. It has also driven a series of advances in computational genome analysis, including methods for predicting the exon-intron structures of genes. Such methods can be divided into those that make use of expression data (including sequences from cDNAs and, potentially, data from hybridization experiments) and those that use only the sequences of one or more genomes ('de novo' or 'ab initio' methods). The focus of this review is recent developments in de novo gene prediction for the genomes of higher eukaryotes.

De novo gene predictors can be categorized into those that use a single genome sequence, those that use two genome sequences to infer local rates and patterns of mutation along the genome, and those that use more than two genomes for the same purpose. Single-genome predictors reached a state of relative maturity with the development of systems based on hidden Markov models (HMMs; e.g. GENSCAN [1], GENIE [2] and HMMGENE [3]) and related models (e.g. GENEID [4] and FGENESH [5]). Dual-genome de novo predictors (e.g. SGP-2 [6], SLAM [7] and TWINSCAN [8,9]) have led to the greatest practical improvement in the accuracy of prediction over the past two years. Systems that exploit more than two genomes simultaneously (e.g. [10,11]) have only recently begun to appear and are not yet competitive on practical tasks, but offer the greatest hope for near-term improvements in accuracy.

Since the first animal and plant genomes were sequenced, de novo gene finders have been part of the standard toolbox for genome annotation and analysis. With the
advent of dual-genome predictors, the accuracy for compact genomes, such as that of Arabidopsis thaliana, has become so good that one-half to two-thirds of all known genes are predicted exactly right, from the start codon through every splice site to the stop codon, and most of the imperfect predictions are only slightly off [12] (Chaochun Wei, personal communication). The accuracy for mammalian genomes has lagged behind, owing to inherent challenges, such as the large number of pseudogenes and small fraction of coding sequence, that affect all mammalian annotation methods. Although dual-genome de novo systems now correctly predict about 75% of all known exons at both splice sites, only 15-20% of known gene structures are predicted correctly throughout the coding region [6,9]. Annotation pipelines such as ENSEMBL [13], which require homology to known expressed sequences, are somewhat more accurate at predicting exons of known genes [9], but they tend to miss many predicted exons and genes that can be verified experimentally [14,15,16].

Perhaps the most significant development of the past year in mammalian annotation has been the application of recently developed pseudogene detection methods [17,18], which have eliminated many false positives from both de novo and pipeline-style annotation. Indeed, the advent of dual-genome systems, together with the elimination of many pseudogenes, has improved the de novo prediction accuracy to the point where systematic reverse transcription-polymerase chain reaction (RT-PCR) and sequencing of de novo predictions is a cost-effective complement to sequencing of random cDNA clones, even in mammalian genomes [19].

Recent advances for single-genome predictors have focused on the problem of training and parameter estimation. Often, newly sequenced genomes lack large enough samples of known genes from which to estimate model parameters. Although in such cases it is common practice to use a genome predictor that has
been trained on another species, recent analysis indicates that gene finders trained on a foreign genome tend to perform suboptimally (I Korf, personal communication). To address this limitation, Korf introduces the idea of bootstrap parameter estimation, in which a foreign gene finder is run on a novel genome and the resulting predictions are used to estimate the parameters for gene prediction for the novel genome. Kotlar and Lavner [21] pursue a different strategy that emphasizes detecting periodic