ODU CS 791 - CODING GENES

This preview shows page 1-2-3-4 out of 13 pages.

REVIEWS
NATURE REVIEWS | GENETICS | VOLUME 3 | SEPTEMBER 2002 | 698–699
www.nature.com/reviews/genetics
© 2002 Nature Publishing Group

COMPUTATIONAL PREDICTION OF EUKARYOTIC PROTEIN-CODING GENES

Michael Q. Zhang

The human genome sequence is the book of our life. Buried in this large volume are our genes, which are scattered as small DNA fragments throughout the genome and comprise a small percentage of the total text. Finding these indistinct ‘needles’ in a vast genomic ‘haystack’ can be extremely challenging. In response to this challenge, computational prediction approaches have proliferated in recent years that predict the location and structure of genes. Here, I discuss these approaches and explain why they have become essential for the analyses of newly sequenced genomes.

Watson School of Biological Sciences, Cold Spring Harbor Laboratory, 1 Bungtown Road, PO Box 100, Cold Spring Harbor, New York 11724, USA.
e-mail: [email protected]
doi:10.1038/nrg890

Biology has entered the genomic era. The celebrated draft human genome is already one year old, and a publicly available draft of the mouse genome has recently been assembled (see links to the Ensembl mouse genome server and the University of Santa Cruz Genome Bioinformatics site). At the time of writing, whole-genome sequences for more than 800 organisms (bacteria, archaea and eukaryota, as well as many viruses and organelles) are either complete or being determined (see link to Entrez genome). Driven by this explosion of genome data, gene-finding programs have also proliferated, particularly those that are designed for specific organisms. However, the accuracy with which genes can be predicted is still far from satisfactory: although, at the nucleotide level, 80% of genes are accurately predicted, at the exon level only 45% are predicted, and at the whole-gene level only ~20%. This is why estimates of the number of genes in the human genome are still imprecise (ranging from 30,000 to 100,000 genes).

At present, the annotation of most human genes is based on cDNA sequence data. Systematic ‘full-length’ cDNA sequencing programs, such as those at the Mammalian Gene Collection (MGC) in the USA and at RIKEN (The Institute of Physical and Chemical Research) in Japan, are generating vitally important experimental data towards defining complete gene sets for the human and mouse genomes. Of the best-annotated genes in the REFSEQ database (~17,000), nearly half are from such large-scale cDNA sequencing projects. Given that expressed sequence tags (ESTs) are most often generated from highly expressed transcripts, ab initio gene-prediction approaches need to combine several sources of information, such as from comparisons of human and mouse sequences, to discover new genes or rare transcripts. It is clear that further improvements to gene prediction are much needed. Even if, one day, all human genes were determined experimentally, it would still be important to understand how the structures of genes are organized and defined, and how they can be recognized. The ability to predict a gene structure is both an intellectual and a practical challenge.

Because those interested in gene-prediction approaches come from both biological and computational backgrounds, this review has been written for a broad audience. It provides background information and a survey of the latest developments in gene-prediction programs. It also highlights the problems that face the gene-prediction field and discusses future research goals. I hope to stimulate the best minds in both camps, so that new and creative gene-prediction methods will be developed. Although the accuracy of gene prediction has been steadily improving, the basic algorithms that underlie the various approaches have changed little since 1997. Although there have been many good reviews on this topic, and useful benchmarks in the research (for example, REFS 1–8), a truly fair comparison of the prediction programs is impossible as their performance depends crucially on the specific TRAINING DATA that are used to develop them.

Gene structure and exon classification

The main characteristic of a eukaryotic gene is the organization of its structure into exons and introns (FIG. 1). Generally, all exons can be separated into four classes: 5′ exons, internal exons, 3′ exons and intronless exons (or, simply, intronless genes) (FIG. 2). They can be further subdivided into 12 mutually exclusive subclasses, according to their coding content (FIG. 2a), and it has been shown that these subclasses have different statistical properties⁹. Because a vertebrate gene typically has many exons, internal coding exons (itexons, or internal translated exons) compose the main subclass that has been the focus of all gene-prediction programs. However, the definition of the term ‘exon’ has become confused, either unintentionally (due to lack of knowledge) or intentionally (for convenience). This confusion has led to the term ‘exon’ being used interchangeably with the term ‘coding

all gene-prediction papers refer to four types of ‘exon’, as shown in FIG. 2b; however, these are just the coding regions of the exons. To avoid the misuse of these terms, I refer to subclasses of exons in this article as 5′ CDS, itexon, 3′ CDS and intronless CDS.

Finding internal coding exons

To determine exon–intron organization, an attempt can be made to detect either the introns or the exons. In early studies of pre-mRNA splicing, short splicing signals were identified in introns (FIG. 3): the donor site (5′ splice site or 5′ ss), which is characterized by the consensus AG|GURAGU; the acceptor site (3′ ss), which is characterized by the consensus YYYYYYYYYYNCAG|G; and the less-conserved branch site, which is characterized by CURAY¹⁰. These genetic elements direct the assembly of the SPLICEOSOME by base pairing with the RNA components of the splicing apparatus, which carries out the splicing reaction (FIG. 3). Where short introns, which are mostly found in lower eukaryotes (such as yeast), occur, the intron seems to be recognized molecularly by the interaction of the splicing factors, which bind to both ends of it. Such intron-based gene-structure prediction has also been used in some computer algorithms (for example, POMBE in REF. 11). Recently, however, Lim and

REFSEQ
The NCBI Reference Sequence project (RefSeq) provides curated gene, mRNA and protein sequences that reflect current knowledge about a sequence and its function, and that are available in the GenBank and NCBI databases.

