CMU BSC 03510 - Lecture Notes
Evaluation of Gene-Finding Programs on Mammalian Sequences

Sanja Rogic,1 Alan K. Mackworth,2 and Francis B.F. Ouellette3

1Computer Science Department, The University of California at Santa Cruz, Santa Cruz 95064, California; 2Computer Science Department, The University of British Columbia, Vancouver, BC V6T 1Z4, Canada; 3Centre for Molecular Medicine and Therapeutics, Vancouver, BC V5Z 4H4, Canada

We present an independent comparative analysis of seven recently developed gene-finding programs: FGENES, GeneMark.hmm, Genie, Genscan, HMMgene, Morgan, and MZEF. For evaluation purposes we developed a new, thoroughly filtered, and biologically validated dataset of mammalian genomic sequences that does not overlap with the training sets of the programs analyzed. Our analysis shows that the new generation of programs has substantially better results than the programs analyzed in previous studies. The accuracy of the programs was also examined as a function of various sequence and prediction features, such as G + C content of the sequence, length and type of exons, signal type, and score of the exon prediction. This approach pinpoints the strengths and weaknesses of each individual program as well as those of computational gene-finding in general. The dataset used in this analysis (HMR195) as well as the tables with the complete results are available at http://www.cs.ubc.ca/~rogic/evaluation/.

Currently, in genome centers around the world, millions of bases of genomic DNA from different organisms are sequenced every day. With the recently assembled draft sequence of the human genome in hand and the completed sequence to follow in a couple of years, we need to re-evaluate our methods for deciphering such an enormous amount of data.

We present here the results of a comprehensive evaluation of recent computer programs used for the identification of protein coding genes in eukaryotic genomic sequences.
Because we expect that such an analysis will be of interest to both biologists and computer scientists, we will first provide an overview of gene structure and computational methods for gene-finding.

Gene Structure

The genes of most eukaryotic organisms are neither continuous nor contiguous. They are separated by long stretches of intergenic DNA and their coding sequences are interrupted by noncoding introns. Coding sequences occupy just a small fraction of a typical higher eukaryotic genome; the extreme example is the human genome, where an estimate of that fraction at 3% (Duret et al. 1995) was recently confirmed for chromosome 22 (Dunham et al. 1999). To obtain a continuous coding sequence which will be translated into a protein sequence, genes are transcribed into long pre-mRNA molecules that subsequently undergo complex processing to remove intronic sequences and assemble exons to form mRNA. However, assembly of the gene exons in the mature mRNA is not always the same; Mironov et al. (1999) found that at least 35% of human genes are alternatively spliced, having more than one possible exon assembly. The arrangement of genes in genomes is also prone to exceptions. Although usually separated with an intergenic region, there are examples of genes nested within each other (Dunham et al. 1999); that is, one gene located in an intron of another gene or overlapping genes on the same (Schulz and Butler 1989; Ashburner et al. 1999) or opposite (Cooper et al. 1998) DNA strands. The presence of pseudogenes (nonfunctional sequences resembling real genes) which are distributed in numerous copies throughout the genome further complicates the identification of true protein coding genes.

Regulatory regions play a crucial role in gene expression, and their identification is needed to fully comprehend a gene's function, activity, and role in cellular processes.
The location of regulatory regions relative to their target gene is not uniquely determined; the basic regulatory elements, such as the TATA and CAT boxes, are usually found in the upstream proximity of the transcription start site, while the other elements, such as enhancers and silencers, can be located in distant upstream and downstream regions of a gene and sometimes even within the introns of the gene.

This brief overview of genome organization and gene architecture highlights the complexity of gene identification in the sequences of uncharacterized DNA.

5Corresponding author. E-MAIL: [email protected]; FAX: (831) 459–4046.
Article and publication are at www.genome.org/cgi/doi/10.1101/gr.147901.
Genome Research 11:817–832 ©2001 by Cold Spring Harbor Laboratory Press ISSN 1088-9051/01 $5.00; www.genome.org

Computational Methods for Identification of Genes

There are several methods used for the experimental discovery of genes, but they are time-consuming and costly. Accordingly, for the last 15 years researchers have been developing computational methods for gene-finding that can automate, or facilitate, the identification of genes. Two basic approaches have been established for computational gene-finding: the sequence similarity search, or lookup (Fickett 1996), method and the integrated compositional and signal search, or template (Fickett 1996), method. The latter method is also commonly referred to as ab initio gene finding.

Sequence similarity search is a well-established computational method for gene discovery which has been used extensively with considerable success. It is based on sequence conservation due to the functional constraints and is used to search for regions of similarity between an uncharacterized sequence of interest and already characterized sequences in a public sequence database. Significant similarity between two sequences suggests that they are homologous, that is, they have common evolutionary origin.
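
The notion of searching for regions of similarity between a query and characterized sequences can be illustrated with a small local alignment scorer. The sketch below uses the Smith-Waterman recurrence with a linear gap penalty, chosen here only for illustration; it is not the search method of any program discussed in the text. The function names, the scoring values (match 2, mismatch -1, gap -2), and the toy database are all assumptions. It also does not estimate statistical significance, which a real database search needs before homology can be suggested.

```python
def local_alignment_score(query, subject, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score (linear gap penalty).

    Scoring values are illustrative assumptions, not taken from the paper.
    """
    prev = [0] * (len(subject) + 1)  # previous row of the score matrix
    best = 0
    for q in query:
        curr = [0]  # first column is always 0 for local alignment
        for j, s in enumerate(subject, start=1):
            diag = prev[j - 1] + (match if q == s else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def best_hit(query, database):
    """Return (name, score) for the highest-scoring entry in a name -> sequence dict."""
    hits = ((name, local_alignment_score(query, seq)) for name, seq in database.items())
    return max(hits, key=lambda hit: hit[1])


if __name__ == "__main__":
    # Hypothetical toy "database" of already characterized sequences.
    toy_db = {"seq_a": "CCGATTACACC", "seq_b": "TTTTTTTT"}
    print(best_hit("GATTACA", toy_db))
```

A high raw score alone says little: longer sequences score higher by chance, which is why practical searches report significance estimates alongside the score.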
A query sequence can be compared with DNA, protein, or expressed sequence tag (EST) sequences or it can be searched for known sequence motifs. If a query sequence is found to be significantly similar to an already annotated sequence (DNA or protein), we can use the information from the annotated sequence to possibly infer gene structure or function of the query sequence. Comparison with an EST database can provide information if the sequence of interest is transcribed, that is, contains an expressed gene, but will only give incomplete clues about the structure of the whole gene or its function.

Although sequence similarity search has been proven useful in many cases, it has been shown that only a fraction of newly discovered sequences have identifiable homologs in

