UMD CMSC 838T - Prediction of Complete Gene Structures in Human Genomic DNA - D3045782


School name University of Maryland, College Park

Course Cmsc 838t- Advanced Topics in Programming Languages

Pages 18


Prediction of Complete Gene Structures in Human Genomic DNA

Chris Burge* and Samuel Karlin

Department of Mathematics, Stanford University, Stanford, CA 94305, USA

We introduce a general probabilistic model of the gene structure of human genomic sequences which incorporates descriptions of the basic transcriptional, translational and splicing signals, as well as length distributions and compositional features of exons, introns and intergenic regions. Distinct sets of model parameters are derived to account for the many substantial differences in gene density and structure observed in distinct C + G compositional regions of the human genome. In addition, new models of the donor and acceptor splice signals are described which capture potentially important dependencies between signal positions. The model is applied to the problem of gene identification in a computer program, GENSCAN, which identifies complete exon/intron structures of genes in genomic DNA. Novel features of the program include the capacity to predict multiple genes in a sequence, to deal with partial as well as complete genes, and to predict consistent sets of genes occurring on either or both DNA strands. GENSCAN is shown to have substantially higher accuracy than existing methods when tested on standardized sets of human and vertebrate genes, with 73 to 80% of exons identified exactly. The program is also capable of indicating fairly accurately the reliability of each predicted exon. Consistently high levels of accuracy are observed for sequences of differing C + G content and for distinct groups of vertebrates.

© 1997 Academic Press Limited

Keywords: exon prediction; gene identification; coding sequence; probabilistic model; splice signal

*Corresponding author

Introduction

The problem of identifying genes in genomic DNA sequences by computational methods has attracted considerable research attention in recent years.
From one point of view, the problem is closely related to the fundamental biochemical issues of specifying the precise sequence determinants of transcription, translation and RNA splicing. On the other hand, with the recent shift in the emphasis of the Human Genome Project from physical mapping to intensive sequencing, the problem has taken on significant practical importance, and computer software for exon prediction is routinely used by genome sequencing laboratories (in conjunction with other methods) to help identify genes in newly sequenced regions.

Abbreviations used: Sn, sensitivity; Sp, specificity; CC, correlation coefficient; AC, approximate correlation; ME, missed exons; WE, wrong exons; snRNP, small nuclear ribonucleoprotein particle; snRNA, small nuclear RNA; WMM, weight matrix model; WAM, weight array model; MDD, maximal dependence decomposition.

Many early approaches to the problem focused on prediction of individual functional elements, e.g. promoters, splice sites, coding regions, in isolation (reviewed by Gelfand, 1995). More recently, a number of approaches have been developed which integrate multiple types of information including splice signal sensors, compositional properties of coding and non-coding DNA and in some cases database homology searching in order to predict entire gene structures (sets of spliceable exons) in genomic sequences. Some examples of such programs include: FGENEH (Solovyev et al., 1993), GENMARK (Borodovsky & McIninch, 1993), GeneID (Guigó et al., 1992), Genie (Kulp et al., 1996), GeneParser (Snyder & Stormo, 1995), and GRAIL II (Xu et al., 1994). Fickett (1996) offers an up-to-date introduction to gene finding by computer and points up some of the strengths and weaknesses of currently available methods. Two important limitations noted are that the
majority of current algorithms assume that the input sequence contains exactly one complete gene (so that, when presented with a sequence containing a partial gene or multiple genes, the results generally do not make sense); and that accuracy measured by independent control sets may be considerably lower than was originally thought. The issue of the predictive accuracy of such methods has recently been addressed through an exhaustive comparison of available methods using a large set of vertebrate gene sequences (Burset & Guigó, 1996). The authors conclude that the predictive accuracy of all such programs remains rather low, with less than 50% of exons identified exactly by most programs. Thus, development of new methods (and/or improvement of existing methods) continues to be important.

Here, we introduce a general probabilistic model for the (gene) structure of human genomic sequences and describe the application of this model to the problem of gene prediction in a program called GENSCAN. Our goal in designing the genomic sequence model was to capture the general and specific compositional properties of the distinct functional units of a eukaryotic gene: exon, intron, splice site, promoter, etc. Emphasis was placed on those features which are recognized by the general transcriptional, splicing and translational machinery which process most or all protein coding genes, rather than specialized signals related to transcription or (alternative) splicing of particular genes or gene families. Thus, for example, we include the TATA box and cap site which are present in most eukaryotic promoters, but not specialized or tissue-specific transcription factor binding sites such as those bound by MyoD (e.g. Lassar et al., 1989).
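
The exon-level comparison cited earlier in this passage (fewer than 50% of exons identified exactly by most programs) can be made concrete with a small sketch. It assumes the commonly used exon-level definitions: sensitivity (Sn) is the fraction of actual exons predicted exactly, specificity (Sp) is the fraction of predicted exons that are exactly correct, and ME and WE are the fractions of actual and predicted exons with no overlapping counterpart. The passage itself only names these abbreviations, so the definitions, the function name `exon_level_accuracy` and the coordinates are illustrative assumptions, not values or code from the paper.

```python
def overlaps(a, b):
    """True if two inclusive (start, end) intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def exon_level_accuracy(actual, predicted):
    """Exon-level Sn, Sp, ME and WE for lists of (start, end) exon coordinates.

    Assumed definitions (not taken from the passage):
      Sn = exactly matching exons / actual exons
      Sp = exactly matching exons / predicted exons
      ME = actual exons overlapping no predicted exon / actual exons
      WE = predicted exons overlapping no actual exon / predicted exons
    """
    actual, predicted = set(actual), set(predicted)
    exact = actual & predicted
    missed = [a for a in actual if not any(overlaps(a, p) for p in predicted)]
    wrong = [p for p in predicted if not any(overlaps(p, a) for a in actual)]
    n_act, n_pred = len(actual), len(predicted)
    return {
        "Sn": len(exact) / n_act if n_act else 0.0,
        "Sp": len(exact) / n_pred if n_pred else 0.0,
        "ME": len(missed) / n_act if n_act else 0.0,
        "WE": len(wrong) / n_pred if n_pred else 0.0,
    }


# Hypothetical coordinates, for illustration only.
annotated = [(100, 200), (300, 400), (500, 600), (700, 800)]
predicted = [(100, 200), (300, 410), (900, 950)]
scores = exon_level_accuracy(annotated, predicted)
```

In this toy case only one of the four annotated exons is matched exactly. The exon predicted as (300, 410) overlaps a real exon but has a wrong boundary, so it counts against both Sn and Sp without being a missed or wrong exon.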
Similarly, we use a general three-periodic (inhomogeneous) fifth-order Markov model of coding regions rather than using specialized models of particular protein motifs or database homology information. As a consequence, predictions made by the program do not depend on presence of a similar gene in the protein sequence databases, but instead provide information which is independent and complementary to that provided by homology-based gene identification methods such as searching the protein databases with BLASTX (Gish & States, 1993). Additionally, the model takes into
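
A minimal sketch of what a three-periodic fifth-order Markov model of coding sequence might look like is given here for orientation. Each base is conditioned on the preceding five bases, with a separate table for each of the three codon positions. The function names, the add-one pseudocount, the uniform background used for the log-odds score and the toy training sequence are all assumptions made for illustration. They are not GENSCAN's parameters or implementation.

```python
import math
from collections import Counter

BASES = "ACGT"


def train_periodic_markov(coding_seqs, order=5):
    """Count next-base occurrences per (codon position, preceding k-mer).

    Each training sequence is assumed to start at codon position 0.
    """
    counts = {}
    for seq in coding_seqs:
        seq = seq.upper()
        for i in range(order, len(seq)):
            window = seq[i - order:i + 1]
            if any(b not in BASES for b in window):
                continue  # skip windows containing ambiguous bases such as N
            key = (i % 3, window[:-1])
            counts.setdefault(key, Counter())[window[-1]] += 1
    return counts


def log_prob(counts, phase, context, base, pseudocount=1.0):
    """Smoothed log P(base | context, codon position)."""
    row = counts.get((phase, context), Counter())
    total = sum(row.values()) + pseudocount * len(BASES)
    return math.log((row[base] + pseudocount) / total)


def coding_log_odds(counts, seq, frame=0, order=5, pseudocount=1.0):
    """Log-odds of `seq` under the periodic model versus a uniform background.

    `frame` is the codon position assumed for the first base of `seq`.
    """
    seq = seq.upper()
    score = 0.0
    for i in range(order, len(seq)):
        window = seq[i - order:i + 1]
        if any(b not in BASES for b in window):
            continue
        phase = (i + frame) % 3
        score += log_prob(counts, phase, window[:-1], window[-1], pseudocount)
        score -= math.log(1.0 / len(BASES))  # assumed uniform null model
    return score


# Toy data with order=2 so the tiny training set is informative.
toy_training = ["ATG" + "GCC" * 10 + "TAA"]
model = train_periodic_markov(toy_training, order=2)
in_frame = coding_log_odds(model, "GCCGCCGCC", frame=0, order=2)
out_of_frame = coding_log_odds(model, "GCCGCCGCC", frame=1, order=2)
```

On the toy data the in-frame score is positive, while the shifted frame sees only unobserved contexts and scores zero. This illustrates why a periodic model carries reading-frame information that a homogeneous Markov model would not.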
