Stanford CS 262 - BMC Bioinformatics

BioMed Central
BMC Bioinformatics
Open Access
Research article

EasyGene – a prokaryotic gene finder that ranks ORFs by statistical significance

Thomas Schou Larsen*1,2 and Anders Krogh1,3

Address: 1 Center for Biological Sequence Analysis, BioCentrum, Technical University of Denmark, Building 208, 2800 Lyngby, Denmark, 2 Present address: Novozymes A/S, Novo Alle, 1B1.01, 2800 Bagsvaerd, Denmark and 3 Present address: The Bioinformatics Centre, University of Copenhagen, Universitetsparken 15, 2100 Copenhagen, Denmark

Email: Thomas Schou Larsen* - [email protected]; Anders Krogh - [email protected]
* Corresponding author

Keywords: computational gene finding, statistical significance, hidden Markov model, short open reading frames, automated genome annotation

Abstract

Background: Contrary to other areas of sequence analysis, a measure of statistical significance of a putative gene has not been devised to help in discriminating real genes from the masses of random Open Reading Frames (ORFs) in prokaryotic genomes. Therefore, many genomes have too many short ORFs annotated as genes.

Results: In this paper, we present a new automated gene-finding method, EasyGene, which estimates the statistical significance of a predicted gene. The gene finder is based on a hidden Markov model (HMM) that is automatically estimated for a new genome. Using extensions of similarities in Swiss-Prot, a high quality training set of genes is automatically extracted from the genome and used to estimate the HMM. Putative genes are then scored with the HMM, and based on score and length of an ORF, the statistical significance is calculated.
The measure of statistical significance for an ORF is the expected number of ORFs in one megabase of random sequence at the same significance level or better, where the random sequence has the same statistics as the genome in the sense of a third order Markov chain.

Conclusions: The result is a flexible gene finder whose overall performance matches or exceeds other methods. The entire pipeline of computer processing from the raw input of a genome or set of contigs to a list of putative genes with significance is automated, making it easy to apply EasyGene to newly sequenced organisms. EasyGene with pre-trained models can be accessed at http://www.cbs.dtu.dk/services/EasyGene.

Published: 3 June 2003
BMC Bioinformatics 2003, 4:21
Received: 25 November 2002
Accepted: 3 June 2003
This article is available from: http://www.biomedcentral.com/1471-2105/4/21
© 2003 Larsen and Krogh; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.

Background

As of February 2003, 106 microbial genomes have been sequenced and made publicly available and the race is now on to mine genomes such as these for interesting and/or valuable genes and motifs. It has been estimated [1] that 60–80% of the genes in newly sequenced organisms have known homologues in other species. This percentage will grow as genomic annotations progress and perhaps there will be a time when virtually all genes can be found by homology matches to known proteins. That day, however, is not around the corner and even if it were, the occasional odd genes which would nevertheless escape detection by homology may very well be the truly novel and most wanted ones.

It is a common misconception that identification of genes in prokaryotes is almost trivial.
Any random sequence, as well as non-coding regions in real genomes, contains a large number of open reading frames (ORFs). Most of these are too short to be possible protein coding genes, but in many genomes there are many 'random' ORFs longer than e.g. 100 amino acids, a cut-off that is often used for considering an ORF a real gene. The large number of short 'random' ORFs makes it difficult to discriminate real genes from random ORFs below a certain length, which depends on the genome and in particular its GC content. Therefore many genomes are over-annotated [2]. In one genome, that of A. pernix [3], all ORFs longer than 100 amino acids are annotated as genes, but probably only around half the annotated genes are real [2]. The most severe problem today is to discriminate between short genes and random ORFs, and here the meaning of 'short' is quite organism dependent. One of the most important contributions of this paper is a way to deal with this problem by introducing a statistical significance for an ORF being a gene.

Computational gene finding exploits the statistical differences in codon usage between coding and non-coding regions of DNA [4–6]. The search for a mathematical framework to efficiently capture these differences in codon usage led to Markov chain models and the GeneMark algorithm [7]. In order to facilitate the combination of various Markov chain scores, the application of Hidden Markov Models (HMMs) to gene finding was introduced in a gene finder for E. coli, Ecoparse [8]. These methods relied on a set of known genes for estimating parameters. More recently methods have been developed which start from a raw genome and automatically extract data for estimation. One of these, Glimmer [9], employs interpolated Markov models in order to use the maximum Markov chain order which can reliably be estimated for every oligomer. Another one, Orpheus [1], appeared the same year and calculates coding potentials of ORFs based on codon frequency of similarity-derived genes.
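
The dependence of random ORF counts on GC content noted above can be illustrated with a small simulation. The sketch below is not EasyGene's method and makes several simplifying assumptions: bases are drawn independently at a chosen GC fraction (whereas the paper's null model is a third order Markov chain), an ORF is taken to run from the first ATG after a stop codon to the next in-frame stop, and the helper names (random_dna, orf_lengths, count_long_orfs) are hypothetical.

```python
import random

STOPS = {"TAA", "TAG", "TGA"}
START = "ATG"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def random_dna(length, gc, rng):
    """Draw independent bases with the given GC fraction (assumed null model)."""
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]  # A, C, G, T
    return "".join(rng.choices("ACGT", weights=weights, k=length))


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orf_lengths(seq):
    """Yield the length in codons of each ORF in the three forward frames.

    An ORF here runs from the first ATG after a stop codon to the next
    in-frame stop codon; the stop codon itself is not counted.
    """
    for frame in range(3):
        start = None  # position of the first ATG since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOPS:
                if start is not None:
                    yield (i - start) // 3
                start = None
            elif codon == START and start is None:
                start = i


def count_long_orfs(seq, min_codons=100):
    """Count ORFs of at least min_codons codons on both strands."""
    strands = (seq, reverse_complement(seq))
    return sum(1 for s in strands for n in orf_lengths(s) if n >= min_codons)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs agree
    for gc in (0.3, 0.5, 0.7):
        seq = random_dna(1_000_000, gc, rng)
        n = count_long_orfs(seq)
        print(f"GC={gc:.1f}: {n} random ORFs >= 100 codons per Mb")
```

Because the three stop codons are AT-rich, they become rarer as GC content rises, so a GC-rich random sequence should contain many more ORFs above a fixed length cut-off than an AT-rich one. This is one way to see why a single threshold such as 100 amino acids cannot serve all genomes equally well.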
Most of these gene finders also extract Shine-Dalgarno sequences in order to improve prediction of start codons.

Due to their modular structure, HMMs are a suitable framework for gene finding, and they are now used in GeneMark.hmm [10], GeneMarkS [11,12] and Frame-by-Frame [13]. GeneMarkS uses a mixture of Markov chains of order 0, 1 and 2 in combination with features of already annotated genomes to bootstrap an initial estimation of coding statistics, which is then iteratively improved by the GeneMark.hmm2.1 algorithm. Gibbs sampling

