UMD CMSC 838T - A Study of GeneWise with the Drosophila Adh Region


A Study of GeneWise with the Drosophila Adh Region

Yi Mo, Moira Regelson, and Mike Sievers
Paracel Inc., Pasadena, CA

ABSTRACT

GeneWise is one of the most accurate computer programs for gene finding, but it is very computationally expensive. Paracel has accelerated GeneWise on its sequence analysis supercomputer, GeneMatcher™. In this study, the performance and scientific validity of Paracel GeneWise (PGW) were assessed by comparing PGW to software GeneWise (SGW), using the Drosophila Adh region as the benchmark. For an equivalent search, PGW running on GeneMatcher2 was 2735 times faster than SGW running on a single 700-MHz Pentium III processor, and yielded effectively the same results. A search was also performed for all Pfam hits in the Adh sequence, comparing PGW to a heuristically accelerated GeneWise (HAGW) approach called "HalfWise." HalfWise uses BLASTX to select potential Pfam hidden Markov models (HMMs) for further analysis with the more computationally expensive GeneWise. The PGW approach had a sensitivity and specificity of up to 87% and 88%, respectively, for identifying Pfam HMM hits, compared to 59% and 93% with the HAGW approach. The exceptional speed and proven scientific validity of Paracel GeneWise make it an indispensable tool for annotations in the genomic era.

INTRODUCTION

With the flood of genomic DNA sequence data, including the human genome (Venter et al., 2001; International Human Genome Sequencing Consortium, 2001), the need for computational tools to rapidly and accurately annotate genomes is ever more pressing.
Among the various software programs for gene finding and genome annotation in large DNA sequences, GeneWise (Birney and Durbin, 2000; http://www.sanger.ac.uk/Software/Wise2) stands out as one of the most accurate (Guigo et al., 2000). GeneWise is a protein homology-based program that uses hidden Markov models (HMMs) to find genes in genomic DNA sequences. By incorporating a protein profile-HMM and a model of DNA splice sites, GeneWise finds the best gene structure prediction and, simultaneously, the alignment of the genomic sequence to the protein profile-HMM or protein sequence. However, GeneWise is a very computationally expensive dynamic programming algorithm, so it is not yet widely used for large-scale genome annotation. To make GeneWise a more practical tool in the genomic age, Paracel has implemented and vastly accelerated the algorithm on GeneMatcher™, a supercomputer for biological sequence analysis.

To assess the performance of Paracel GeneWise (PGW) relative to software GeneWise (SGW), and its scientific validity relative to heuristically accelerated GeneWise (HAGW), we evaluated each approach with a genomic DNA sequence contig of about 2.9 Mb from the Drosophila Adh region. This region has been extensively studied and was used in the Genome Annotation Assessment Project (GASP) (Reese et al., 2000). In this evaluation, we focused on finding all Pfam (Bateman et al., 2000) protein profile-HMMs that occur in the Adh genomic sequence, a study similar to one done by Birney and Durbin (2000). Pfam is a database of protein profile-HMMs and multiple sequence alignments for protein domains and families. It is widely used for genome annotation because of the functional information that can be inferred from similarities to protein domains and families.
METHODS

Searches with HAGW

For HAGW searches we used a Perl script, halfwise.pl (included in the Wise2 software package: http://www.sanger.ac.uk/Software/Wise2), to reduce the computational cost of running GeneWise. The HalfWise approach consists of two steps. In the initial step, HalfWise uses BLASTX to search DNA sequences against a protein database (halfwise.db) consisting of the seed alignments of models in the Pfam 5.5 database. The results of this search are used to select Pfam HMMs with possible hits. In step two, these models are used with the more computationally intensive GeneWise database search and sequence alignment algorithm.

To further reduce the computational requirements of our test, the DNA sequence was split into segments of 100 kb. Except for the terminal piece, which contained the last 100 kb of the Adh sequence, these segments did not overlap.

We performed two HAGW searches using "halfwise.pl" with the database of Adh sequence segments. We performed the initial BLASTX filter step once and selected potential HMMs corresponding to the hits with E-values < 0.001. These HMMs were then used in two GeneWise searches. Both searches were run with the flag "-kbyte" to use all the available physical memory and the "-pthread" option to use multiple CPUs where applicable. The flags "-pretty -gff" were used to ease analysis and 200 alignments were requested.

For the first search, we used the "6LITE" algorithm for both the database search and sequence alignment steps of software GeneWise. 6LITE is the simplified version of the GeneWise algorithm implemented on GeneMatcher™ (Regelson et al., 2000). To ensure an equivalent comparison between SGW and PGW we used the flags "-splice flat -alg 6LITE -aalg 6LITE." This search allowed a direct comparison of speed and scores between SGW and PGW.
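The segmentation and BLASTX pre-filter described above can be illustrated with a minimal Python sketch. It is not taken from halfwise.pl or from the authors' scripts: the function names, the 0-based half-open coordinates, the `(model_name, evalue)` hit layout, and the model names in the usage note are all assumptions made for illustration.

```python
from typing import Iterable, List, Set, Tuple


def segment_bounds(total_len: int, seg_len: int = 100_000) -> List[Tuple[int, int]]:
    """Return (start, end) bounds, 0-based and half-open (an assumed convention).

    All pieces are non-overlapping except the terminal one, which is placed
    so that it covers exactly the last seg_len bases of the sequence.
    """
    if total_len <= 0 or seg_len <= 0:
        raise ValueError("lengths must be positive")
    if total_len <= seg_len:
        return [(0, total_len)]
    # Non-overlapping pieces that start before the terminal piece begins.
    bounds = [(s, s + seg_len) for s in range(0, total_len - seg_len, seg_len)]
    # Terminal piece: the last seg_len bases, possibly overlapping its neighbor.
    bounds.append((total_len - seg_len, total_len))
    return bounds


def select_models(hits: Iterable[Tuple[str, float]], max_evalue: float = 0.001) -> Set[str]:
    """Keep the names of models with at least one hit whose E-value is below the cutoff.

    Each hit is a hypothetical (model_name, evalue) pair parsed from BLASTX output.
    """
    return {name for name, evalue in hits if evalue < max_evalue}
```

For example, `segment_bounds(250, 100)` yields two non-overlapping pieces followed by a terminal piece that overlaps the second one, and `select_models([("modelA", 1e-5), ("modelB", 0.5)])` keeps only `modelA`. Recording each segment's start offset also makes it straightforward to translate segment-local hit positions back to the intact sequence.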
In the second search, a genome annotation approach, we used the 6LITE algorithm for the database search step but performed the sequence alignment step with the 623L algorithm (flag "-aalg 623L"). The 623L algorithm has a looping mode to obtain multiple high-scoring pairs (HSPs) for an HMM query on a single DNA sequence (Birney and Copley, 1999). The score threshold of the database search step was 20 bits (the default) for both HAGW searches. For the results of the second HAGW search we used a bit score threshold of 11 to select hits for further analysis, since the alignment algorithm occasionally produces lower bit scores than the search algorithm.

The start and end positions of each hit on an Adh segment were corrected to their positions in the original, intact Adh sequence. For hits on the same DNA strand with overlapping corrected positions, only the hit with the highest bit score was kept for further analysis.

Searches with PGW

For comparison to the results of the first HAGW search,

