UMD CMSC 838T - TurboBLAST


Abstract
1. Introduction*
2. Sequence Similarity Searching and BLAST
2.1 General Background
2.2 The BLAST Algorithm
3. The TurboBLAST System
3.1 Strategy
3.2 Implementation
3.3 The TurboHub System
3.4 Task/Database Splitting
4. Performance Results
5. References
Table 1: Results for Example 3

TurboBLAST®: A Parallel Implementation of BLAST Built on the TurboHub

R.D. Bjornson, A.H. Sherman*, S.B. Weston, N. Willard, J. Wing
TurboGenomics, Inc.

Abstract

BLAST (Basic Local Alignment Search Tool) is by far the most widely used application for rapid screening of large sequence databases. This paper describes TurboBLAST, a parallel implementation of BLAST suitable for execution on networked clusters of heterogeneous PCs, workstations, or Macintosh computers.

1. Introduction*

For biological scientists, the characterization of novel DNA and protein sequences is an extremely important problem. One powerful approach is based on detailed assessments of the similarity or homology between novel sequences and sequences in large databases of previously characterized genes or proteins. Both functional and evolutionary information may be inferred when such assessments are used carefully.

Unfortunately, detailed similarity assessments may be quite difficult and time-consuming to obtain, often requiring compute-intensive structural comparisons or human intervention by a skilled biologist at some point in the process. For this reason, it is common to use simpler techniques to screen large databases to identify a modest number of sequences most likely to be worthy of detailed examination.

BLAST (Basic Local Alignment Search Tool) [1-3,9,10] is by far the most widely used application for rapid screening of large sequence databases. The inputs to BLAST are a set of input query sequences and a number of DNA or protein databases.
For each input query sequence, BLAST determines a group of sequences in the databases that have high-scoring pairwise alignments to the query sequence. Alignments are scored with a user-specified scoring matrix that accounts not only for regions of exact matches but also for insertions, deletions, and substitutions of DNA bases or amino acids. Since the BLAST algorithm considers both local and global alignments, it can detect regions of high similarity embedded in otherwise unrelated sequences, often identifying sequences that, upon detailed examination, can provide important clues about the nature and function of the query sequence.

*Contact author.

The National Center for Biotechnology Information (NCBI) develops and distributes an implementation of BLAST that has become something of a “gold standard” for similarity assessment [2]. When run on modern PCs or workstations, it can process large sequence databases quite quickly. However, sequence databases are exploding in size, growing at an exponential rate that far exceeds the rate of increase in hardware capability (which generally obeys Moore’s Law). As a result, the use of the NCBI BLAST application on a single processor has become too costly, inefficient, and time-consuming for many life science laboratories.

To address this problem, we and others have examined the use of parallel computation (e.g., [5,6,13,15]). NCBI itself provides an option that employs multiple threads to accelerate performance on small shared-memory multiprocessors. The common wisdom, and our own experience, is that this works reasonably well for small numbers of processors (2 or 4), but that it does not scale up very well, particularly when multiple databases and small query sequences are used.
Others, most notably SGI [6], have developed alternative versions of BLAST that address some of the performance issues in NCBI’s threaded implementation, but such alternatives diverge from NCBI’s code (leading to significant validation questions) and are difficult to keep in sync with the relatively frequent updates provided by NCBI.

Even a “perfect” threaded implementation of BLAST, however, would have to face the fact that database growth will require the use of increasingly large and costly multiprocessor machines. Given the widespread availability of powerful yet cheap commodity PCs, a potentially superior alternative is the use of networked clusters of such machines.

To pursue this alternative, we have developed TurboBLAST [15], an accelerated, parallel deployment of NCBI BLAST. We use the term “deployment” rather than “implementation” because TurboBLAST delivers high performance not by changing the BLAST algorithm, but by coordinating the use of multiple copies of the unmodified serial NCBI BLAST application. As a result, TurboBLAST supports all of the standard variants of the BLAST algorithm supported in NCBI BLAST (blastn, blastp, blastx, tblastn, and tblastx), it provides results that are effectively identical to those obtained with the NCBI application, and it is easy to keep up with new versions of BLAST as they are distributed by NCBI.

TurboBLAST is available for many parallel computing environments, from heterogeneous clusters of PCs, workstations, and Macintosh computers, to parallel supercomputers, to the worldwide computing grid. All that is required is that each machine have a Java virtual machine (JVM) and a native executable for NCBI’s BLAST application.

In the remainder of this paper, we first provide background on sequence similarity searching in general and provide relevant details about BLAST.
Then we turn to TurboBLAST, first describing our general parallelization strategy and then going on to discuss the implementation of TurboBLAST, the underlying TurboHub system on which it is built, and some of the details of the approach to task and database splitting that is among the most important aspects of TurboBLAST. Finally, we conclude by providing a sampling of benchmark results to illustrate the performance achievable with TurboBLAST.

2. Sequence Similarity Searching and BLAST

2.1 General Background

In this paper we are concerned with comparisons among sequences of letters representing members of two classes of biological sequences: nucleotide sequences, in which each letter represents one of four DNA bases, and peptide or protein sequences, in which each letter represents one of twenty amino acid residues. There are many different methods for comparing such sequences. Some methods, such as those based on the analysis of transformational grammars (cf. [9], Chapter 9), compare sequences
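
As a small illustration of the two sequence classes introduced in Section 2.1, the following Python sketch labels a string of letters as a nucleotide or protein sequence. It is not part of TurboBLAST or NCBI BLAST. The function name `classify_sequence`, the standard one-letter codes, and the rule that a string made up only of A, C, G, and T is treated as a nucleotide sequence are all assumptions made for this example; those four letters are also valid amino acid codes, so such strings are inherently ambiguous.

```python
# Alphabets as described in Section 2.1: four DNA bases and twenty
# amino acid residues, written with the standard one-letter codes.
DNA_BASES = frozenset("ACGT")
AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def classify_sequence(seq: str) -> str:
    """Return "nucleotide", "protein", or "unknown" for a letter sequence.

    Hypothetical helper for illustration only. Assumption: a sequence
    containing only A, C, G, T is reported as a nucleotide sequence,
    even though those letters are also amino acid codes.
    """
    letters = set(seq.upper())
    if not letters:
        return "unknown"
    if letters <= DNA_BASES:
        return "nucleotide"
    if letters <= AMINO_ACIDS:
        return "protein"
    return "unknown"
```

Real sequence files often contain additional ambiguity codes (such as N for an unspecified base), which this sketch deliberately reports as "unknown" rather than guessing.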

