UMD CMSC 838T - Parallel Computation in Biological Sequence Analysis - D1319182

Home> Schools> University of Maryland, College Park> Computer Science (CMSC) > CMSC 838T> Parallel Computation in Biological Sequence Analysis

DOC PREVIEW

UMD CMSC 838T - Parallel Computation in Biological Sequence Analysis

School name University of Maryland, College Park

Course Cmsc 838t- Advanced Topics in Programming Languages

Pages 12

This preview shows page 1-2-3-4 out of 12 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 12 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 12 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 12 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 12 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 12 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 9, NO. 3, MARCH 1998 1F:\LIBRARY\TRANS\PRODUCTION\TPDS\2-INPROD\100177\100177_1.DOC regularpaper97.dot KSM 19,968 01/09/98 4:10 PM 1 / 12Parallel Computationin Biological Sequence AnalysisTieng K. Yap, Ophir Frieder, Senior Member, IEEE,and Robert L. Martino, Member, IEEEAbstract—A massive volume of biological sequence data is available in over 36 different databases worldwide, including thesequence data generated by the Human Genome project. These databases, which also contain biological and bibliographicalinformation, are growing at an exponential rate. Consequently, the computational demands needed to explore and analyze the datacontained in these databases is quickly becoming a great concern. To meet these demands, we must use high performancecomputing systems, such as parallel computers and distributed networks of workstations. We present two parallel computationalmethods for analyzing these biological sequences. The first method is used to retrieve sequences that are homologous to a querysequence. The biological information associated with the homologous sequences found in the database may provide importantclues to the structure and function of the query sequence. The second method, which helps in the prediction of the function,structure, and evolutionary history of biological sequences, is used to align a number of homologous sequences with each other.These two parallel computational methods were implemented and evaluated on an Intel iPSC/860 parallel computer. The resultingperformance demonstrates that parallel computational methods can significantly reduce the computational time needed to analyzethe sequences contained in large databases.Index Terms—Sequence, comparison, alignment, search, retrieval, database, algorithm, parallel, speculative, computation.—————————— ✦ ——————————1 INTRODUCTIONHE field of molecular biology has created many spe-cialized databases which contain diverse information,such as annotated biological sequences, three-dimensionalmolecular structures, and both genetic and physical maps.Keen et al. [31] have compiled a list of molecular biologydatabases (LiMB database) which presently contains 189entries and is continuing to grow. The LiMB database listed36 sequence databases, including the internationally well-known DNA sequence databases GenBank [5], EMBL [17],and DDBJ [15], and the protein sequence databases PIR[19], SWISS-PROT [1], and PDB [7].Given the great deal of knowledge that can be derivedfrom analyzing biological sequences, we developed parallelmethods to reduce the time required to perform two com-putationally intensive analyses: homologous sequencesearching and multiple sequence alignment. Our parallelsearching method reduces the retrieval time by nearly afactor of N, where N is the number of processors. As shownin this paper, a retrieval on the GenBank that took ap-proximately 168 hours using a single processor was re-duced to only 2.6 hours using 64 processors. Our parallelalignment method also reduces the computation time sig-nificantly. Again, as shown in this paper, to align 324 pro-tein sequences, it took about 29 days on a single processor,but only about 13 hours on a 64 processor system. Theseparallel computational methods are also highly scalable.That is, they can be implemented on either a small or alarge number of processors. They can also be implementedon either a parallel computer or a network of workstations.Retrieving homologous sequences from existing databasesis important to the biomedical research community. Whenscientists discover new sequences, they are eager to searchthese databases for sequences that are similar or related tothese discoveries. The biological information associated withthe similar sequences found in the databases may provideimportant clues for determining the structure and function ofthe newly discovered ones. The detail similarity patterns ofthe retrieved sequences can be seen in a multiple sequencealignment, which is obtained by stacking up sequences ontop of each other. A minimal number of gaps are introducedin the sequences so that the number of matches is maximizedaccording to a given optimization function. This analysis hasbeen used successfully to predict the function, structure, andevolutionary history of biological sequences. Multiple se-quence alignment is also used in AIDS and cancer studies,where it has, for example, been used to develop AIDS vac-cine components [48].The remainder of this paper is organized as follows:First, we present algorithms used to perform biological se-quence analysis. Then, we present the parallel computa-tional methods for retrieving homologous sequences fromlarge biological databases. In the following section, we pre-sent heuristic methods for aligning multiple sequences. Weconclude by summarizing the results of our methods in thelast section.1045-9219/98/$10.00 © 1998 IEEE¥¥¥¥¥¥¥¥¥¥¥¥¥¥¥¥• T.K. Yap and R.L. Martino are with the National Institutes of Health,Division of Computer Research and Technology, Bldg. 12A Room 2033, 12South Dr. MSC 5624, Bethesda, MD 20892. E-mail: [email protected], [email protected].• O. Frieder is with the Division of Electrical and Computer Science andEngineering, Florida Institute of Technology, 150 W. University Blvd.,Melbourne, FL 32901-6988. E-mail: [email protected] received 10 Apr. 1996; revised 12 Sept. 1997.For information on obtaining reprints of this article, please send e-mail to:[email protected], and reference IEEECS Log Number 100177.T2 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 9, NO. 3, MARCH 1998F:\LIBRARY\TRANS\PRODUCTION\TPDS\2-INPROD\100177\100177_1.DOC regularpaper97.dot KSM 19,968 01/09/98 4:10 PM 2 / 122 BIOLOGICAL SEQUENCE ANALYSIS ALGORITHMSTo compare two DNA, RNA, or protein sequences, thesimilarities between the individual residues (nucleic oramino acids) in these sequences must first be defined. Asubstitution scoring matrix that defines the similarity scorebetween all possible pairs of residues is required. Two dif-ferent types of matrices are used: one for comparing thenucleic acid sequences and the other for the amino acidsequences. These matrices contain the score for substitutingor exchanging one residue by another. The substitutionmatrices for amino acids were described in detail by Day-hoff et al. [12]. These matrices were

View Full Document