
Performance Optimization of Clustal W: Parallel Clustal W, HT Clustal, and MULTICLUSTAL

Dmitri Mikhailov, Haruna Cofer, and Roberto Gomperts
SGI ChemBio: www.sgi.com/solutions/sciences/chembio

Introduction

Multiple sequence alignments represent a class of powerful bioinformatics tools with many uses in computational biology. Knowledge of multiple alignments (MA) helps to predict secondary and tertiary structures and to detect homologies between newly sequenced genes and existing gene (protein) families. With the adoption of high-throughput (HT) automation, which offers scientists significantly more data with which to make better decisions, it is increasingly important to run MA calculations as quickly as possible to allow real-time decision making in the lab. The popular MA application Clustal W (1) provides a very good example of how the resources of multiprocessor shared-memory computers can be utilized more efficiently. This paper first outlines the efforts undertaken by SGI to parallelize the Clustal W application. The parallel version shows speedups of up to 10x when running Clustal W on 16 CPUs and significantly reduces the time required for data analysis. Second, the development of a high-throughput version of Clustal W called HT Clustal and the different methods of scheduling multiple MA jobs in HT Clustal are discussed. Third, improvements in the recently introduced MULTICLUSTAL algorithm (2) and its efficient use of Parallel Clustal W are reviewed.

Background

The basic idea behind the Clustal W algorithm for building MA is to align the most closely related sequences first. By aligning these sequences early on, one can minimize instances of unnecessary gap insertions. In order to identify the similar sequences, the Clustal W algorithm evaluates all pairwise combinations of the input protein sequences. For N input sequences, one has to evaluate N(N-1)/2 pairs of sequences because of the symmetry of the pairwise matrix. This pairwise (PW) calculation (Fig. 1, left) represents the first stage of the algorithm. It is followed by the construction of a guide tree (GT), the second stage of the Clustal W algorithm. The guide tree places the most closely matching sequences (or groups of sequences) on the same branch (Fig. 1, middle). In this stage, the default Neighbor Joining (NJ) method is used (3), the details of which are beyond the scope of this paper. Although the Clustal W algorithm offers other methods for tree calculation, NJ is known to be more robust and therefore is the only tree-construction method optimized by SGI. The GT is then used as a guide during the third stage, progressive alignment (PA), in which the most related sequences are aligned first, followed by the alignment of more distant sequences or groups of sequences (profile alignment). The result of PA is the final MA, which can be output in a variety of plain-text formats and viewed and analyzed with Clustal X or any other sequence viewer (Fig. 1, right).

Fig. 1. Stages of the Clustal W algorithm. Left to right: pairwise calculation (PW), guide tree calculation (GT), and progressive alignment (PA).

Parallel Code Optimizations

The Clustal W algorithm was parallelized by SGI using OpenMP (4) directives. The changes were added so that the parallel code is executed only if the user requests more than 1 CPU; otherwise, the original serial code is executed.

Time-profile analysis of the original Clustal W, using different numbers of G-protein coupled receptor (GPCR) proteins as inputs (Fig. 2), shows the time fractions spent in each of the stages described in the previous section.

Fig. 2. Time fractions spent in the three Clustal W stages for various input sizes. Total times in minutes are shown at the top of the bars.

Most of the time is spent in the PW stage, although the relative fraction is lower (~50%) for larger alignments compared to ~90% for smaller alignments. Therefore, this stage needs to be parallelized first.
PW Stage Optimization

As mentioned earlier, during the first stage N(N-1)/2 pairwise comparisons have to be made. Because each comparison is independent of the rest, this part of the algorithm can be easily parallelized with the OpenMP "for" construct:

    /* SGI pseudocode: parallelize the pairwise distance matrix calculation */
    #pragma omp parallel private(i,j)
    {
    #pragma omp for schedule(dynamic)
        for (i = Istart; i < Iend; i++) {
            for (j = i + 1; j < Jend; j++)
                calculate_pairwise_matrix_element();
        }
    }   /* End of pragma parallel */

Because the size of the inner j-loop varies, the OpenMP "dynamic" schedule is used in order to avoid load imbalance among the threads. In principle the "static" interleaved schedule could be used here as well, but because each pairwise comparison takes a varying amount of time, the "dynamic" type works better. This implementation is only efficient on a shared-memory system such as the SGI Origin 3000 series.

No matter how well this stage is parallelized, however, the scaling would still be limited to a low number of processors if no further optimization is done. For example, without additional parallelization, the alignment of 1,000 GPCR protein sequences, for which the PW stage accounts for ~50% of the total time (Fig. 2), would be only ~1.8x faster when running on 8 CPUs, according to Amdahl's Law (5). In order to achieve better scaling for larger, more compute-intensive alignments, additional parallel optimizations are required for the second and third stages.

GT Stage Optimization

In the second stage (GT calculation), the most time-consuming part is determining the smallest matrix element corresponding to the next tree branch.
This can be done in parallel by calculating and saving the minimum element of each row concurrently, and then using the saved minimum row elements to find the minimum element of the entire matrix:

    /* SGI pseudocode: finding the smallest matrix element */
    min_row_element = (double *) ckalloc((Iend - Istart) * sizeof(double));

    #pragma omp parallel private(i,j)
    {
    #pragma omp for schedule(dynamic)
        for (i = Istart; i < Iend; i++) {
            for (j = Jstart; j < i; j++) {
                /* Compute smallest element of row i and store it
                   in min_row_element[i] */
            }
        }
    }   /* End of pragma parallel */

    min_matrix_element = min_row_element[Istart];
    for (i = Istart + 1; i < Iend; i++)
        if (min_row_element[i] < min_matrix_element)
            min_matrix_element = min_row_element[i];

Note that the intermediate min_row_element[] array is created to store the smallest row elements computed in the parallel loop.



UMD CMSC 838T - Performance Optimization of Clustal W
