Parallel Computation in Biological Sequence AnalysisMotivationTalk OverviewTechniques – similarity sequence searchingSlide 5Slide 6Slide 7Slide 8Techniques – multiple sequence alignmentSlide 10Slide 11Slide 12Slide 13Slide 14ObservationsThank you!Parallel Computation in Biological Sequence AnalysisXue WuCMSC 838 PresentationCMSC 838T – PresentationMotivationScanning and analyzing biological sequences are common and repeated tasks in molecular biologyHomologous sequence searchingBased on pairwise alignmentTask is to find similarities between a particular query sequence and all the sequences of a biosequence databankMultiple sequence alignmentSimultaneous alignment of three or more nucleotide or amino acid sequencesProblems with sequential solutionWith the exponential growth of the biosequence banks, homologous sequence searching becomes time consumingThe automatic generation of an accurate multiple alignment is computationally expensiveParallel solution can reduce computation time and provide more accurate resultCMSC 838T – PresentationTalk OverviewOverview of talkMotivationTechniques and EvaluationsSimilarity Sequence SearchingMultiple Sequence AlignmentObservationsCMSC 838T – PresentationTechniques – similarity sequence searchingTwo main parallel methods to search sequence databaseFine grain approach for SIMD parallel computerParallelize the comparison algorithm itselfAll processors cooperate to determine the similarity scoreCoarser grain approach for MIMD parallel computerParallelize the database searchingEach processor performs a selected number of comparisonmethod used in the paperParallelize Similarity Searching – coarser grain approachWorkload balancing is the key point for better parallelismPartition database, combine results from sequential search for each database requires equal-sized pieces of database for load balancePercentage of Load ImBalance (PLIB) as metric for load imbalance100argargestLoadLadSmallestLoestLoadLPLIBCMSC 838T – PresentationTechniques – similarity sequence searchingSplitting up databaseUnsorted portion method – first load balancing techniquePartition the database into a number of portionsPortion_size = database_size / processors_numberIf sequence assignment causes sum of sequence lengths in portion P exceed ideal size by more than X percent, reassign the sequence to portion P+1Low communication overhead, but possibly high PLIBSorted portion method – Master-worker methodSequences are sorted in decreasing length order to minimize PLIBThe master processor distributes the sequences to the worker processors dynamicallyLow PLIB, but high communication overheadCMSC 838T – PresentationTechniques – similarity sequence searchingProposed bucket methodStatically apply sorted portion methodAlgorithmSequences in the database are sorted in decreasing length orderStarting from the longest-length sequence, place the sequences in N buckets. For each sequence,–Find the sum of the sequences length in each bucket–Find the bucket with the smallest sum value–Place the sequence in the bucket–In the case of a tie, the smallest numbered bucket is selectedEach of the N processors performs sequence search in its own bucketIf only N/n processors are used, each processor searches n bucketCMSC 838T – PresentationTechniques – similarity sequence searchingEvaluation and comparisonComparison of Bucket and Portion methodComparison of Bucket and Master-worker methodAlgorithms are implemented on the Intel iPSC/860Preprocessing is performed on SPARC station 2Data source is GenBank (release 86.0)Preprocessing overhead is added for Bucket methodCMSC 838T – PresentationTechniques – similarity sequence searchingEvaluation and comparison continuedConclusionsIn all tested cases, proposed Bucket method hasLower PLIB than Portion methodHigher speedup than master-worker methodBucket method has obvious advantage whenSequences length is relatively smallProcessing with large number of processorsCMSC 838T – PresentationTechniques – multiple sequence alignmentSequential Berger-Munson algorithmCalculate alignmentRejectAcceptCMSC 838T – PresentationTechniques – multiple sequence alignmentSequential Berger-Munson algorithmApplied randomized techniques with optimization to iteratively improve the multiple sequence alignmentDescription1. Randomly partition the input sequences into two groups2. Align two groups of sequences instead of individual sequence with alignment score calculated by3. If the new alignment score is higher than the previous one, the alignment is accepted and the gaps are inserted into the sequences accordingly. 4. The modified or unmodified alignment is used as the input for the next iteration. The process is stopped after q consecutive iterations of rejection.jijijijijiQbasubSPMAXS,1,1,,),('vPwSMAXPjijiji,11,1,vQwSMAXQjijiji1,11,, 11 111 1,,,,1 1,,),(),(),(),('KkKkmLlLlmjmjljmikKkLljlikjiYYsubXXsubYXsubYXsubCMSC 838T – PresentationTechniques – multiple sequence alignmentParallel Berger-Munson algorithm with speculative computationConsecutive sequence of rejected iterations are not dependent on each other and can be done in parallelSequential Iteration numberDecision Sequence1 2 3 4 5 6 7…AAAAARAAARRARRRARRRARRRARRRRR…123423453456456756786789891011910111210111213131415161718192021222324252627280123CMSC 838T – PresentationTechniques – multiple sequence alignmentEvaluationMethod: Improve the alignment generated by experts and other program (CLUSTALV)Data SourceThree different groups of immunoglobulin sequences from Kabat Database (Beta Release 5.0)The average sequence lengths of three groups are similarThe number of sequences are different, which in each group is as twice as the previous groupGroup Name Num. of Seqs Avg. Length Initial ScoreCLLC 93 62 1,574,724HHC3 185 65 6,650,831MKL5 324 83 31,393,504CMSC 838T – PresentationTechniques – multiple sequence alignmentEvaluation continuedAlignment score comparisonApparently, Berger-Munson method provides more accurate alignments with sacrifice of computation time, which is not
View Full Document