DOC PREVIEW
UMD CMSC 838T - Parallel Computation in Biological Sequence Analysis

This preview shows page 1-2-3-4-5 out of 16 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 16 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 16 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 16 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 16 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 16 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 16 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Parallel Computation in Biological Sequence AnalysisMotivationTalk OverviewTechniques – similarity sequence searchingSlide 5Slide 6Slide 7Slide 8Techniques – multiple sequence alignmentSlide 10Slide 11Slide 12Slide 13Slide 14ObservationsThank you!Parallel Computation in Biological Sequence AnalysisXue WuCMSC 838 PresentationCMSC 838T – PresentationMotivationScanning and analyzing biological sequences are common and repeated tasks in molecular biologyHomologous sequence searchingBased on pairwise alignmentTask is to find similarities between a particular query sequence and all the sequences of a biosequence databankMultiple sequence alignmentSimultaneous alignment of three or more nucleotide or amino acid sequencesProblems with sequential solutionWith the exponential growth of the biosequence banks, homologous sequence searching becomes time consumingThe automatic generation of an accurate multiple alignment is computationally expensiveParallel solution can reduce computation time and provide more accurate resultCMSC 838T – PresentationTalk OverviewOverview of talkMotivationTechniques and EvaluationsSimilarity Sequence SearchingMultiple Sequence AlignmentObservationsCMSC 838T – PresentationTechniques – similarity sequence searchingTwo main parallel methods to search sequence databaseFine grain approach for SIMD parallel computerParallelize the comparison algorithm itselfAll processors cooperate to determine the similarity scoreCoarser grain approach for MIMD parallel computerParallelize the database searchingEach processor performs a selected number of comparisonmethod used in the paperParallelize Similarity Searching – coarser grain approachWorkload balancing is the key point for better parallelismPartition database, combine results from sequential search for each database requires equal-sized pieces of database for load balancePercentage of Load ImBalance (PLIB) as metric for load imbalance100argargestLoadLadSmallestLoestLoadLPLIBCMSC 838T – PresentationTechniques – similarity sequence searchingSplitting up databaseUnsorted portion method – first load balancing techniquePartition the database into a number of portionsPortion_size = database_size / processors_numberIf sequence assignment causes sum of sequence lengths in portion P exceed ideal size by more than X percent, reassign the sequence to portion P+1Low communication overhead, but possibly high PLIBSorted portion method – Master-worker methodSequences are sorted in decreasing length order to minimize PLIBThe master processor distributes the sequences to the worker processors dynamicallyLow PLIB, but high communication overheadCMSC 838T – PresentationTechniques – similarity sequence searchingProposed bucket methodStatically apply sorted portion methodAlgorithmSequences in the database are sorted in decreasing length orderStarting from the longest-length sequence, place the sequences in N buckets. For each sequence,–Find the sum of the sequences length in each bucket–Find the bucket with the smallest sum value–Place the sequence in the bucket–In the case of a tie, the smallest numbered bucket is selectedEach of the N processors performs sequence search in its own bucketIf only N/n processors are used, each processor searches n bucketCMSC 838T – PresentationTechniques – similarity sequence searchingEvaluation and comparisonComparison of Bucket and Portion methodComparison of Bucket and Master-worker methodAlgorithms are implemented on the Intel iPSC/860Preprocessing is performed on SPARC station 2Data source is GenBank (release 86.0)Preprocessing overhead is added for Bucket methodCMSC 838T – PresentationTechniques – similarity sequence searchingEvaluation and comparison continuedConclusionsIn all tested cases, proposed Bucket method hasLower PLIB than Portion methodHigher speedup than master-worker methodBucket method has obvious advantage whenSequences length is relatively smallProcessing with large number of processorsCMSC 838T – PresentationTechniques – multiple sequence alignmentSequential Berger-Munson algorithmCalculate alignmentRejectAcceptCMSC 838T – PresentationTechniques – multiple sequence alignmentSequential Berger-Munson algorithmApplied randomized techniques with optimization to iteratively improve the multiple sequence alignmentDescription1. Randomly partition the input sequences into two groups2. Align two groups of sequences instead of individual sequence with alignment score calculated by3. If the new alignment score is higher than the previous one, the alignment is accepted and the gaps are inserted into the sequences accordingly. 4. The modified or unmodified alignment is used as the input for the next iteration. The process is stopped after q consecutive iterations of rejection.jijijijijiQbasubSPMAXS,1,1,,),('vPwSMAXPjijiji,11,1,vQwSMAXQjijiji1,11,,      11 111 1,,,,1 1,,),(),(),(),('KkKkmLlLlmjmjljmikKkLljlikjiYYsubXXsubYXsubYXsubCMSC 838T – PresentationTechniques – multiple sequence alignmentParallel Berger-Munson algorithm with speculative computationConsecutive sequence of rejected iterations are not dependent on each other and can be done in parallelSequential Iteration numberDecision Sequence1 2 3 4 5 6 7…AAAAARAAARRARRRARRRARRRARRRRR…123423453456456756786789891011910111210111213131415161718192021222324252627280123CMSC 838T – PresentationTechniques – multiple sequence alignmentEvaluationMethod: Improve the alignment generated by experts and other program (CLUSTALV)Data SourceThree different groups of immunoglobulin sequences from Kabat Database (Beta Release 5.0)The average sequence lengths of three groups are similarThe number of sequences are different, which in each group is as twice as the previous groupGroup Name Num. of Seqs Avg. Length Initial ScoreCLLC 93 62 1,574,724HHC3 185 65 6,650,831MKL5 324 83 31,393,504CMSC 838T – PresentationTechniques – multiple sequence alignmentEvaluation continuedAlignment score comparisonApparently, Berger-Munson method provides more accurate alignments with sacrifice of computation time, which is not


View Full Document

UMD CMSC 838T - Parallel Computation in Biological Sequence Analysis

Documents in this Course
Load more
Download Parallel Computation in Biological Sequence Analysis
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Parallel Computation in Biological Sequence Analysis and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Parallel Computation in Biological Sequence Analysis 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?