UMD CMSC 838T - Improving performance of Multiple Sequence Alignment in Multi-client Environments

Unformatted text preview:

Improving performance of Multiple Sequence Alignment in Multi-client EnvironmentsOverviewMotivationCLUSTALW: Progressive global alignmentOptimization: Query cachingChallenges to cache implementationTechnique: 2-level B-Tree cacheSMPSlide 9Evaluation: ImplementationEvaluation: ResultsObservationsImproving performance of Multiple Sequence Alignment in Multi-client EnvironmentsAaron ZollmanCMSC 838 PresentationCMSC 838T – PresentationOverviewOverview of talkCLUSTALW algorithm, speedup opportunitiesProblems with cachingParallelizing techniqueWeaknessesApplying technique to other bioinformatics problemsCMSC 838T – PresentationMotivationQuery overlap in queries submitted to MSA toolsSingle researcher: new sequences vs. databaseMultiple researchers: similar subsetsCLUSTALW: Progressive algorithmThree stepsProgressive refinementOpportunities for speedupCachingQuery orderingCMSC 838T – PresentationCLUSTALW: Progressive global alignmentStep 1: Pairwise alignment, distance matrixFast technique calculates distance between two scoresCalculated for all sequence pairsCost: O(q2l2) Step 2: Guide treeGroup nearest firstBuild tree sequentiallyCost: O(q3)Step 3: Progressive alignmentAlign, starting at leaves of treeCost: O(ql2)* q sequences – mean length lCMSC 838T – PresentationOptimization: Query cachingStep 1: Pairwise alignment, building distance matrixMany requests partially duplicatedIndividual distance calculation not dependent on rest of queryObservation: Dominant step in execution timeSteps 2, 3:Output dependent on results of entire queryResults less reusableTechnique: cache output of step 1Individual distancesMLI… GIS… QPA…MLISHSDLNQ… 0.0GISRETSS… 0.0QPAKKTYTW… 0.0Query 1Query 2MLI… GIS… MST…MLISHSDLNQ… 0.0GISRETSS… 0.0MSTVTKYFYKGE… 0.0CMSC 838T – PresentationChallenges to cache implementationI/O and filesystem overheadLarge cache vs. 2GB file size limitHigh seek times within single fileSearch and insertion overheadSequence: lengthy key Keyed on each pair of sequencesCMSC 838T – PresentationTechnique: 2-level B-Tree cacheLevel 1: Map sequence text to sequence IDHash of sequence?Sequentially assigned numberCache size: O(ql)Level 2: Map ID pairs to calculated distanceConcatenate IDs from level 1Lower Level 1 ID -> upper half of Level 2 keyCache size: O(q2)Distribute level 2 cache across binsRound robin or block allocatedDistribute bins across machines* q sequences – mean length lCMSC 838T – PresentationSMPParallelizable: Pairwise searches performed independentlyFarmed out to query threadsWeb serverQueryThreadQueryThreadQueryThreadQueryThreadLevel 1 maps(per-machine)Level 2 bins(distributed)CMSC 838T – PresentationSMPChallenge: Cache coherenceRead-only?Requires advance knowledge of query detailsOnline update and serialization?Locking, duplicate updatesOffline updates?Per-thread list of cache changesQueryThreadQueryThreadQueryThreadQueryThreadLevel 1 maps(per-machine)Level 2 bins(distributed)CMSC 838T – PresentationEvaluation: ImplementationPublic B-Tree implementation: GIST libraryFirst evaluation on Intel PC(Pentium III 650, 75GB disks)q = 25-1000 sequencesl = 450 amino acids per sequenceSecond evaluation on Sun Fire(Sun Fire 6800, 48*750MHz CPUs, 48GB main memory)l = 417 amino acids per sequenceq = 2-200 sequencesSeeded cache with dummy valuesFuture work: architectural impactCMSC 838T – PresentationEvaluation: ResultsCMSC 838T – PresentationObservationsSimple techniqueCheap and easy to implementCheap and easy to deployUnsupported claim: Are queries really similar?Concern about distribution across processorsPaper mentions latency, workload balancingAlso reliability of distributed binsCache lifetimes?Proposed solution “component-based system”“Hand-wavey”; would like to see


View Full Document

UMD CMSC 838T - Improving performance of Multiple Sequence Alignment in Multi-client Environments

Documents in this Course
Load more
Download Improving performance of Multiple Sequence Alignment in Multi-client Environments
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Improving performance of Multiple Sequence Alignment in Multi-client Environments and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Improving performance of Multiple Sequence Alignment in Multi-client Environments 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?