Improving Performance of Multiple Sequence Alignment in Multi-client Environments
Aaron Zollman
CMSC 838T Presentation

Overview
- CLUSTALW algorithm, speedup opportunities
- Problems with caching
- Parallelizing technique
- Weaknesses
- Applying the technique to other bioinformatics problems

Motivation
- Overlap in queries submitted to MSA tools
  - Single researcher: new sequences vs. database
  - Multiple researchers: similar subsets
- CLUSTALW: progressive algorithm
  - Three steps, progressive refinement
- Opportunities for speedup: caching, query ordering

CLUSTALW: Progressive global alignment (q sequences, mean length l)
- Step 1: Pairwise alignment, distance matrix
  - Fast technique calculates the distance between two sequences
  - Calculated for all sequence pairs
  - Cost: O(q²l²)
- Step 2: Guide tree
  - Group nearest sequences first; build tree sequentially
  - Cost: O(q³)
- Step 3: Progressive alignment
  - Align, starting at the leaves of the tree
  - Cost: O(ql²)

Optimization: Query caching
- Step 1 (pairwise alignment, building the distance matrix):
  - Many requests are partially duplicated
  - An individual distance calculation does not depend on the rest of the query
  - Observation: the dominant step in execution time
- Steps 2 and 3:
  - Output depends on the results of the entire query
  - Results less reusable
- Technique: cache the output of step 1 (individual distances)
- (Figure: Query 1 and Query 2 share cached distance entries, e.g. MLISHSDLNQ… and GISRETSS…)

Challenges to cache implementation
- I/O and filesystem overhead
  - Large cache vs. the 2GB file size limit
  - High seek times within a single file
- Search and insertion overhead
  - Sequences are lengthy keys
  - Keyed on each pair of sequences

Technique: 2-level B-Tree cache
- Level 1: map sequence text to a sequence ID
  - Hash of the sequence? A sequentially assigned number
  - Cache size: O(ql)
- Level 2: map ID pairs to the calculated distance
  - Concatenate IDs from level 1; the lower Level 1 ID becomes the upper half of the Level 2 key
  - Cache size: O(q²)
- Distribute the Level 2 cache across bins
  - Round-robin or block-allocated
  - Distribute bins across machines

SMP
- Parallelizable: pairwise searches are performed independently
- Farmed out to query threads
- (Diagram: web server dispatching to query threads; per-machine Level 1 maps, distributed Level 2 bins)

SMP (continued)
- Challenge: cache coherence
  - Read-only? Requires advance knowledge of query details
  - Online update and serialization? Locking, duplicate updates
  - Offline updates? Per-thread list of cache changes

Evaluation: Implementation
- Public B-Tree implementation: the GiST library
- First evaluation on an Intel PC (Pentium III 650, 75GB disks)
  - q = 25-1000 sequences; l = 450 amino acids per sequence
- Second evaluation on a Sun Fire 6800 (48 x 750MHz CPUs, 48GB main memory)
  - q = 2-200 sequences; l = 417 amino acids per sequence
- Seeded the cache with dummy values
- Future work: architectural impact

Evaluation: Results
- (Results presented as charts; not reproduced in this text)

Observations
- Simple technique: cheap and easy to implement and deploy
- Unsupported claim: are queries really similar?
- Concern about distribution across processors
  - The paper mentions latency and workload balancing
  - Also the reliability of distributed bins
- Cache lifetimes?
- Proposed solution, a "component-based system": "hand-wavey"; would like to see
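The all-pairs structure of CLUSTALW's step 1 — one distance per sequence pair, independent of the rest of the query — can be sketched as below. The mismatch-counting distance here is a toy stand-in for CLUSTALW's fast pairwise method, not the real scoring; the point is the O(q²) loop over pairs whose individual results are cacheable.

```python
from itertools import combinations

def pairwise_distance(a: str, b: str) -> float:
    """Toy stand-in for CLUSTALW's fast pairwise distance:
    fraction of mismatches over the shorter sequence (0.0 = identical)."""
    n = min(len(a), len(b))
    if n == 0:
        return 1.0
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / n

def distance_matrix(seqs: list[str]) -> dict[tuple[int, int], float]:
    """Step 1: compute a distance for every pair -- O(q^2) pairs,
    each costing O(l^2) in the real aligner.  Each (i, j) entry
    depends only on those two sequences, which is what makes the
    results reusable across queries."""
    return {(i, j): pairwise_distance(seqs[i], seqs[j])
            for i, j in combinations(range(len(seqs)), 2)}

seqs = ["MLISHSDLNQ", "MLISHSELNQ", "GISRETSS"]
dm = distance_matrix(seqs)
print(dm[(0, 1)])  # one substitution out of 10 positions -> 0.1
```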
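The 2-level cache described in the deck can be sketched as follows, with in-memory dicts standing in for the on-disk B-trees and class/method names that are illustrative, not from the paper. It shows the two mappings (sequence text → sequentially assigned ID; ordered ID pair → distance), the lower-ID-first key convention, and distribution of the Level 2 cache across bins.

```python
class TwoLevelCache:
    """Sketch of the 2-level B-tree cache; plain dicts stand in
    for the B-trees, and bins stand in for per-machine partitions."""

    def __init__(self, num_bins: int = 4):
        self.seq_to_id: dict[str, int] = {}            # level 1: sequence text -> ID
        self.bins = [dict() for _ in range(num_bins)]  # level 2, split across bins

    def seq_id(self, seq: str) -> int:
        # Level 1: a sequentially assigned number (the deck's
        # alternative to hashing the sequence text).
        return self.seq_to_id.setdefault(seq, len(self.seq_to_id))

    def _key(self, a: str, b: str) -> tuple[int, int]:
        i, j = self.seq_id(a), self.seq_id(b)
        # The lower Level 1 ID forms the upper half of the Level 2 key,
        # so (a, b) and (b, a) resolve to the same entry.
        return (min(i, j), max(i, j))

    def get(self, a: str, b: str):
        key = self._key(a, b)
        return self.bins[key[0] % len(self.bins)].get(key)  # round-robin binning

    def put(self, a: str, b: str, dist: float) -> None:
        key = self._key(a, b)
        self.bins[key[0] % len(self.bins)][key] = dist

cache = TwoLevelCache()
cache.put("MLISHSDLNQ", "GISRETSS", 0.82)
print(cache.get("GISRETSS", "MLISHSDLNQ"))  # 0.82 -- order-insensitive lookup
```

Keying Level 2 on short integer IDs rather than the sequence text itself is what keeps the pair index compact: the lengthy sequence string is stored once in Level 1 instead of appearing in every pair key.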
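The offline-update answer to the coherence question — each query thread keeps a private list of cache changes, merged serially after the queries finish — might look like the sketch below. `compute_distance` is a hypothetical stand-in for the pairwise step; the cache is a plain dict rather than the distributed bins.

```python
import threading

def compute_distance(a: str, b: str) -> float:
    # Hypothetical stand-in for the pairwise distance calculation.
    return abs(len(a) - len(b)) / max(len(a), len(b))

def run_queries(queries, cache):
    """Offline-update scheme: threads treat the shared cache as read-only
    while running and record misses in per-thread change lists; the main
    thread merges the lists serially afterward, so no locking is needed
    during the queries (duplicate updates are harmless -- last write wins)."""
    pending = []                              # one change-list per thread

    def worker(pairs, changes):
        for a, b in pairs:
            key = (min(a, b), max(a, b))      # order-insensitive pair key
            if key not in cache:              # read-only access while running
                changes.append((key, compute_distance(a, b)))

    threads = []
    for pairs in queries:
        changes = []
        pending.append(changes)
        t = threading.Thread(target=worker, args=(pairs, changes))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()

    for changes in pending:                   # serial merge after all queries
        cache.update(changes)
    return cache

cache = run_queries([[("MLI", "GISRET")],
                     [("GISRET", "MLI"), ("MLI", "QP")]], {})
print(sorted(cache))  # [('GISRET', 'MLI'), ('MLI', 'QP')]
```

Note that both queries compute the ("GISRET", "MLI") distance here — exactly the duplicate work the deck accepts as the price of avoiding locks; the serial merge simply overwrites one copy with the other.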