Parallel EST Clustering by Kalyanaraman, Aluru, and KothariTalk OverviewMotivation: EST ClusteringMotivationBackground: EST Clustering ToolsEST Clustering ToolsEST Clustering Tools’ PerformanceGoalApproachApproach (Cont ..)Approach (Cont …)TriesSuffix Tries (Cont ..)Slide 14Slide 15Parallel Generation of GSTParallel Generation of GST (Cont …)Slide 18Slide 19On-demand Pair GenerationSlide 21Parallel ClusteringParallel Clustering (Cont…)Parallel Clustering (Cont..)Slide 25Experimental environmentQuality AssessmentQuality Assessment (Cont …)Slide 29Quality Assessment (Cont..)Run-time AssessmentRun-time Assessment (Cont …)Run-time Assessment (Cont ..)ResultsObservationsReferences1Parallel EST ClusteringbyKalyanaraman, Aluru, and KothariNargess MemarsadeghiCMSC 838 PresentationCMSC 838T – Presentation2Talk OverviewOverview of talkMotivationBackgroundTechniquesEvaluationRelated workObservationsCMSC 838T – Presentation3Motivation: EST ClusteringProblem: EST ClusteringCluster fragments of cDNARelated to ‘fragment assembly’ problemDetecting overlapping fragments Overlaps can be computed:Pairwise alignment algorithmDynamic programmingAlternative:Approximate overlap detection algorithmsDynamic programmingCMSC 838T – Presentation4MotivationCommon Tools:Takes too longDays for 100,000 ESTsRuns out of memoryThis paper:PaCE: Parallel Clustering of ESTsEfficient parallel EST ClusteringSpace efficient algorithmReduce total workReduce run-timeCMSC 838T – Presentation5Background: EST Clustering ToolsThree traditional software:Originally designed for fragment assembly:TIGR AssemblerPhrapCAP3One parallel software: UICLUSTER: assumes EST’s from 3’ endCMSC 838T – Presentation6EST Clustering ToolsBasic approachFind pairs of similar sequencesAlign similar pairsDynamic programingQuality of EST clusteringPhrap: Fastest avoids dynamic programmingRelies on approximation, lower qualityCAP: Least # of erroneous clustersCMSC 838T – Presentation7 EST Clustering Tools’ PerformanceWith 50,000 maize ESTsUsing PC with dual Pentium 450MHZ , 512 RAM :TIGR: ran out of memoryPhrap: 40 minCAP: > 24 hoursWith 100,000 maize ESTsall ran out of memoryCAP would require 4 daysCMSC 838T – Presentation8GoalSpace efficient algorithmSpace requirement linear in the size of the input data setReduce total workWithout sacrificing quality of clusteringReduce run-time and facilitate the clustering of large data sets Through parallel processingScale memory with # of processorsCMSC 838T – Presentation9ApproachExpense:Pairwise alignment (time + memory)Promising pairs ≈ Common string: |s|= wCost: if common |s|=l > w , then repeats l-w+1 times 2)(# ESTCMSC 838T – Presentation10Approach (Cont ..)Approach: Use trie structure Identify promising pairsMerge clusters with strong overlapsAvoid storing/testing all similar pairsParallel EST Clustering Software:Generalized Suffix Tree (GST) Multiple processors:Maintain and updates EST ClustersOthers generate batches of promising pairs, perform alignmentCMSC 838T – Presentation11Approach (Cont …)CMSC 838T – Presentation12 Tries1) Index for each char2) N leaves3) Height NCMSC 838T – Presentation13Suffix Tries (Cont ..)1) TRIM suffix trieCMSC 838T – Presentation14Suffix Tries (Cont ..)1) Indicies2) Storage O(n), constant is high though3) Common string4) Longest common substringCMSC 838T – Presentation15Suffix Tries (Cont ..)12abab$ab$b3$4$5Given a pattern P = ab we traverse the tree according to the pattern.CMSC 838T – Presentation16Parallel Generation of GSTGST: Generalized Suffix TreeCompacted trieLongest common prefix found in constant time Used for on-demand pair generationSequential: O(nl)Parallel: O(nl/p)CMSC 838T – Presentation17Parallel Generation of GST (Cont …)Previous implementations:CRCW/CREW PRAM modelWork-optimalInvolves alphabetical ordering of charactersUnrealistic assumptionssynchronous operation of processorsinfinite network bandwidthno memory contentionNot practically efficientCMSC 838T – Presentation18Parallel Generation of GST (Cont …)Paper’s approach:EST’s equally distributed among processors Each processor Partitions suffixes of ESTs into buckets Distribute buckets to the processors:All suffixes in a bucket allocated to the same processorTotal # of suffixes allocated to a processor ≈ O ( )w||pnlCMSC 838T – Presentation19Parallel Generation of GST (Cont …)Each bucket’s processor:Compute compacted trie of all its suffixesCannot use sequential constructionSuffixes of a string –not in the same bucketEach bucket:Subtree in the GST Nodes:Depth first search traversal of the triePointer to the right most childCMSC 838T – Presentation20On-demand Pair GenerationA pair should be generated ifShare substring of length ≥ treshholdMaximalLeaves in a common nodeShare a substring of length = depth of nodeParallel algorithmEach processor works with its trie ifDepth of its root in GST < threshholdCMSC 838T – Presentation21On-demand Pair GenerationTo processSort internal nodes Decreasing order of depthLists of a nodeGenerated after processRemoved after parent is processedLimits space O(nl)Run time ≈ # pairs generated + cost of sortingRejected pairs increase run-time by a factor of 2Eliminating duplicates reduce run-timeCMSC 838T – Presentation22Parallel ClusteringMaster-Slave paradigm:Master processor:Maintains and updates clustersUsing union-find data structureReceives messages from slave processors–A batch of next promising pairs generated by slave–Results of the pairwise alignmentDetermines which ones to exploreDetermines if merging should occurSlave processors:Generate pairs on demandPerform pairwise alignments of pairs dispatched by the master processorCMSC 838T – Presentation23Parallel Clustering (Cont…)Organization of Parallel Clustering SoftwareMasterPSlavePSlavePslaveP•Batch of promising pairs generated + results of pairwise alignment•Batchsize or fewer # of pairs + results of pairwise alignemnt on each pairCMSC 838T – Presentation24Parallel Clustering (Cont..)To start:Slave P
View Full Document