UMD CMSC 838T - Parallel EST Clustering - D2549254

Home> Schools> University of Maryland, College Park> Computer Science (CMSC) > CMSC 838T> Parallel EST Clustering

DOC PREVIEW

UMD CMSC 838T - Parallel EST Clustering

School name University of Maryland, College Park

Course Cmsc 838t- Advanced Topics in Programming Languages

Pages 36

This preview shows page 1-2-17-18-19-35-36 out of 36 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 36 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 36 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 36 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 36 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 36 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 36 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 36 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 36 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

Parallel EST Clustering by Kalyanaraman, Aluru, and KothariTalk OverviewMotivation: EST ClusteringMotivationBackground: EST Clustering ToolsEST Clustering ToolsEST Clustering Tools’ PerformanceGoalApproachApproach (Cont ..)Approach (Cont …)TriesSuffix Tries (Cont ..)Slide 14Slide 15Parallel Generation of GSTParallel Generation of GST (Cont …)Slide 18Slide 19On-demand Pair GenerationSlide 21Parallel ClusteringParallel Clustering (Cont…)Parallel Clustering (Cont..)Slide 25Experimental environmentQuality AssessmentQuality Assessment (Cont …)Slide 29Quality Assessment (Cont..)Run-time AssessmentRun-time Assessment (Cont …)Run-time Assessment (Cont ..)ResultsObservationsReferences1Parallel EST ClusteringbyKalyanaraman, Aluru, and KothariNargess MemarsadeghiCMSC 838 PresentationCMSC 838T – Presentation2Talk OverviewOverview of talkMotivationBackgroundTechniquesEvaluationRelated workObservationsCMSC 838T – Presentation3Motivation: EST ClusteringProblem: EST ClusteringCluster fragments of cDNARelated to ‘fragment assembly’ problemDetecting overlapping fragments Overlaps can be computed:Pairwise alignment algorithmDynamic programmingAlternative:Approximate overlap detection algorithmsDynamic programmingCMSC 838T – Presentation4MotivationCommon Tools:Takes too longDays for 100,000 ESTsRuns out of memoryThis paper:PaCE: Parallel Clustering of ESTsEfficient parallel EST ClusteringSpace efficient algorithmReduce total workReduce run-timeCMSC 838T – Presentation5Background: EST Clustering ToolsThree traditional software:Originally designed for fragment assembly:TIGR AssemblerPhrapCAP3One parallel software: UICLUSTER: assumes EST’s from 3’ endCMSC 838T – Presentation6EST Clustering ToolsBasic approachFind pairs of similar sequencesAlign similar pairsDynamic programingQuality of EST clusteringPhrap: Fastest avoids dynamic programmingRelies on approximation, lower qualityCAP: Least # of erroneous clustersCMSC 838T – Presentation7 EST Clustering Tools’ PerformanceWith 50,000 maize ESTsUsing PC with dual Pentium 450MHZ , 512 RAM :TIGR: ran out of memoryPhrap: 40 minCAP: > 24 hoursWith 100,000 maize ESTsall ran out of memoryCAP would require 4 daysCMSC 838T – Presentation8GoalSpace efficient algorithmSpace requirement linear in the size of the input data setReduce total workWithout sacrificing quality of clusteringReduce run-time and facilitate the clustering of large data sets Through parallel processingScale memory with # of processorsCMSC 838T – Presentation9ApproachExpense:Pairwise alignment (time + memory)Promising pairs ≈ Common string: |s|= wCost: if common |s|=l > w , then repeats l-w+1 times 2)(# ESTCMSC 838T – Presentation10Approach (Cont ..)Approach: Use trie structure Identify promising pairsMerge clusters with strong overlapsAvoid storing/testing all similar pairsParallel EST Clustering Software:Generalized Suffix Tree (GST) Multiple processors:Maintain and updates EST ClustersOthers generate batches of promising pairs, perform alignmentCMSC 838T – Presentation11Approach (Cont …)CMSC 838T – Presentation12 Tries1) Index for each char2) N leaves3) Height NCMSC 838T – Presentation13Suffix Tries (Cont ..)1) TRIM suffix trieCMSC 838T – Presentation14Suffix Tries (Cont ..)1) Indicies2) Storage O(n), constant is high though3) Common string4) Longest common substringCMSC 838T – Presentation15Suffix Tries (Cont ..)12abab$ab$b3$4$5Given a pattern P = ab we traverse the tree according to the pattern.CMSC 838T – Presentation16Parallel Generation of GSTGST: Generalized Suffix TreeCompacted trieLongest common prefix found in constant time Used for on-demand pair generationSequential: O(nl)Parallel: O(nl/p)CMSC 838T – Presentation17Parallel Generation of GST (Cont …)Previous implementations:CRCW/CREW PRAM modelWork-optimalInvolves alphabetical ordering of charactersUnrealistic assumptionssynchronous operation of processorsinfinite network bandwidthno memory contentionNot practically efficientCMSC 838T – Presentation18Parallel Generation of GST (Cont …)Paper’s approach:EST’s equally distributed among processors Each processor Partitions suffixes of ESTs into buckets Distribute buckets to the processors:All suffixes in a bucket allocated to the same processorTotal # of suffixes allocated to a processor ≈ O ( )w||pnlCMSC 838T – Presentation19Parallel Generation of GST (Cont …)Each bucket’s processor:Compute compacted trie of all its suffixesCannot use sequential constructionSuffixes of a string –not in the same bucketEach bucket:Subtree in the GST Nodes:Depth first search traversal of the triePointer to the right most childCMSC 838T – Presentation20On-demand Pair GenerationA pair should be generated ifShare substring of length ≥ treshholdMaximalLeaves in a common nodeShare a substring of length = depth of nodeParallel algorithmEach processor works with its trie ifDepth of its root in GST < threshholdCMSC 838T – Presentation21On-demand Pair GenerationTo processSort internal nodes Decreasing order of depthLists of a nodeGenerated after processRemoved after parent is processedLimits space O(nl)Run time ≈ # pairs generated + cost of sortingRejected pairs increase run-time by a factor of 2Eliminating duplicates reduce run-timeCMSC 838T – Presentation22Parallel ClusteringMaster-Slave paradigm:Master processor:Maintains and updates clustersUsing union-find data structureReceives messages from slave processors–A batch of next promising pairs generated by slave–Results of the pairwise alignmentDetermines which ones to exploreDetermines if merging should occurSlave processors:Generate pairs on demandPerform pairwise alignments of pairs dispatched by the master processorCMSC 838T – Presentation23Parallel Clustering (Cont…)Organization of Parallel Clustering SoftwareMasterPSlavePSlavePslaveP•Batch of promising pairs generated + results of pairwise alignment•Batchsize or fewer # of pairs + results of pairwise alignemnt on each pairCMSC 838T – Presentation24Parallel Clustering (Cont..)To start:Slave P

View Full Document