Parallel Detection of Regulatory Elements with gMP
Bertil Schmidt, Lin Feng, Amey Laud, Yusdi Santoso

Damayanti Gupta
CMSC 838 Presentation
CMSC 838T – Presentation


Motivation
  - Fundamental question
      - How are the expression levels of thousands of genes regulated?
  - Very important
      - Understanding of gene function
      - Response to environment
      - Understand genetic causes of diseases
      - Evaluate effects of drugs
      - Detect mutations
  - Remember
      - Sets of genes -> Pathways -> Genetic Networks
  - Gene regulation
      - Control decisions turn genes on/off
  [Figure: Gene Regulation Network]


Talk Overview
  - Overview of talk
      - Motivation
      - Technique
      - Experiment
      - Related work
      - Conclusions


Technique
  - Motifs upstream of genes regulate gene expression
      - Motifs are sites of regulatory activity
  - Identify regulatory motifs by combining
      - Gene expression data
      - Detection of common motifs occurring upstream of genes
  - Huge datasets
      - Utilise parallel computing


Technique (continued)
  - gRNA
      - Java development framework
  - gMP
      - Java communication library
  - REDUCE
      - Algorithm to identify regulatory motifs
  - REDUCE parallelised with gMP
      - Increase computing power
      - Get motifs ranked by statistical significance


gRNA framework
  - Consists of APIs


gRNA - APIs
  - Interact with data sources
  - Provide functionality from biology
  - Pipeline tasks into a unified process
  - Repository of resources
  - Distributed programming


gRNA environment
  - gRNA Grid
      - Clustered computing environment
  - Application written for gRNA
      - Multiple-tier application
      - Applications operate from a client computer
      - Client communicates with the cluster through a single computer
          - Hosts the EJB server
          - The server identifies processing nodes; each of these performs tasks


gRNA Grid
  [Figure: gRNA Grid]


gMP
  - Java-based message passing tool
  - Built on top of sockets
  - Manages virtual processors to run on available machines
  - Scalable
      - Machines added/removed easily


gMP (continued)
  - Processes are grouped
  - Communication primitives provided for sending and receiving data
  - Collective communication to several nodes enabled modularly and efficiently
  - Enables functions to be implemented on data


REDUCE algorithm
  - Based on a model
      - Upstream motifs contribute additively to the expression level of each gene
  - Quantify the extent to which these motifs contribute to expression data
  - Fit log of expression ratio to a sum of activating and inhibitory terms
  - Find the statistically most significant motifs
  - Plots of fitting parameters suggest biological function


REDUCE algorithm (continued)
  - Terms
      - Occurrence vector
          - Measure of how often a motif is found
      - Expression vector
          - Measure of gene expression


REDUCE method
  - Consists of
      1) Motif frequency counter
          - Counts occurrences of DNA motifs upstream of each ORF
          - Motifs are about 7~11 nucleotides in length
          - Yields the occurrence vectors


REDUCE algorithm (continued)
      2) Significant motif finder
          - Uses
              i)  Normalised occurrence vector made for each motif: n_μ
              ii) Normalised vector of logs of gene expression ratios: a
          - Take the dot product of these, (a . n_μ), and square it
              - Can be considered as frequency of occurrence × expressive power of the regulatory motif
              - It is squared to get rid of negatives
          - Correlates gene expression with occurrence of the motif
          - The largest squared dot product marks the most significant motif


REDUCE algorithm (continued)
  - a is modified to remove the effect of this motif
      - Gives the residual gene expression vector
  - Process repeated until motifs are ranked


Table: Finding significant motifs
  - Uses a = (.5816, .2522, .2886, -.5947, -.1595, -.3683)


REDUCE parallelised with gMP
  - Parallel motif frequency counter
      - Split the set of ORFs equally
      - Distribute across available nodes
      - Each node calculates in parallel to get occurrence vectors
  - Matrix transposition
      - Occurrence vectors are scattered across nodes
          - Each node holds only its own ORFs, so it can calculate only a fraction of the occurrence frequencies for all motifs
          - But the entire occurrence vector of a motif is needed
      - Advantageous to store each vector in a single node
      - Transpose the motif frequency matrix


REDUCE parallelised with gMP (continued)
  - Parallel significant motif finder
      - Normalises occurrence vectors within each node
      - At each node, the most significant motif is calculated
      - The global most significant motif is calculated
      - Process iterated to rank occurrence vectors
  - Interface in gRNA allows ease of implementation


Experiment
  - Uses a Compaq Alpha system
      - Consists of a cluster of 8 AlphaServer SC/ES45
      - Connected by a high-speed Alpha SC 16-Port switch and ELAN PCI adapter cards
      - Each server contains 4 Alpha EV68 processors


Results
  - Uses 7090 gene expressions of yeast
  - ORFs of length 600
  - Motifs up to length 7
  - Throughput (in MBytes/s) also shown
  - 20 most significant motifs computed


Analysis
  - Runtime scales well with the number of processing nodes
      - Frequency counter scales perfectly
      - Motif finder also scales
  - Cannot achieve perfect scaling because of communication overhead


Related work
  - DiscoveryLink
      - Provides configurable wrappers as interfaces to multiple data sources
  - Kleisli system
      - Systematically manages and integrates external databases
      - Uses a functional query language to perform correlation across databases
  - Toolkits designed with functionality for specialised areas
      - BioJava, BioPerl, PAL
      - Sequence analysis
  - Ensembl initiative, DAS
      - Provide an extensible approach to the issue of annotating genomic data


Related work (continued)
  - Previous approaches using Java for high performance computing
      - Bindings into native message-passing APIs (e.g. MPI)
          - Do not allow easy integration into larger Java applications
      - Pure Java message passing interfaces
          - JMPI, CCJ
          - Both implemented on top of Java RMI
              - Slower than using raw sockets
          - CCJ tries to overcome this
              - Optimised RMI implementation
              - Not portable
          - Neither can handle integration
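
Worked sketch: significant motif finder

The significant motif finder described in the REDUCE slides (normalise, take the squared dot product, pick the largest, remove its effect from a, repeat) can be sketched serially as below. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, "normalised" is assumed to mean mean-centred and scaled to unit length (consistent with the example vector a on the table slide), and the residual is assumed to be formed by subtracting the projection of a onto the chosen motif vector.

```python
import math


def normalise(vector):
    """Mean-centre a vector and scale it to unit length.

    Returns None for a constant vector, which carries no information.
    (Assumed meaning of "normalised"; the slides do not spell it out.)
    """
    mean = sum(vector) / len(vector)
    centred = [x - mean for x in vector]
    length = math.sqrt(sum(x * x for x in centred))
    if length == 0.0:
        return None
    return [x / length for x in centred]


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(u, v))


def rank_motifs(occurrence, expression, top=20):
    """Greedily rank motifs by the squared dot product (a . n_mu)^2.

    occurrence: dict mapping motif -> list of counts, one per gene
    expression: list of log expression ratios, one per gene
    Returns a list of (motif, score) pairs, most significant first.
    """
    a = normalise(expression)
    if a is None:
        raise ValueError("expression vector is constant")

    vectors = {}
    for motif, counts in occurrence.items():
        n = normalise(counts)
        if n is not None:
            vectors[motif] = n

    ranked = []
    while vectors and len(ranked) < top:
        # Most significant motif: largest squared dot product with a.
        best = max(vectors, key=lambda m: dot(a, vectors[m]) ** 2)
        n_best = vectors.pop(best)
        projection = dot(a, n_best)
        ranked.append((best, projection ** 2))
        # Assumed residual step: remove this motif's contribution from a.
        a = [x - projection * y for x, y in zip(a, n_best)]
    return ranked


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    toy_occurrence = {"ACG": [2, 1, 0, 2], "TTT": [0, 2, 0, 0], "GGC": [0, 0, 1, 0]}
    toy_expression = [1.2, -0.4, -1.0, 0.9]
    for motif, score in rank_motifs(toy_occurrence, toy_expression):
        print(motif, round(score, 4))
```

Because a is not rescaled after each step, each score can be read as the share of the original expression vector's squared length attributed to that motif. The default of 20 mirrors the "20 most significant motifs" reported on the results slide.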
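
Worked sketch: parallel frequency counter and transposition

The parallel motif frequency counter and the matrix transposition from the "REDUCE parallelised with gMP" slide can be illustrated on a single machine as follows. This is only a sketch under stated assumptions: Python threads stand in for gMP nodes, no gMP or gRNA API is used or implied, and all function names are hypothetical. Each worker counts every motif for its own share of the ORFs (rows of the motif frequency matrix), and the transpose then gathers one complete occurrence vector per motif.

```python
from concurrent.futures import ThreadPoolExecutor


def count_motif(motif, sequence):
    """Count occurrences of motif in sequence, overlaps included."""
    k = len(motif)
    return sum(1 for i in range(len(sequence) - k + 1) if sequence[i:i + k] == motif)


def count_chunk(chunk, motifs):
    """One worker's share: a row of motif counts for each ORF in its chunk."""
    return [[count_motif(m, seq) for m in motifs] for seq in chunk]


def split_evenly(items, parts):
    """Split items into contiguous chunks whose sizes differ by at most one."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for p in range(parts):
        end = start + size + (1 if p < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def parallel_occurrence_vectors(sequences, motifs, workers=4):
    """Return a dict mapping motif -> occurrence vector over all ORFs."""
    chunks = split_evenly(sequences, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves chunk order, so ORF order is preserved too.
        blocks = list(pool.map(lambda c: count_chunk(c, motifs), chunks))

    # ORF-major matrix: one row per ORF, one column per motif.
    rows = [row for block in blocks for row in block]
    # Transpose so that each motif's full occurrence vector sits together.
    columns = list(zip(*rows))
    return {m: list(col) for m, col in zip(motifs, columns)}


if __name__ == "__main__":
    # Toy upstream sequences, invented purely for illustration.
    upstream = ["ACGTACGT", "TTTTACGA", "GGGGCCCC", "ACGACGAC"]
    print(parallel_occurrence_vectors(upstream, ["ACG", "TTT", "GGC"], workers=2))
```

On a real cluster the transpose is a communication step between nodes rather than a local `zip`, which is one place where the communication overhead noted on the analysis slide would arise.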