Introduction to Research 2011
Ashok Srinivasan
Florida State University
www.cs.fsu.edu/~asriniva

Images from ORNL, IBM, NVIDIA
  Part of the machine room at ORNL
  The Cell processor powers the Roadrunner at LANL
  NVIDIA GPUs power Tianhe-1A in China

Outline
  Research
  High Performance Computing
  Applications and Software
  Multicore processors
  Massively parallel processors
  Computational nanotechnology
  Simulation-based policy making
  Potential Research Topics

Research Areas
  High Performance Computing, Applications in Computational Sciences, Scalable Algorithms, Mathematical Software
  Current topics: Computational Nanotechnology, HPC on Multicore Processors, Massively Parallel Applications
  New topics: Simulation-based policy analysis
  Old topics: Computational Finance, Parallel Random Number Generation, Monte Carlo Linear Algebra, Computational Fluid Dynamics, Image Compression

Importance of Supercomputing
  Fundamental scientific understanding
    Nano-materials, drug design
  Solution of bigger problems
    Climate modeling
  More accurate solutions
    Automobile crash tests
  Solutions with time constraints
    Disaster mitigation
  Study of complex interactions for policy decisions
    Urban planning

Some Applications
  Increasing relevance to industry
    In 1993, fewer than 30% of the top 500 supercomputers were commercial; now 57% are commercial
  A variety of application areas
    Commercial
      Finance and insurance
      Medicine
      Aerospace and Automobiles
      Telecom
      Oil exploration
      Shoes! (Nike)
      Potato chips!
      Toys!
    Scientific
      Weather prediction
      Earthquake modeling
      Epidemic modeling
      Materials
      Energy
      Computational biology
      Astro-physics

Supercomputing Power
  The amount of parallelism is also increasing, with the high end having over 200,000 cores

Geographic Distribution
  North America has over half of the top 500 systems
  However, Europe and East Asia also have a significant share
  China is determined to be a supercomputing superpower
    Two of its national supercomputing centers have top-five supercomputers
  Japan has the top machine and two in the top five
    Planning a $1.3 billion exascale supercomputer in 2020

Asian Supercomputing Trends

Challenges in Supercomputing
  Hardware can be obtained with enough money
  But obtaining good performance on large systems is difficult
    Some DOE applications ran at 1% efficiency on 10,000 cores
    Applications will have to deal with a million threads soon, and with a billion at the exa-scale
  Don’t think of supercomputing as a means of solving current problems faster, but as a means of solving problems we earlier thought we could not solve
  Development of software tools to make use of the machines easier

Architectural Trends
  Massive parallelism
    10K processor systems will be commonplace
    High end already has over 500K processors
  Single chip multiprocessing
    All processors will be multicore
  Heterogeneous multicore processors
    Cell used in the PS3
    GPGPU
  80-core processor from Intel
  Processors with hundreds of cores are already commercially available
  Distributed environments, such as the Grid
  But it is hard to get good performance on these systems

Accelerating Applications with GPUs
  Over a hundred cores per GPU
  Hide memory latency with thousands of threads
  Can accelerate a traditional computer to a teraflop
  GPU cluster at FSU
  Quantum Monte Carlo applications
  Algorithms
    Linear algebra, FFT, compression, etc.

Small Discrete Fourier Transforms (DFT) on GPUs
  GPUs are effective for large DFTs, but not for small DFTs
  However, they can be effective for a large number of small DFTs
    Useful for AFQMC
  We use the asymptotically slow matrix-multiplication based DFT for very small sizes
  We combine it with mixed-radix for larger sizes
  We use asynchronous memory transfer to deal with host-device data transfer overhead

Comparison of DFT Performance
  Comparison of 512 simultaneous DFTs without host-device data transfer
    2-D DFTs
    3-D DFTs

Petascale Quantum Monte Carlo
  Originally a DOE funded project involving collaboration between ORNL, UIUC, Cornell, UTK, CWM, and NCSU
  Now funded by ORAU/ORNL
  Scale Quantum Monte Carlo applications to petascale (one million gigaflops) machines
  Load balancing, fault tolerance, other optimizations

Load Balancing
  In current implementations, such as QWalk and QMCPack, cores send excess walkers to cores with fewer walkers
  In the new algorithm (alias method), cores may send more than their excess, and may receive walkers even if they originally had an excess
  Load can be balanced with each core receiving from at most one other core
  Also optimal in the maximum number of walkers received
  Total number of walkers sent may be twice the optimal

Performance Comparison
  Mean number of walkers migrated
  Maximum number of receives
  Comparisons with QWalk

Process-Node Affinity
  Node allocation is not necessarily ideal for minimizing communication
  Process-node affinity can, therefore, be important
  Allocated nodes for a 12,000 core run on Jaguar

Load Balancing with Affinity
  Renumbering the nodes improves load balancing and AllGather time
    Basic load balancing
    Load balancing after renumbering
  Results on Jaguar

Potential Research Topics
  High Performance Computing on Multicore Processors
  Algorithms, Applications, and
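
Sketch: matrix-multiplication based DFT for many small transforms

The slide on small DFTs mentions using the asymptotically slow matrix-multiplication based DFT for very small sizes, applied to a large number of transforms at once. The following is only a minimal CPU sketch of that idea in plain Python, not the GPU implementation from the talk. The function names dft_matrix and batched_dft are hypothetical, and the forward sign convention exp(-2*pi*i*j*k/n) is an assumption. The same n-by-n matrix is built once and reused for every signal in the batch, which is what makes the O(n^2) approach reasonable when n is tiny and the batch is large.

```python
import cmath


def dft_matrix(n):
    """Build the n-by-n DFT matrix F with F[j][k] = exp(-2*pi*i*j*k/n).

    The negative exponent (forward transform) is an assumed convention.
    """
    return [[cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n)]
            for j in range(n)]


def batched_dft(signals):
    """Apply the same small DFT to every signal in a batch.

    Each transform is a dense matrix-vector product, costing O(n^2)
    per signal. The matrix is computed once and shared by the batch.
    """
    n = len(signals[0])
    f = dft_matrix(n)
    return [[sum(f[j][k] * x[k] for k in range(n)) for j in range(n)]
            for x in signals]


if __name__ == "__main__":
    batch = [[1, 0, 0, 0], [1, 1, 1, 1]]
    for spectrum in batched_dft(batch):
        print([round(abs(v), 6) for v in spectrum])
```

On a GPU, the batch would typically be stacked into one matrix so that all transforms become a single matrix-matrix product; that step is omitted here.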
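
Sketch: alias-style load balancing of walkers

The Load Balancing slide states that, in the alias method, cores may send more than their excess, may receive walkers even if they started with an excess, and receive from at most one other core. The sketch below illustrates one way such a schedule can be built, following the pairing step of the classical alias-table construction. It is an illustration under stated assumptions, not the algorithm as implemented in the talk: the function name alias_balance is hypothetical, the total number of walkers is assumed to divide evenly among the cores, and no MPI communication is shown.

```python
def alias_balance(loads):
    """Return a list of (sender, receiver, count) walker transfers.

    Assumes sum(loads) is divisible by len(loads). Every sender gives
    away only walkers it held originally, so all transfers can happen
    in one round, and every core receives from at most one other core.
    """
    n = len(loads)
    total = sum(loads)
    if total % n != 0:
        raise ValueError("this sketch assumes an evenly divisible total")
    target = total // n

    # remaining[i] = walkers of core i's original set not yet given away
    remaining = list(loads)
    small = [i for i in range(n) if remaining[i] < target]
    large = [i for i in range(n) if remaining[i] > target]

    transfers = []
    while small:
        s = small.pop()   # core below target: filled up once, then final
        l = large.pop()   # core above target: supplies the whole deficit
        amount = target - remaining[s]
        transfers.append((l, s, amount))
        remaining[l] -= amount
        # The sender may drop below target; it then becomes a receiver.
        if remaining[l] < target:
            small.append(l)
        elif remaining[l] > target:
            large.append(l)
    return transfers


if __name__ == "__main__":
    print(alias_balance([5, 4, 0]))
```

With the loads [5, 4, 0], core 1 starts one walker above the target of three, yet it sends three walkers to core 2 and then receives two from core 0. Five walkers move in total, where three would suffice, which is consistent with the slide's remark that the total sent may be up to twice the optimal.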