UW-Madison ME 964 - Supercomputing for the Masses - Killer-Apps, Parallel Mappings, Scalability and Application Lifespan

Supercomputing for the Masses: Killer-Apps, Parallel Mappings, Scalability and Application Lifespan
Rob Farber, Senior Scientist (PNNL) -> Visiting Scientist, ICHEC

Killer Apps
• Killer apps occur when personal vision matches technical capability to fulfill a market demand.
• Graphics processors and games: killer apps that created a huge market.

Technical capability
• Market forces evolved GPUs into massively parallel GPGPUs (General-Purpose GPUs).
• 250+ million (a quarter billion) CUDA-enabled GPUs says it all!
• CUDA put supercomputing in the hands of the masses.
  o December 1996: ASCI Red, the first teraflop supercomputer.
  o Today: kids buy GPUs with flop rates comparable to systems available to scientists with supercomputer access in the mid to late 1990s.
• GeForce GTX 480: 1.35 TF/s peak (32-bit).
• Newegg.com: $299.

Remember that Finnish kid who wrote some software to understand operating systems? Inexpensive commodity hardware enables:
• New thinking
• A large, educated base of developers
A perfect storm of opportunities and technology.
(Summary of Farber, Scientific Computing, "Realizing the Benefits of Affordable Teraflop-capable Hardware")
• Multi-threaded software is a must-have because manufacturers were forced to move to multi-core CPUs.
  o The end of Dennard scaling meant processor manufacturers had to add cores to increase performance and entice customers.
• Multi-core is disruptive to single-threaded legacy apps.
  o Businesses and research efforts will not benefit from new hardware unless they invest in multi-threaded software.
  o Lack of investment risks stagnation and losing to the competition.
• Competition is fierce, and the new technology is readily available and inexpensive!
  o Which software and models? Look to those that are widely adopted and have withstood the test of time.
  o Look at CUDA and the CUDA model.

CUDA is not the only game in town (but it will be a focus of this talk)
• Android/iPhone: mobile is huge (2008, 2009).
  o 55 weeks: 3.5M downloads, over $100K net.
• The technical capability is there ... you supply the vision!
• Augmented reality: Jen-Hsun with RTT at the 2009 GTC.

CUDA is a game changer!
• CUDA enables orders-of-magnitude faster apps:
  o 10x can make computational workflows more interactive (even poorly performing GPU apps are useful).
  o 100x is disruptive and has the potential to fundamentally affect scientific research by removing time-to-discovery barriers.
  o 1000x and greater can be achieved through the use of the NVIDIA SFUs (Special Function Units) or multiple GPUs ... Whooo Hoooo!
• In a few slides: examine CUDA + graphics = Wow!

CUDA was adopted amazingly fast!
• February 2007: the initial CUDA SDK was made public.
• Now: CUDA-based GPU computing is part of the curriculum at over 360 universities.
  o MIT, Harvard, Cambridge, Oxford, the Indian Institutes of Technology, National Taiwan University, and the Chinese Academy of Sciences.

The numbers have changed!
http://e-ditionsbyfry.com/Olive/ODE/SCISupp/Default.aspx?href=SCI/2010/11/03&pageno=06&view=document

Application speed says it all!
(The 100 fastest apps in the NVIDIA Showcase, Sept. 8, 2010)
[Chart: speedup in orders of magnitude, ranked by project from best to worst]
• Max: 2600x
• Median: 253x
• Min: 98x
URL: http://www.nvidia.com/object/cuda_apps_flash_new.html (click on "Sort by Speed Up")

Orders of magnitude increased performance in an extraordinary number of fields
• Spanning a wide range of computational, data-driven, and real-time applications:
  o Computational finance
  o Medical
  o Quantum chemistry simulations
  o Molecular modeling and electrostatic potentials
  o Diffusion
  o Fluid flow
  o Systems of differential equations
  o Data-driven problems such as microscopy
• Many can be considered killer apps in their field.

An example: the Metropolis algorithm (300x - 1000x)
• Among the ten algorithms that have had the greatest influence on the development and practice of science and engineering in the 20th century (Beichl, Sullivan, 2000).
• Plays a significant role in statistics, econometrics, physics, and computing science.
  o For some applications, MCMC simulation is the only known general approach that provides a solution within a reasonable time (Diaconis, Saloff-Coste, 1995).
• A CUDA version is reported to be 300x to 1000x faster (Alerstam, Svensson, Engels, 2008).

Three rules for fast GPU codes
1. Get the data on the GPU (and keep it there!)
  • PCIe x16 v2.0 bus: 8 GiB/s in a single direction.
  • 20-series GPUs: 140-200 GiB/s of on-board (global) memory bandwidth.
2. Give the GPU enough work to do.
  • Assume 10 µs of launch latency on a 1 TF/s device.
  • Each such stall can waste 10^-5 s × 10^12 flop/s = 10 million potential operations.
3. Reuse and locate data to avoid global memory bandwidth bottlenecks.
  • 10^12 flop/s hardware delivers only about 10^10 flop/s when it is global-memory limited.
  • That can cause a 100x slowdown!
Corollary: avoid malloc/free in performance-critical code (see the sketch below).
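To make the three rules concrete, here is a minimal CUDA sketch (the kernel, names, sizes, and iteration count are illustrative assumptions, not code from the talk): the data is copied to the GPU once, every launch reuses the resident data, and no allocation or free appears inside the loop.

    #include <cuda_runtime.h>
    #include <vector>

    // Rule 2: a simple kernel that gives every thread independent work.
    // (Hypothetical kernel and names, for illustration only.)
    __global__ void scaleKernel(float *d_data, int n, float factor)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            d_data[i] *= factor;
    }

    int main()
    {
        const int n = 1 << 24;                     // ~16M elements (hypothetical size)
        std::vector<float> h_data(n, 1.0f);

        float *d_data = nullptr;
        cudaMalloc(&d_data, n * sizeof(float));    // allocate once, outside any loop
        cudaMemcpy(d_data, h_data.data(), n * sizeof(float),
                   cudaMemcpyHostToDevice);        // Rule 1: move the data to the GPU once

        const int threads = 256;
        const int blocks  = (n + threads - 1) / threads;
        for (int iter = 0; iter < 1000; ++iter)    // Rule 3: every pass reuses the resident data
            scaleKernel<<<blocks, threads>>>(d_data, n, 1.0001f);

        cudaMemcpy(h_data.data(), d_data, n * sizeof(float),
                   cudaMemcpyDeviceToHost);        // copy the result back once at the end
        cudaFree(d_data);                          // corollary: free once, never inside the loop
        return 0;
    }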
Application lifespan
SIMD: a key from the past
• Farber: a general SIMD mapping from the 1980s.
  o Acknowledgements: work performed at or funded by the Santa Fe Institute, the Theoretical Division at Los Alamos National Laboratory, and various NSF, DOE and other funding sources, including the Texas Advanced Computing Center (TACC).
[Chart: observed peak effective rate (TF/s) vs. number of Ranger (Barcelona) cores; results presented at SC09, courtesy TACC]
• This mapping for neural networks on the Connection Machine: "Most efficient implementation to date" (Singer 1990), (Thearling 1995).
• On Ranger: 60,000 cores, 363 TF/s measured; 62,796 cores, 386 TF/s (projected).

The Parallel Mapping
energy = objFunc(p1, p2, ..., pn)
• Step 1: Broadcast the parameters p1, p2, ..., pn from the optimization method (Powell, conjugate gradient, or other) to every GPU.
• Step 2: Each GPU calculates a partial energy over its own block of examples (GPU 1: examples 0 to N-1, GPU 2: N to 2N-1, GPU 3: 2N to 3N-1, GPU 4: 3N to 4N-1).
• Step 3: Sum the partials to get the energy returned to the optimizer.
(A minimal CUDA sketch of this mapping follows the application list below.)

Principal Components Analysis (PCA)
• A widely used technique in data mining and data reduction.
  o Demonstrates a method proposed by Sanger (1989); the update rule is restated at the end of this preview, after the CUDA sketch.
[Diagram: network with input (I), bottleneck (B), and output (O) units]
• Scales according to the data.
• Extends to Nonlinear PCA (NLPCA): E. Oja, J. Karhunen, L. Wang, and R. Vigario, 1995.

This is a general mapping (think of your own applications!)
  o Optimization
  o Locally Weighted Linear Regression (LWLR)
  o Neural Networks
  o Naive Bayes (NB)
  o Gaussian Discriminative Analysis (GDA)
  o k-means
  o Logistic Regression (LR)
  o Independent Component Analysis (ICA)
  o Expectation Maximization (EM)
  o Support Vector Machines (SVM)
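As a concrete illustration of the three-step mapping above (broadcast the parameters, compute partials, sum the partials), here is a minimal single-GPU CUDA sketch. The quadratic least-squares objective, the names objFuncKernel and c_param, and the constant-memory broadcast are illustrative assumptions, not Farber's actual implementation.

    #include <cuda_runtime.h>

    #define NPARAM 3
    __constant__ float c_param[NPARAM];      // Step 1: parameters broadcast to the device
                                             // host side: cudaMemcpyToSymbol(c_param, p, sizeof(float) * NPARAM)

    // Step 2 and 3: each thread scores its example against the broadcast parameters,
    // the block reduces its partials in shared memory, and the block sums are added
    // into the global energy. Assumes blockDim.x is a power of two (e.g., 256) and
    // compute capability 2.0+ for float atomicAdd. (Hypothetical names throughout.)
    __global__ void objFuncKernel(const float *examples, const float *targets,
                                  int nExamples, float *energy)
    {
        extern __shared__ float partial[];   // dynamic shared memory: blockDim.x floats
        int i   = blockIdx.x * blockDim.x + threadIdx.x;
        int tid = threadIdx.x;

        float e = 0.0f;
        if (i < nExamples) {
            float x    = examples[i];
            float pred = c_param[0] + c_param[1] * x + c_param[2] * x * x;
            float diff = pred - targets[i];
            e = diff * diff;                 // this thread's partial energy
        }
        partial[tid] = e;
        __syncthreads();

        for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // Step 3: tree reduction per block
            if (tid < s) partial[tid] += partial[tid + s];
            __syncthreads();
        }
        if (tid == 0) atomicAdd(energy, partial[0]);     // Step 3: accumulate block partials
    }

On the host, the optimizer would broadcast the parameter vector with cudaMemcpyToSymbol (Step 1), zero the energy accumulator, launch the kernel with blockDim.x * sizeof(float) bytes of dynamic shared memory, and read back the summed energy; with multiple GPUs, each device runs the same kernel over its own block of examples and the per-GPU sums are added on the host to complete Step 3.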
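For reference on the PCA slide: Sanger's (1989) generalized Hebbian update for a single-layer linear network is restated here from the literature, not from the slides themselves (eta is the learning rate):

    y_i = \sum_j w_{ij} x_j, \qquad
    \Delta w_{ij} = \eta \, y_i \Bigl( x_j - \sum_{k=1}^{i} w_{kj} \, y_k \Bigr)

Under this rule, output unit i converges toward the i-th principal component of the input data, and the per-example updates fit the same broadcast/partial-sum mapping sketched above.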

