ME964 High Performance Computing for Engineering Applications

Outline
- ME964 High Performance Computing for Engineering Applications
- Before we get started…
- Parallel Programming Support (non-GPU)
- GPU Parallel Programming Support
- Why Dedicate So Much Time to GPU?
- GPU vs. CPU Flop Rate Comparison
- Fastest Computers in the World
- IBM BlueGene/L
- Why GPU?
- Four GPU Lab Desktop
- End: Overview of H&S for parallel computing / Beginning: GPU Computing, CUDA Programming Model
- Acknowledgements
- Layout of Typical Hardware Architecture
- GPU Computing – The Basic Idea
- What is GPGPU?
- Why GPU for Computing?
- GPGPU Constraints
- CUDA
- Parallel Computing on a GPU
- Running Code on Parallel Computers
- The CUDA Way: Extended C
- Compiling CUDA
- More on the nvcc Compiler
- Compiling CUDA Extended C
- Begin: Memory Layout on the GPU
- CUDA Programming Model: A Highly Multithreaded Coprocessor
- Before We Dive In: An Overview of the GeForce-8 Series HW
- GPU Processor Terminology
- Execution Configuration: Grids and Blocks

ME964 High Performance Computing for Engineering Applications
C Programming Intro, Wrap-up, and a Brief Overview of Parallel Computing
Sept. 11, 2008

Before we get started…
- Last time:
  - Wrapped up the C programming intro
  - Overview of parallel computing:
    - The sequential vs. parallel paradigm in computing
    - The hardware element of the parallel computing equation
    - The software element of the parallel computing equation
- Today:
  - HW1 due at 11:59 PM
  - HW2 assigned (available on the website)
  - Start the discussion of the GPU as our vehicle for parallel computing:
    - Trace back the evolution of the GPU
    - Memory layout, NVIDIA GPU
    - The CUDA API

Parallel Programming Support (non-GPU)
- Message Passing Interface (MPI): originally aimed at distributed-memory architectures, now very effective on shared memory as well
- OpenMP
- Threads:
  - Pthreads (the "P" comes from POSIX)
  - Cell threads
- Parallel libraries:
  - Intel's Threading Building Blocks (TBB) - mature
  - Microsoft's Task Parallel Library - mature
  - SWARM (Georgia Tech) - small scope
  - Charm++ (UIUC) - growing, undergoing effort
  - STAPL (Standard Template Adaptive Parallel Library, B. Stroustrup, Texas A&M) - undergoing effort
(Slide source: John Cavazos)

GPU Parallel Programming Support
- CUDA (NVIDIA): C/C++ extensions
- Brook (Stanford): relies on language extensions; draws on OpenGL v1.3+, DirectX v9+, or AMD's Close to Metal for the computational backend; runs on Windows and Linux
- Brook+ (AMD/ATI): AMD-enhanced implementation of Brook
- Sh (Waterloo): now RapidMind, a commercial venture; library and language extensions; works on multicores as well
- PeakStream: now defunct, acquired by Google in June 2007

Why Dedicate So Much Time to GPU?
- It's fast for a variety of jobs:
  - Really good for data parallelism (requires SIMD)
  - Bad for task parallelism (requires MIMD)
- It's cheap to get one ($120 to $480)
- It's everywhere, so there is an incentive to produce software: there are many potential users of it…

GPU vs. CPU Flop Rate Comparison
(figure: flop-rate comparison chart; no text survived the extraction)

Fastest Computers in the World
(figure: Top 500 list excerpt; no text survived the extraction)

IBM BlueGene/L
- Entry model: 1024 dual-core nodes
- 5.7 Tflop (compare to 0.9 Tflop for the GPU)
- Tweaked OS
- Dedicated power-management solution
- Dedicated IT support
- Rather limited options for productivity tools (debugging, profiling, etc.): TotalView
- Price (2007): $1.4 million
(figure caption: 445-teraflops Blue Gene/P, Argonne National Lab)

Why GPU?
- Commodity high-performance computing solution for fine-grain parallelism
- I am not going to be able to afford a BlueGene for my lab to run collision detection, sort large arrays, run prefix-sum jobs, etc.
- The MIMD hardware is suited for coarse-grain parallelism, which is important for a complementary class of applications
- More than 60 million NVIDIA GPU cards are in use today
- The only other alternative that looks very attractive is Cell computing:
  - A Sony-Toshiba-IBM joint venture for video gaming
  - See "A Rough Guide to Scientific Computing On the PlayStation 3," Technical Report UT-CS-07-595, Version 1.0, by Alfredo Buttari, Piotr Luszczek, Jakub Kurzak, Jack Dongarra, and George Bosilca

Four GPU Lab Desktop
- Price: $5,000 (compare to $1.4 million for the IBM machine)
- Would have made the 2005 list of the 500 fastest supercomputers in the world
- Assembled by a former undergraduate student (your grader…)
- Theoretical peak flop rate: 1.4 Tflop

End: Overview of H&S for parallel computing
Beginning: GPU Computing, CUDA Programming Model

Acknowledgements
- Many slides herein include material developed at the University of Illinois at Urbana-Champaign by Professor W. Hwu and Adjunct Professor David Kirk (the latter also the Chief Scientist at NVIDIA)
- The material they developed was presented as part of the Electrical and Computer Engineering class ECE498 in the Fall of 2007
- The slides are used with the permission of the authors, which is gratefully acknowledged
- Slides that include material produced by Professors Hwu and Kirk carry an HK-UIUC logo in the lower-left corner

Layout of Typical Hardware Architecture [HK-UIUC]
(figure: CPU (host) connected to GPU with local DRAM (device))

GPU Computing – The Basic Idea
- Conceptually, the idea is simple:
  - The GPU is connected to the CPU by a reasonably fast bus (8.5 GB/s is typical today)
  - The idea is to use the GPU as a co-processor:
    - Farm out big parallelizable tasks to the GPU
    - Keep the CPU busy with the control of the execution and "corner" tasks
  - You have to copy data to the GPU and fetch the results back.
  - This is fine as long as the data transfer is overshadowed by the number crunching done with that data (remember Amdahl's law…)
- The GPU can be used for more than just generating graphics: the computational resources are there, and most of the time they are underutilized
- Welcome to General-Purpose GPU (GPGPU) computing

What is GPGPU? [HK-UIUC]
- General-purpose computation using the GPU in applications other than 3D graphics
  - The GPU accelerates the critical path of the application
- Data-parallel algorithms leverage GPU attributes:
  - Large data arrays, streaming throughput
  - Fine-grain SIMD parallelism
  - Low-latency floating-point (FP) computation
- Applications (see GPGPU.org):
  - Game effects, image processing
  - Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting

Why GPU for Computing?
- The GPU is fast:
  - Massively parallel:
    - CPU: ~4 cores @ 3.0 GHz (Intel Quad Core)
    - GPU: ~240 cores @ 0.6 GHz (NVIDIA GTX 280); 0.6 GHz is the core clock, while the shader clock runs at 1.3 GHz
  - High memory bandwidth:
    - CPU: 21 GB/s
    - GPU: 141.7 GB/s
- Simple architecture optimized for compute-intensive tasks
- Programmable