ME964 High Performance Computing for Engineering Applications

Outline
- ME964 High Performance Computing for Engineering Applications
- Before we get started…
- Parallel Programming Support (non-GPU)
- GPU Parallel Programming Support
- Why Dedicate So Much Time to GPU?
- GPU vs. CPU Flop Rate Comparison
- Fastest Computers in the World
- IBM BlueGene/L
- Why GPU?
- Four GPU Lab Desktop
- End: Overview of H&S for parallel computing / Beginning: GPU Computing, CUDA Programming Model
- Acknowledgements
- Layout of Typical Hardware Architecture
- GPU Computing – The Basic Idea
- What is GPGPU?
- Why GPU for Computing?
- GPGPU Constraints
- CUDA
- Parallel Computing on a GPU
- Running Code on Parallel Computers
- The CUDA Way: Extended C
- Compiling CUDA
- More on the nvcc Compiler
- Compiling CUDA Extended C
- Begin: Memory Layout on the GPU
- CUDA Programming Model: A Highly Multithreaded Coprocessor
- Before We Dive In: An Overview of the GeForce-8 Series HW
- GPU Processor Terminology
- Execution Configuration: Grids and Blocks

ME964 High Performance Computing for Engineering Applications
C Programming Intro, Wrap-up, and a Brief Overview of Parallel Computing
Sept. 11, 2008

Before we get started…
- Last time:
  - Wrapped up the C programming intro
  - Overview of parallel computing:
    - The sequential vs. parallel paradigm in computing
    - The hardware element of the parallel computing equation
    - The software element of the parallel computing equation
- Today:
  - HW1 due at 11:59 PM
  - HW2 assigned (available on the website)
  - Start the discussion of the GPU as our vehicle for parallel computing:
    - Trace back the evolution of the GPU
    - Memory layout, NVIDIA GPU
    - The CUDA API

Parallel Programming Support (non-GPU)
- Message Passing Interface (MPI): originally aimed at distributed-memory architectures, now very effective on shared memory as well
- OpenMP
- Threads:
  - Pthreads (the "P" comes from POSIX)
  - Cell threads
- Parallel libraries:
  - Intel's Threading Building Blocks (TBB) - mature
  - Microsoft's Task Parallel Library - mature
  - SWARM (Georgia Tech) - small scope
  - Charm++ (UIUC) - growing, undergoing effort
  - STAPL (Standard Template Adaptive Parallel Library, B. Stroustrup, Texas A&M) - undergoing effort
(Slide source: John Cavazos)

GPU Parallel Programming Support
- CUDA (NVIDIA): C/C++ extensions
- Brook (Stanford): relies on language extensions; draws on OpenGL v1.3+, DirectX v9+, or AMD's Close to Metal for the computational backend; runs on Windows and Linux
- Brook+ (AMD/ATI): AMD-enhanced implementation of Brook
- Sh (Waterloo): now RapidMind, a commercial venture; library and language extensions; works on multicores as well
- PeakStream: now defunct, acquired by Google in June 2007

Why Dedicate So Much Time to GPU?
- It's fast for a variety of jobs:
  - Really good for data parallelism (requires SIMD)
  - Bad for task parallelism (requires MIMD)
- It's cheap to get one ($120 to $480)
- It's everywhere, so there is an incentive to produce software: there are many potential users of it…

GPU vs. CPU Flop Rate Comparison
(figure: flop-rate comparison chart; no text survived the extraction)

Fastest Computers in the World
(figure: Top 500 list excerpt; no text survived the extraction)

IBM BlueGene/L
- Entry model: 1024 dual-core nodes
- 5.7 Tflop (compare to 0.9 Tflop for the GPU)
- Tweaked OS
- Dedicated power-management solution
- Dedicated IT support
- Rather limited options for productivity tools (debugging, profiling, etc.): TotalView
- Price (2007): $1.4 million
(figure caption: 445-teraflops Blue Gene/P, Argonne National Lab)

Why GPU?
- Commodity high-performance computing solution for fine-grain parallelism
- I am not going to be able to afford a BlueGene for my lab to run collision detection, sort large arrays, run prefix-sum jobs, etc.
- The MIMD hardware is suited for coarse-grain parallelism, which is important for a complementary class of applications
- More than 60 million NVIDIA GPU cards are in use today
- The only other alternative that looks very attractive is Cell computing:
  - A Sony-Toshiba-IBM joint venture for video gaming
  - See "A Rough Guide to Scientific Computing On the PlayStation 3," Technical Report UT-CS-07-595, Version 1.0, by Alfredo Buttari, Piotr Luszczek, Jakub Kurzak, Jack Dongarra, and George Bosilca

Four GPU Lab Desktop
- Price: $5,000 (compare to $1.4 million for the IBM machine)
- Would have made the 2005 list of the 500 fastest supercomputers in the world
- Assembled by a former undergraduate student (your grader…)
- Theoretical peak flop rate: 1.4 Tflop

End: Overview of H&S for parallel computing
Beginning: GPU Computing, CUDA Programming Model

Acknowledgements
- Many slides herein include material developed at the University of Illinois at Urbana-Champaign by Professor W. Hwu and Adjunct Professor David Kirk (the latter also the Chief Scientist at NVIDIA)
- The material they developed was presented as part of the Electrical and Computer Engineering class ECE498 in the Fall of 2007
- The slides are used with the permission of the authors, which is gratefully acknowledged
- Slides that include material produced by Professors Hwu and Kirk carry an HK-UIUC logo in the lower-left corner

Layout of Typical Hardware Architecture [HK-UIUC]
(figure: CPU (host) connected to GPU with local DRAM (device))

GPU Computing – The Basic Idea
- Conceptually, the idea is simple:
  - The GPU is connected to the CPU by a reasonably fast bus (8.5 GB/s is typical today)
  - The idea is to use the GPU as a co-processor:
    - Farm out big parallelizable tasks to the GPU
    - Keep the CPU busy with the control of the execution and "corner" tasks
  - You have to copy data to the GPU and fetch the results back.
  - This is fine as long as the data transfer is overshadowed by the number crunching done with that data (remember Amdahl's law…)
- The GPU can be used for more than just generating graphics: the computational resources are there, and most of the time they are underutilized
- Welcome to General-Purpose GPU (GPGPU) computing

What is GPGPU? [HK-UIUC]
- General-purpose computation using the GPU in applications other than 3D graphics
  - The GPU accelerates the critical path of the application
- Data-parallel algorithms leverage GPU attributes:
  - Large data arrays, streaming throughput
  - Fine-grain SIMD parallelism
  - Low-latency floating-point (FP) computation
- Applications (see GPGPU.org):
  - Game effects, image processing
  - Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting

Why GPU for Computing?
- The GPU is fast:
  - Massively parallel:
    - CPU: ~4 cores @ 3.0 GHz (Intel Quad Core)
    - GPU: ~240 cores @ 0.6 GHz (NVIDIA GTX 280); 0.6 GHz is the core clock, while the shader clock runs at 1.3 GHz
  - High memory bandwidth:
    - CPU: 21 GB/s
    - GPU: 141.7 GB/s
- Simple architecture optimized for compute-intensive tasks
- Programmable