SBU CSE 591 - GPU Programming Introduction


CSE 591: GPU Programming
Introduction
Klaus Mueller
Computer Science Department
Stony Brook University

Entertainment Graphics: Virtual Realism for the Masses
Computer games need to have:
• realistic appearance of characters and objects
• believable and creative shading, textures, surface details
• realistic physics and kinematics
• effects need to be customizable and interactive

High Performance Computing on the Desktop
PC graphics boards featuring GPUs:
• NVIDIA FX, ATI Radeon
• available at every computer store for less than $500
• set up your PC in less than an hour and play
The latest board: NVIDIA GeForce GTX 280

"Just" Computing
Compute-only (no graphics): NVIDIA Tesla 1060
• true GPGPU (General Purpose computing using GPU technology)
• 4 GB memory per card
• bundle up to 4 cards: 960 processors, 16 GB memory

Incredible Growth
The performance gap between GPU and CPU is growing
• currently 1-2 orders of magnitude are achievable (given appropriate programming and problem decomposition)
• 1.3 TFLOPS for about $500 (GTX 480, March 2010)

History: Accelerated Graphics
1990s: rise (and fall) of Silicon Graphics (SGI)
• in 1991 pioneers the first graphics accelerator
• developed OpenGL
• evolving 2D to 3D graphics capabilities through the mid-90s
• desktop to high-end performance workstations (O2, Octane, Onyx)
• #1: expensive
• #2: non-programmable

The Graphics Pipeline
Old-style, non-programmable:
(pipeline diagram: a stream of (X,Y,Z,W) vertices plus textures is processed by fixed-function stages into a stream of (R,G,B,A) pixels)

History: Cheap Consumer Graphics
Late 1990s: rise of consumer graphics chips
• 1994: Voodoo graphics chip (and later the popular Voodoo 2)
• chips still separate from memory
• other chips soon emerge: ATI Rage, NVIDIA Riva
End of the 1990s: consumer graphics boards with high-end graphics
• transform, lighting, setup, and rendering all on a single GPU
• the world's first GPU: NVIDIA GeForce 256 (NV10)
• → #1: inexpensive
• #2: non-programmable

History: Programmability, GPGPU
2000s: emergence of programmable consumer graphics hardware
• programmable vertex and fragment shaders
• graphics cards: NVIDIA GeForce 3, ATI Radeon 9700
• evolving capabilities for floating point, loops, ifs
• enabled GPGPU
• HW programming languages: Cg, GLSL, HLSL
• SW graphics APIs: OpenGL, DirectX
• → #1: inexpensive
• → #2: programmable

The Graphics Pipeline
Modern, programmable:
(pipeline diagram: the same stream of (X,Y,Z,W) vertices plus textures to a stream of (R,G,B,A) pixels, but the vertex and fragment stages are now programmable)

History: Focus Parallel Computing
2006: parallel computing languages appear
• address the need for a dedicated SDK and API for parallel high performance computing (GPGPU)
• CUDA (Compute Unified Device Architecture)
  - developed by NVIDIA
• OpenCL (Open Computing Language)
  - initially developed by Apple
  - now with the Khronos Compute Working Group
• specific GPGPU boards: NVIDIA Tesla, AMD FireStream
• other parallel-computing chips: Intel Larrabee, IBM Cell BE

Hardware Architecture
GPU:
• compute-intensive
• highly data parallel
• original SIMD
Programming language:
• exposes the parallel capabilities of GPUs (see the kernel sketch below)
(ALU: arithmetic logic unit)
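To make the last bullet concrete, here is a minimal, hedged CUDA sketch of the data-parallel style the slides describe, using vector addition as the example problem; the names vecAdd, ha/da, and the sizes are illustrative and not from the lecture. One thread is launched per array element, and the explicit cudaMalloc/cudaMemcpy calls correspond to the global (device) memory that appears in the architecture slides later on.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// One thread per element: the data-parallel style the slides describe.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)                                       // guard the last partial block
        c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;                  // 1M elements (arbitrary size)
    size_t bytes = n * sizeof(float);

    // Host data
    float *ha = (float*)malloc(bytes), *hb = (float*)malloc(bytes), *hc = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    // Device (global) memory: the DRAM in the architecture slides
    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    int block = 256;                        // threads per block
    int grid  = (n + block - 1) / block;    // enough blocks to cover every element
    vecAdd<<<grid, block>>>(da, db, dc, n); // launches ~1M threads
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %f\n", hc[0]);           // expect 3.0
    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}

The launch configuration (a grid of blocks, each with a fixed number of threads) is how the programmer hands the hardware enough independent work to fill its multiprocessors.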
GPU Vital Specs
                         GeForce 8800 GTX      GeForce GTX 280        GeForce GTX 480
Codename                 G80                   D10U-23                GF100
Release date             11/2006               6/2008                 3/2010
Transistors              681 M (90 nm)         1400 M (65 nm)         3000 M (40 nm)
Clock speed              1350 MHz              1296 MHz               1401 MHz
Processors               128                   240                    480
Peak pixel fill rate     13.8 Gigapixels/s     19.3 Gigapixels/s      33.6 Gigapixels/s
Peak memory bandwidth    86.4 GB/s (384-bit)   141.7 GB/s (512-bit)   177.4 GB/s (384-bit)
Memory                   768 MB                1024 MB                1536 MB
Peak performance         520 Gigaflops         933 Gigaflops          1.3 Teraflops

Comparison with CPUs
                              Intel Core 2 Quad   GeForce GTX 280   GeForce GTX 480
Cores                         4                   8 x 30            32 x 15
SIMD width / core             4                   32                32
Managed / executed threads    N/A                 32                48
Clock speed                   3 GHz               1.3 GHz           1.4 GHz
Performance                   96 Gigaflops        933 Gigaflops     1345 Gigaflops

GPU vs. CPU
Highly parallel
• the GeForce 8800 has 128 processors (128-way parallel)
Memory is very close to the processors → fast data transfer
• a CPU requires lots of cache logic and communication
A high percentage of GPU chip real-estate is devoted to computing
• it is small in CPUs (for example, 6.5% in the Intel Itanium)
In many cases speedups of 1-2 orders of magnitude can be obtained by porting to the GPU
• more details on the rules for effective porting later

GPU Architecture: Overview
128 processors → 8 multi-processors of 16 processors each
• local cache L1 (4K)
• shared cache L2 (1M)
• DRAM (global memory)
Memory management is key! (see the shared-memory sketch at the end of these notes)
Thread management is key!

GPU Architecture: Different View
• each multiprocessor is a SIMD (Single Instruction, Multiple Data) architecture
• the (multiprocessor-level) shared Constant Cache and Texture Cache are read-only
• the (device-level, shared) Device Memory (Global Memory) has read-write access
• each multiprocessor is equipped with a set of local 32-bit registers (L1 and L2 caches)

History: Focus "Serious" Computing
2009: next-generation CUDA architectures announced
• NVIDIA Fermi, AMD Cypress
• substrate for supercomputing
• focused on "serious" high performance computing (clusters, etc.)
Enrico Fermi (1901-1954)
• Italian physicist
• one of the top scientists of the 20th century
• developed the first nuclear reactor
• contributed to quantum theory, statistical mechanics, nuclear and particle physics
• Nobel Prize in Physics in 1938 for his work on induced radioactivity

GPU Specifics
All standard graphics ops are hardwired
• linear interpolations
• matrix and vector arithmetic (+, -, *)
Arithmetic intensity
• the ratio of ALU arithmetic per operand fetched
• needs to be reasonably high, else the application is memory-bound
• (a short sketch after the Latency Hiding notes makes this ratio concrete)
GPU memory is 1-2 orders of magnitude slower than the GPU processors
• computation is often better than table look-ups
• indirections can be expensive
Be aware of the GPU's 2D caching protocol (for texture memory)
• data is fetched in 2D tiles (recall bilinear texture filtering in graphics)
• promote data locality in 2D tiles

Latency Hiding
GPUs provide hardware multi-threading (see the sketch below)
• it kicks in when a thread within a core's ALU stalls (waiting for memory, etc.)
• then another SIMD thread is swapped in for execution
• this hides the latency for the stalled thread
• the GPU allows many more threads to be maintained than are SIMD-executed
Hardware multi-threading requires memory
• the contexts of all such threads must be maintained in registers or memory
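As a hedged sketch of why this matters to the programmer (the kernel name scale, the array size, and the use of device 0 are assumptions for illustration, not from the lecture), the code below launches millions of threads on a GPU that has only a few hundred processors. That oversubscription is exactly what gives the hardware scheduler other warps to run while one warp waits on its global-memory load.

#include <cstdio>
#include <cuda_runtime.h>

// Each thread performs one high-latency global load followed by a little arithmetic.
// While a warp waits for its load, the multiprocessor switches to another resident warp.
__global__ void scale(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[i];              // load -> stall -> compute -> store
}

int main()
{
    const int n = 1 << 22;                  // ~4M elements -> ~4M threads
    float *in, *out;                        // left uninitialized: the point here is
    cudaMalloc(&in,  n * sizeof(float));    // the scheduling pattern, not the values
    cudaMalloc(&out, n * sizeof(float));

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);      // device 0, assumed present
    int block = 256;
    int grid  = (n + block - 1) / block;
    printf("launching %d threads on %d multiprocessors\n",
           grid * block, prop.multiProcessorCount);

    scale<<<grid, block>>>(in, out, n);     // far more threads than processors: the
    cudaDeviceSynchronize();                // scheduler always has a warp to swap in

    cudaFree(in); cudaFree(out);
    return 0;
}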

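To make the arithmetic-intensity ratio from the GPU Specifics slide concrete, here is a hedged sketch; the kernels saxpy and poly and the iteration count are illustrative, not the lecture's examples. The first kernel fetches 12 bytes and performs 2 floating-point operations per element, so it is memory-bound; the second reuses each fetched value ITER times, raising the flops-per-byte ratio until the kernel becomes compute-bound.

#include <cuda_runtime.h>

// SAXPY: per element it moves 3 floats (2 loads + 1 store = 12 bytes)
// and performs 2 floating-point ops (1 mul + 1 add).
// Intensity ~ 2 flops / 12 bytes ~ 0.17 flops/byte -> memory-bound.
__global__ void saxpy(float a, const float *x, const float *y, float *z, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        z[i] = a * x[i] + y[i];
}

// Same traffic per element (1 load + 1 store = 8 bytes), but ~2*ITER flops,
// so the intensity grows with ITER and the kernel becomes compute-bound.
#define ITER 64
__global__ void poly(const float *x, float *z, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        float acc = 0.0f;
        for (int k = 0; k < ITER; ++k)       // 1 mul + 1 add per iteration
            acc = acc * v + 1.0f;            // Horner-style update
        z[i] = acc;
    }
}

int main()
{
    const int n = 1 << 20;
    float *x, *y, *z;                        // values left uninitialized in this sketch
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMalloc(&z, n * sizeof(float));
    int block = 256, grid = (n + block - 1) / block;
    saxpy<<<grid, block>>>(2.0f, x, y, z, n);  // memory-bound
    poly<<<grid, block>>>(x, z, n);            // compute-bound once ITER is large
    cudaDeviceSynchronize();
    cudaFree(x); cudaFree(y); cudaFree(z);
    return 0;
}

Using the GTX 280 numbers from the table above (933 Gigaflops vs. 141.7 GB/s), a kernel needs roughly 933 / 141.7 ≈ 6.6 flops per byte fetched before the ALUs, rather than memory bandwidth, become the limit.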

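Finally, the architecture slides stress that memory management is key. The hedged sketch below (the kernel name blockSum and the sizes are illustrative) uses the per-multiprocessor shared memory as a software-managed cache: each block loads its slice of global memory exactly once, then finishes its work on-chip, which is the kind of data-locality discipline the slides call for.

#include <cuda_runtime.h>

// Each block stages its slice of the input in on-chip shared memory
// (the per-multiprocessor cache from the architecture slides), then
// reduces it there instead of re-reading slow global memory.
__global__ void blockSum(const float *in, float *out, int n)
{
    __shared__ float tile[256];              // one element per thread; block size must match
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    tile[tid] = (i < n) ? in[i] : 0.0f;      // one global read per element
    __syncthreads();                         // wait until the whole tile is loaded

    // Tree reduction carried out entirely in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = tile[0];           // one global write per block
}

int main()
{
    const int n = 1 << 20;
    const int block = 256, grid = (n + block - 1) / block;

    float *in, *out;                         // input left uninitialized: the point is
    cudaMalloc(&in,  n * sizeof(float));     // the memory-access pattern, not the values
    cudaMalloc(&out, grid * sizeof(float));

    blockSum<<<grid, block>>>(in, out, n);
    cudaDeviceSynchronize();

    cudaFree(in); cudaFree(out);
    return 0;
}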