CSE 591: GPU Programming
Introduction
Klaus Mueller
Computer Science Department
Stony Brook University

Entertainment Graphics: Virtual Realism for the Masses
Computer games need to have:
• realistic appearance of characters and objects
• believable and creative shading, textures, and surface details
• realistic physics and kinematics
• effects that are customizable and interactive

High Performance Computing on the Desktop
PC graphics boards featuring GPUs:
• NVidia FX, ATI Radeon
• available at every computer store for less than $500
• set up your PC in less than an hour and play
The latest board: NVIDIA GeForce GTX 280

“Just” Computing
Compute-only (no graphics): NVIDIA Tesla 1060
• true GPGPU (General-Purpose computing using GPU technology)
• 4 GB memory per card
• bundle up to 4 cards: 960 processors, 16 GB memory

Incredible Growth
The performance gap between GPU and CPU is growing:
• currently 1–2 orders of magnitude are achievable (given appropriate programming and problem decomposition)
• 1.3 Tflops (GTX 480) for $500 (March 2010)

History: Accelerated Graphics
1990s: rise (and fall) of Silicon Graphics (SGI):
• pioneered the first graphics accelerator in 1991
• developed OpenGL
• evolved 2D to 3D graphics capabilities through the mid-90s
• desktop to high-end performance workstations (O2, Octane, Onyx)
• #1: expensive
• #2: non-programmable

The Graphics Pipeline
Old-style, non-programmable:
• a fixed-function pipeline transforms a stream of (X, Y, Z, W) vertices, together with textures, into a stream of (R, G, B, A) pixels

History: Cheap Consumer Graphics
Late 1990s: rise of consumer graphics chips:
• 1994: Voodoo graphics chip (and later the popular Voodoo 2)
• chips still separate from memory
• other chips soon emerged: ATI Rage, NVIDIA Riva
End of the 1990s: consumer graphics boards with high-end graphics:
• transform, lighting, setup, and rendering all on a single GPU
• the world’s first GPU: NVIDIA GeForce 256 (NV10)
• → #1: inexpensive
• #2: non-programmable

History: Programmability, GPGPU
2000s: emergence of programmable consumer graphics hardware:
• programmable vertex and fragment shaders
• graphics cards: NVIDIA GeForce 3, ATI Radeon 9700
• evolving capabilities for floating point, loops, and conditionals
• enabled GPGPU
• HW shading languages: Cg, GLSL, HLSL
• SW graphics APIs: OpenGL, DirectX
• → #1: inexpensive
• → #2: programmable

The Graphics Pipeline
Modern, programmable:
• the vertex and fragment stages are now programmable; a stream of (X, Y, Z, W) vertices plus textures still yields a stream of (R, G, B, A) pixels

History: Focus Parallel Computing
2006: parallel computing languages appear:
• address the need for a dedicated SDK and API for parallel high-performance computing (GPGPU)
• CUDA (Compute Unified Device Architecture), developed by NVIDIA
• OpenCL (Open Computing Language), initially developed by Apple, now with the Khronos Compute Working Group
• specific GPGPU boards: NVIDIA Tesla, AMD FireStream
• other parallel-computing chips: Intel Larrabee, IBM Cell BE

Hardware Architecture
GPU:
• compute-intensive
• highly data-parallel
• originally SIMD
Programming language:
• exposes the parallel capabilities of GPUs
(ALU: arithmetic logic unit)

GPU Vital Specs
                        GeForce 8800 GTX     GeForce GTX 280      GeForce GTX 480
Codename                G80                  D10U-23              GF100
Release date            11/2006              6/2008               3/2010
Transistors             681 M (90 nm)        1400 M (65 nm)       3000 M (40 nm)
Clock speed             1350 MHz             1296 MHz             1401 MHz
Processors              128                  240                  480
Peak pixel fill rate    13.8 Gpixels/s       19.3 Gpixels/s       33.6 Gpixels/s
Peak memory bandwidth   86.4 GB/s (384-bit)  141.7 GB/s (512-bit) 177.4 GB/s (384-bit)
Memory                  768 MB               1024 MB              1536 MB
Peak performance        520 Gflops           933 Gflops           1.3 Tflops

Comparison with CPUs
                            Intel Core 2 Quad   GeForce GTX 280   GeForce GTX 480
Cores                       4                   8 × 30            32 × 15
SIMD width / core           4                   32                32
Managed / executed threads  N/A                 32                48
Clock speed                 3 GHz               1.3 GHz           1.4 GHz
Performance                 96 Gflops           933 Gflops        1345 Gflops

GPU vs. CPU
Highly parallel:
• the GeForce 8800 has 128 processors (128-way parallel)
Memory very close to the processors → fast data transfer:
• a CPU requires lots of cache logic and communication
High % of GPU chip real estate devoted to computing:
• small in CPUs (for example, 6.5% in the Intel Itanium)
In many cases speedups of 1–2 orders of magnitude can be obtained by porting to the GPU:
• more details on the rules for effective porting later

GPU Architecture: Overview
128 processors → 16 multiprocessors of 8 processors each:
• local cache L1 (4k)
• shared cache L2 (1M)
• DRAM (global memory)
Memory management is key! Thread management is key!

GPU Architecture: Different View
• each multiprocessor is a SIMD (Single Instruction, Multiple Data) architecture
• the Constant Cache and Texture Cache are shared at the multiprocessor level
• the Device Memory (Global Memory), shared at the device level, has read-write access
• each multiprocessor is equipped with a set of local 32-bit registers (L1 and L2 caches)

History: Focus “Serious” Computing
2009: next-generation CUDA architectures announced:
• NVIDIA Fermi, AMD Cypress
• substrate for supercomputing
• focused on “serious” high-performance computing (clusters, etc.)
Enrico Fermi (1901–1954):
• Italian physicist, one of the top scientists of the 20th century
• developed the first nuclear reactor
• contributed to quantum theory, statistical mechanics, and nuclear and particle physics
• Nobel Prize in Physics in 1938 for his work on induced radioactivity

GPU Specifics
All standard graphics operations are hardwired:
• linear interpolations
• matrix and vector arithmetic (+, −, *)
Arithmetic intensity:
• the ratio of ALU arithmetic per operand fetched
• needs to be reasonably high, else the application is memory-bound
GPU memory is 1–2 orders of magnitude slower than the GPU processors:
• computation is often better than table look-ups
• indirections can be expensive
Be aware of the GPU’s 2D caching protocol (for texture memory):
• data is fetched in 2D tiles (recall bilinear texture filtering in graphics)
• promote data locality in 2D tiles

Latency Hiding
GPUs provide hardware multi-threading:
• it kicks in when a thread within a core ALU stalls (waiting for memory, etc.)
• another SIMD thread is then swapped in for execution
• this hides the latency for the stalled thread
• the GPU allows many more threads to be maintained than are SIMD-executed
Hardware multi-threading requires memory:
• the contexts of all such threads must be maintained in registers or memory