GT ECE 4893 - LECTURE NOTES
School name Georgia Tech
“Classic” GPGPU
Prof. Aaron Lanterman
School of Electrical and Computer Engineering
Georgia Institute of Technology

“Classic” vs. “Modern” GPGPU
• “Classic GPGPU”
  – User must map their algorithm to a graphics framework not designed with general computation in mind
  – GPGPU implementations on the PlayStation 3’s RSX and the Xbox 360’s Xenos will need to be programmed in this mindset
  – Many PC users will have older graphics cards
• “Modern GPGPU”
  – Runs on hardware (e.g., NVIDIA G80) specifically designed to be friendly to GPGPU computations
  – NVIDIA’s Compute Unified Device Architecture (CUDA) framework hides low-level details
    • See Hyesoon Kim’s Spring 2008 class, CS8803SC: Software and Hardware Cooperative Computing
  – ATI’s CTM (Close-to-Metal) provides powerful low-level access to GPGPU-friendly hardware, but requires users to “roll their own”

Typical classic GPGPU setup
• Load inputs into textures
  – GPU textures <-> CPU arrays
  – Texture coordinates <-> computational domain
• Draw a quadrangle
• Do computation in pixel shaders
  – Older GPUs typically have more pixel shader units than vertex shader units
  – Older GPUs can’t do texture fetch in vertex shaders
  – GPU pixel shaders <-> CPU inner loops
  – Texture sample (tex2D) <-> CPU memory read
• Render output as pixels in the quadrangle
  – Render-to-texture <-> CPU memory write
  – Vertex coordinates <-> computational range

Reference
• A lot of this discussion is inspired by various chapters in GPU Gems 2
  – Excellent reference for “classic GPGPU”
  – GPU Gems 3 focuses more on “modern GPGPU” solutions

Scattering vs. gathering on a typical GPU
• Gather:
  – Indirect read: a = x[i]
    • Texture fetch: tex2D(x,...)
• Scatter:
  – Indirect write: a[i] = x
    • No arbitrary texture writes on the GPU!
  – Can do “ordered writes” with “render to texture”

Mapping
• Apply some function to each element in parallel
• Store input values as texels in a texture
• Draw a quadrangle with as many pixels as there are texels in the input
• Pixel shader just computes the function
• Output values are the pixels of the rendered quadrangle
  – Use “render-to-texture” to use the mapped values in another stage

Reduction
• Associative operations
  – Ex: sums, products, min, max
• Do multiple passes, rendering to smaller textures with each pass
[Figure: min example]
• Could do more than four per pixel
  – Depending on how many “texture fetches” are allowed
  – Fewer passes, but increased time per pass

Flow control is tricky
• Most older GPUs don’t “really” support branching
• Loops are typically “unrolled” by the compiler
• Predicated branching:
  – Compute both parts of the branch
  – Use results from only one branch
• On some GPUs, vertex shaders support branching but pixel shaders don’t
• Even newer GPUs that directly support branching may give you a significant performance hit
  – Parallel execution units may be restricted to executing only one branch at a time (locality is important)

Static branch resolution
• Instead of branching in the pixel shader, execute different pixel shaders on different output quadrangles
• Ex: boundary conditions in PDEs
Info & examples from pp. 549-550 of GPU Gems 2

Skipping unnecessary work with Z-cull
• Setup
  – Write 0 to the z-buffer of pixels where you want to skip computation
  – Write 1 to the z-buffer of pixels where you do want to compute
• Feed the pixel shader doing the computation a z-buffer value of 0.5
• Ex: landlocked cells in fluid simulation
• Warning: GPU may do z-culling at a coarser resolution than the pixels
  – Will skip shading only if all pixels in a region fail the depth test
• Can do similar tricks with alpha stenciling
Info & examples from pp. 550-552 of GPU Gems 2

Applications tailor-made for GPGPU
• Applications with high compute-to-communication ratios
• Partial differential equations
• Cellular automata
  – SimCity!
• Linear algebra
  – Chapter 44 of GPU Gems 2
  – There are clever ways to handle banded matrices, sparse matrices, etc.

GPGPU applications in games
• Collision detection
• Physics
  – Rigid bodies
  – Fluids, clouds, smoke, cloth
• Particle systems
• Line of sight calculations for AI?
• Aside:
  – AGEIA is marketing a Physics Processing Unit (PPU) to accelerate their “PhysX” SDK
  – Havok FX can exploit NVIDIA and ATI GPUs

General advice
• Quite often better to store long 1D arrays as “wrapped” 2D arrays
  – 1D arrays are limited in length
  – GPUs seem to be faster at handling 2D textures than 1D
• Pre-compute constants on the CPU
• Pre-compute low-dimensional functions and store them as textures
  – GPU will naturally do an interpolated lookup

Limitations of GPGPU
• 32-bit floats (used to be worse!)
  – There’s talk about 64-bit floats, but can the mass market support that?
• May not have integers
  – Can be a problem with precise texture accesses
• May not have bitwise operations
• Not-so-great with dynamic data structures (queues, stacks, trees, etc.)
  – But lots of clever people have come up with tricks
• Need to think outside the box: strange algorithms that might be silly on a CPU might map well to a GPU

Some successful GPGPU applications
• Computed Tomography (CT)
• MRI (application of FFTs)
• Phase unwrapping for Synthetic Aperture Radar (SAR)
  – 35x speedup obtained by Peter Karasev, Dan Campbell, and Mark Richards
• Data mining
• Raytracing
  – Remember the line-triangle intersection lecture?

Quake 3: Raytraced
Images from
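The mapping and reduction slides can be illustrated with a small CPU-side sketch. This is not GPU code and not taken from the lecture: render_pass, map_texture, and reduce_texture are hypothetical names standing in for "draw a quadrangle with a pixel shader bound", and the 4x4 input values are made up for illustration. The sketch assumes a square texture whose side is a power of two and an associative operation, with each reduction pass fetching four texels per output pixel.

```python
# CPU-side sketch of two classic GPGPU patterns: map and multipass reduction.
# All function names are hypothetical; a nested list plays the role of a texture.

def render_pass(width, height, pixel_shader):
    """Emulate drawing a width x height quadrangle: run the shader once per output pixel."""
    return [[pixel_shader(x, y) for x in range(width)] for y in range(height)]


def map_texture(texture, func):
    """Map: one output pixel per input texel; the shader gathers one texel and applies func."""
    height, width = len(texture), len(texture[0])
    return render_pass(width, height, lambda x, y: func(texture[y][x]))


def reduce_texture(texture, op):
    """Reduction: each pass renders to a texture half the size in each dimension.

    Every output pixel combines a 2x2 block of input texels with op.
    Assumes a square, power-of-two texture and an associative op.
    Returns the reduced value and the number of passes used.
    """
    passes = 0
    while len(texture) > 1:
        half = len(texture) // 2
        src = texture  # the previous render target becomes the input texture

        def shader(x, y, src=src):
            # Four texture fetches (gathers) per output pixel.
            a = src[2 * y][2 * x]
            b = src[2 * y][2 * x + 1]
            c = src[2 * y + 1][2 * x]
            d = src[2 * y + 1][2 * x + 1]
            return op(op(a, b), op(c, d))

        texture = render_pass(half, half, shader)
        passes += 1
    return texture[0][0], passes


if __name__ == "__main__":
    # Made-up input values, not the numbers from the slide's figure.
    data = [
        [7, 2, 15, 9],
        [10, 4, 8, 30],
        [11, 6, 21, 3],
        [5, 19, 12, 14],
    ]
    squared = map_texture(data, lambda v: v * v)
    smallest, num_passes = reduce_texture(data, min)
    print(squared[0], smallest, num_passes)
```

Note that both patterns only gather: each output pixel reads whatever texels it needs and writes exactly one value at its own position, which is why they fit hardware without arbitrary texture writes. Fetching a larger block per output pixel would cut the number of passes at the cost of more work per pass, as the reduction slide points out.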

