OSU CS 419 - LECTURE NOTES
mjb – November 21, 2011
Oregon State University
Computer Graphics

NVIDIA's Compute Unified Device Architecture (CUDA)

Mike Bailey
[email protected]
Oregon State University


Reaching the Promised Land

General Programmability + Speed:
• NVIDIA GPUs → CUDA
• Intel CPUs → Knights Corner


History of GPU Performance vs. CPU Performance

[Chart: GFLOPS over time, GPUs vs. CPUs. Legend: G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800. Source: NVIDIA]


Why have GPUs Been Outpacing CPUs?

Due to the nature of graphics computations, GPU chips are customized to handle streaming data. This means that the data is already sequential, so GPU chips do not need the significant amount of cache space that dominates the real estate on general-purpose CPU chips. The GPU die real estate can then be re-targeted to hold more cores and thus to produce more processing power.

For example, while Intel and AMD are now shipping CPU chips with 4 cores, NVIDIA is shipping GPU chips with 512. Overall, in four years, GPUs have achieved a 17.5-fold increase in performance, a compound annual increase of 2.05X.

The other reason is that general CPU chips contain on-chip logic to process some instructions out of order if the CPU is blocked waiting on something (e.g., a memory fetch). This, too, takes up chip die space.


How Can You Gain Access to that GPU Power?

1. Write a graphics display program (≥ 1985)
2. Write an application that looks like a graphics display program (≥ 2002)
3. Write in CUDA, which looks like C++ (≥ 2006)


CUDA Architecture

• The GPU has some number of MultiProcessors (MPs), depending on the model
• The NVIDIA Fermi 480 and above have 16 MPs
• A GeForce 8000-series MP has 8 independent processors (cores)
• A Fermi-based MP has 32 independent processors (cores)
• Memory is divided into Shared Memory and Constant Memory


The CUDA Paradigm

C++ program with both host and CUDA code in it:
• Host code → Compiler and Linker → CPU binary on the host
• CUDA code → Compiler and Linker → CUDA binary on the GPU

CUDA is an NVIDIA-only product, but it is very popular, and it got the GPU-as-CPU ball rolling, which has resulted in other products like OpenCL.


If GPUs have so Little Cache, how can they Execute General C++ Code Efficiently?

1. Multiple MultiProcessors
2. Threads – lots and lots of threads

• CUDA expects you to not just have a few threads, but to have thousands of them!
• All threads execute the same code (called the kernel), but operate on different data
• Each thread can figure out which number it is, and thus what its job is
• Think of all the threads as living in a "pool", waiting to be executed
• All processors start by grabbing a thread from the pool
• When a thread gets blocked somehow (a memory access, waiting for information from another thread, etc.), the processor quickly returns the thread to the pool and grabs another one to work on
• This thread-swap happens within a single cycle

A full memory access requires 200 instruction cycles to complete.


So, the Trick is to Break your Problem into Many, Many Small Pieces

Particle systems are a great example:

1. Have one thread per particle.
2. Put all of the initial parameters into an array in GPU memory.
3. Tell each thread what the current Time is.
4. Each thread then computes its particle's position, color, etc. and writes it into arrays in GPU memory.
5. The CPU program then initiates drawing of the information in those arrays.

Note: once set up, the data never leaves GPU memory!
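The kernel-and-thread-numbering model described in the slides above (one C++ source file holding both host and kernel code; thousands of threads, each computing its own number to find its job) can be sketched roughly as follows. The function names, array names, and launch sizes here are illustrative assumptions, not from the slides:

```cuda
// A minimal sketch of the CUDA paradigm: one source file with both
// host code and a kernel; nvcc produces a CPU binary plus a GPU binary.

__global__
void AddArrays( int num, const float *dA, const float *dB, float *dC )
{
    // Every thread runs this same kernel code; each thread computes
    // its own global number, and from that, which element is its job:
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if( gid < num )
        dC[gid] = dA[gid] + dB[gid];
}

int main( )
{
    const int NUM = 1024*1024;    // thousands of threads, as CUDA expects

    float *dA, *dB, *dC;
    cudaMalloc( (void **)&dA, NUM*sizeof(float) );   // arrays live in GPU memory
    cudaMalloc( (void **)&dB, NUM*sizeof(float) );
    cudaMalloc( (void **)&dC, NUM*sizeof(float) );

    // Launch NUM threads as blocks of 128 threads each (illustrative sizes):
    AddArrays<<< NUM/128, 128 >>>( NUM, dA, dB, dC );
    cudaDeviceSynchronize( );

    cudaFree( dA );   cudaFree( dB );   cudaFree( dC );
    return 0;
}
```

The `if( gid < num )` guard matters when the thread count is rounded up to a whole number of blocks: extra threads simply do nothing rather than write past the end of the arrays.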
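The five particle-system steps above can be sketched as a kernel plus a per-frame host launch. The particle state layout (simple ballistic motion from initial position and velocity), the names, and the block size are illustrative assumptions, not from the slides:

```cuda
// One thread per particle; all particle state stays in GPU arrays.
// Layout and physics here are illustrative, not from the lecture.

__global__
void UpdateParticles( int numParticles, float currentTime,
                      const float *dX0, const float *dY0,     // initial positions
                      const float *dVx, const float *dVy,     // initial velocities
                      float *dX, float *dY )                  // outputs, drawn later
{
    // Step 4: each thread figures out which particle is its job...
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if( gid >= numParticles )
        return;

    // ...and computes that particle's position from the current Time:
    const float GRAVITY = -9.8f;
    dX[gid] = dX0[gid] + dVx[gid] * currentTime;
    dY[gid] = dY0[gid] + dVy[gid] * currentTime
                       + 0.5f * GRAVITY * currentTime * currentTime;
}

// Steps 3 and 5, host side: each frame the CPU only passes in the Time
// and then initiates drawing -- the per-particle data never leaves GPU memory.
void OnFrame( int numParticles, float currentTime,
              const float *dX0, const float *dY0,
              const float *dVx, const float *dVy,
              float *dX, float *dY )
{
    const int THREADS_PER_BLOCK = 128;    // illustrative
    int numBlocks = ( numParticles + THREADS_PER_BLOCK - 1 ) / THREADS_PER_BLOCK;

    UpdateParticles<<< numBlocks, THREADS_PER_BLOCK >>>(
            numParticles, currentTime, dX0, dY0, dVx, dVy, dX, dY );

    // ...then draw from dX / dY, e.g. via a buffer shared with OpenGL
}
```

The initial parameters (step 2) would be uploaded into the `dX0`/`dY0`/`dVx`/`dVy` arrays once at setup time; after that, the only per-frame traffic from the CPU is the single `currentTime` value.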