ME964: High Performance Computing for Engineering Applications
Execution Scheduling in CUDA / Revisiting Memory Issues in CUDA
February 17, 2011
© Dan Negrut, 2011, ME964 UW-Madison

"Computers are useless. They can only give you answers." (Pablo Picasso)

Before We Get Started…

Last time:
- Wrapped up tiled matrix-matrix multiplication using shared memory
  - Shared memory is used to reduce some of the pain associated with global memory accesses
- Discussed thread scheduling for execution on the GPU

Today:
- Wrap up the discussion of execution scheduling on the GPU
- Discuss global memory access issues in CUDA

Other issues:
- HW3 is due tonight at 11:59 PM
  - Use the Learn@UW drop-box to submit your homework
  - Note that timing should be done as shown on slide 30 of the pdf posted for the 02/08 lecture
  - Building CUDA code in Visual Studio: issues arise with the "include" of a file that hosts the kernel definition (either don't include it, or force linking)
- HW4 was posted; due date: 02/22, 11:59 PM
- Please indicate your preference for the midterm project on the forum

Thread Scheduling/Execution

- Each thread block is divided into 32-thread warps
  - This is an implementation decision, not part of the CUDA programming model
- Warps are the basic scheduling units in an SM
- Example (draws on the figure described below): assume 2 blocks are processed by an SM and each block has 512 threads. How many warps are managed by the SM?
  - Each block is divided into 512/32 = 16 warps
  - There are 16 * 2 = 32 warps
- At any point in time, only *one* of the 32 warps will be selected for instruction fetch and execution

[Figure: a streaming multiprocessor (instruction L1, data L1, instruction fetch/dispatch unit, shared memory, eight SPs, and two SFUs) alongside the warps of Block 1 and Block 2, each warp spanning threads t0, t1, t2, ..., t31.]

SM Warp Scheduling [HK-UIUC]

- SM hardware implements zero-overhead warp scheduling
  - Warps whose next instruction has its operands ready for consumption are eligible for execution
  - Eligible warps are selected for execution based on a prioritized scheduling policy
  - All threads in a warp execute the same instruction when it is selected
- 4 clock cycles are needed to dispatch the same instruction for all threads in a warp on the Tesla C1060
- How is this relevant? Suppose your code has one global memory access every six simple instructions. Each warp then covers 6 * 4 = 24 cycles of useful work before it must wait on memory, so a minimum of 400/(6 * 4) = 16.67, i.e. 17, warps are needed to fully tolerate a 400-cycle memory latency.

[Figure: the SM multithreaded warp scheduler interleaving instruction streams over time: warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 35, warp 8 instruction 12, ..., warp 3 instruction 36.]

SM Instruction Buffer – Warp Scheduling [HK-UIUC]

- Fetch one warp instruction per cycle
  - From the instruction L1 cache
  - Into any instruction buffer slot
- Issue one "ready-to-go" warp instruction per 4 cycles
  - From any warp's instruction buffer slot
  - Operand scoreboarding is used to prevent hazards
  - Issue selection is based on round-robin/age of warp
- The SM broadcasts the same instruction to the 32 threads of a warp

[Figure: the SM pipeline: instruction L1 cache, multithreaded instruction buffer, register file, constant L1 cache, shared memory, operand select, MAD and SFU units.]
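The warp partitioning above is easy to verify empirically. Below is a minimal sketch, not from the lecture (the kernel name recordWarpId and the buffer names are made up for illustration), in which each thread records the warp it belongs to; launched with the slide's configuration of 2 blocks of 512 threads, it reports the 16 warps per block counted above.

// Minimal sketch: make the thread -> warp mapping visible. Each thread
// writes out which warp of its block it falls in; the host then derives
// the warps-per-block count from the last thread's warp index.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void recordWarpId(int *warpIdOut)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Threads are partitioned into warps by consecutive thread index;
    // warpSize is a built-in variable (32 on all CUDA hardware to date).
    warpIdOut[tid] = threadIdx.x / warpSize;
}

int main()
{
    const int blocks = 2, threadsPerBlock = 512;   // the slide's example
    const int n = blocks * threadsPerBlock;

    int *d_ids;
    cudaMalloc(&d_ids, n * sizeof(int));
    recordWarpId<<<blocks, threadsPerBlock>>>(d_ids);

    int h_ids[1024];
    cudaMemcpy(h_ids, d_ids, n * sizeof(int), cudaMemcpyDeviceToHost);

    // The last thread of a 512-thread block sits in warp 512/32 - 1 = 15,
    // i.e. 16 warps per block and 16 * 2 = 32 warps on the SM.
    printf("warps per block: %d\n", h_ids[threadsPerBlock - 1] + 1);
    cudaFree(d_ids);
    return 0;
}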
Scoreboarding [Mary Hall, U-Utah]

- Used to determine whether a thread is ready to execute
- A scoreboard is a table in hardware that tracks:
  - Instructions being fetched, issued, and executed
  - Resources (functional units and operands) needed by instructions
  - Which instructions modify which registers
- An old concept, introduced with the CDC 6600 (1960s) to separate memory and computation

Scoreboarding Example [Mary Hall, U-Utah]

Consider three separate instruction streams: warp 1, warp 3, and warp 8. The scheduler dispatches warp 8 instruction 11 at t=k, warp 1 instruction 42 at t=k+1, warp 3 instruction 95 at t=k+2, ..., warp 8 instruction 12 at t=l>k, and warp 3 instruction 96 at t=l+1.

Schedule at time k:

  Warp | Current Instruction | Instruction State
  -----+---------------------+-----------------------
  1    | 42                  | Computing
  3    | 95                  | Computing
  8    | 11                  | Operands ready to go

Schedule at time k+1:

  Warp | Current Instruction | Instruction State
  -----+---------------------+-----------------------
  1    | 42                  | Ready to write result
  3    | 95                  | Computing
  8    | 11                  | Computing

Scoreboarding [HK-UIUC]

- All register operands of all instructions in the instruction buffer are scoreboarded
  - Status becomes "ready" after the needed values are deposited
  - This prevents hazards
  - Cleared instructions are eligible for issue
- Decoupled memory/processor pipelines
  - Any thread can continue to issue instructions until scoreboarding prevents issue

Granularity Considerations [NOTE: specific to the Tesla C1060]

- For matrix multiplication, should I use 8x8, 16x16, or 32x32 tiles?
  - For 8x8, we have 64 threads per block. Since each Tesla C1060 SM can manage up to 1024 threads, it could take up to 16 blocks. However, each SM can only take up to 8 blocks, so only 8 * 64 = 512 threads will go into each SM!
  - For 16x16, we have 256 threads per block. Since each SM can take up to 1024 threads, it can take up to 4 blocks, unless other resource considerations overrule.
  - For 32x32, we have 1024 threads per block. This is not an option anyway: the C1060 allows at most 512 threads per block.
- NOTE: this type of thinking should be carried out for your target hardware (hence the need for auto-tuning software); sketches of a parameterized tile width and of a run-time hardware query are given at the end of these notes.

ILP vs. TLP Example

- Assume that a kernel has 256-thread blocks, 4 independent instructions for each global memory load in the thread program, and that each thread uses 20 registers
- Also assume that global loads have an associated latency of 400 cycles
- Then 3 blocks can run on each SM
- If the compiler can use two more registers to change the dependence pattern so that 8 independent instructions exist (instead of 4) for each global memory load:
  - Only two blocks can now run on each SM
  - However, one now needs only 400 cycles / (8 instructions * 4 cycles/instruction) ≈ 13 warps to tolerate the memory latency
  - Two blocks have 16 warps, so the performance can actually be higher!

Summing It Up…

- When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to SMs with available execution capacity
- The threads of a block execute concurrently on one SM, and multiple blocks (up to 8) can execute concurrently on one SM
- When a thread block finishes, a new block is launched on the vacated SM

A Word on HTT [Detour: slide 1/2]

- The traditional host processor (CPU) may stall due to a cache miss, branch misprediction, or data dependency
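To make the granularity discussion concrete, here is a minimal sketch of the kind of tiled kernel the tile-size question refers to. It is written from the description in the previous lecture, not copied from the course code; TILE_WIDTH and matMulTiled are illustrative names, and square n x n matrices with n a multiple of TILE_WIDTH are assumed for brevity.

// Hypothetical sketch of a shared-memory tiled matrix multiply whose
// block granularity is set by one compile-time parameter. TILE_WIDTH = 8
// gives 64-thread blocks; TILE_WIDTH = 16 gives 256-thread blocks.
// Launch as: dim3 block(TILE_WIDTH, TILE_WIDTH);
//            dim3 grid(n / TILE_WIDTH, n / TILE_WIDTH);
#define TILE_WIDTH 16

__global__ void matMulTiled(const float *A, const float *B, float *C, int n)
{
    // Each block computes one TILE_WIDTH x TILE_WIDTH tile of C, staging
    // tiles of A and B through shared memory to cut global memory traffic.
    __shared__ float As[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Bs[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float acc = 0.0f;

    for (int m = 0; m < n / TILE_WIDTH; ++m) {
        // Each thread loads one element of the current A tile and B tile.
        As[threadIdx.y][threadIdx.x] = A[row * n + m * TILE_WIDTH + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(m * TILE_WIDTH + threadIdx.y) * n + col];
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}

With TILE_WIDTH = 8, each block has 64 threads and the 8-blocks-per-SM cap strands half of the C1060 SM's 1024-thread budget; with TILE_WIDTH = 16, four 256-thread blocks fill that budget, which is exactly the slide's argument for 16x16 tiles on this hardware.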
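The auto-tuning note suggests reading the target GPU's limits at run time rather than hard-coding the C1060's numbers. Below is a minimal sketch using the CUDA runtime's cudaGetDeviceProperties; the printed fields exist in cudaDeviceProp, though the per-SM block cap (8 on this hardware generation) is not among the queryable fields in the CUDA toolkits of this era and still has to come from the programming guide.

// Minimal sketch: query the hardware limits that drive the granularity
// arithmetic above, instead of hard-coding the Tesla C1060's numbers.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    printf("device:                  %s\n", prop.name);
    printf("warp size:               %d\n", prop.warpSize);
    printf("max threads per block:   %d\n", prop.maxThreadsPerBlock);
    printf("max threads per SM:      %d\n", prop.maxThreadsPerMultiProcessor);
    printf("registers per block:     %d\n", prop.regsPerBlock);
    printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}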

