ME964 - High Performance Computing for Engineering Applications
Execution Scheduling in CUDA / Revisiting Memory Issues in CUDA
February 17, 2011
"Computers are useless. They can only give you answers." - Pablo Picasso
(c) Dan Negrut, 2011, ME964 UW-Madison

Before We Get Started...
- Last time:
  - Wrapped up tiled matrix-matrix multiplication using shared memory
  - Shared memory used to reduce some of the pain associated with global memory accesses
  - Discussed thread scheduling for execution on the GPU
- Today:
  - Wrap up the discussion of execution scheduling on the GPU
  - Discuss global memory access issues in CUDA
- Other issues:
  - HW3 due tonight at 11:59 PM; use the Learn@UW drop-box to submit homework
  - Note that timing should be done as shown on slide 30 of the PDF posted for the 02/08 lecture
  - Building CUDA code in Visual Studio: issues related to "include" of a file that hosts the kernel definition (either don't include it, or force linking)
  - HW4 was posted; due 02/22 at 11:59 PM
  - Please indicate your preference for the midterm project on the forum

Thread Scheduling/Execution
- Each thread block is divided into 32-thread warps
  - This is an implementation decision, not part of the CUDA programming model
- Warps are the basic scheduling units in an SM
- Example (draws on the figure at right): assume 2 blocks are processed by an SM and each block has 512 threads. How many warps are managed by the SM?
- Each block is divided into 512/32 = 16 warps
- There are 16 * 2 = 32 warps
- At any point in time, only *one* of the 32 warps will be selected for instruction fetch and execution
[Figure: the warps of Block 1 and Block 2 (threads t0, t1, ..., t31 each) mapped onto a Streaming Multiprocessor with its SPs and SFUs, Instruction Fetch/Dispatch unit, Instruction L1, Data L1, and Shared Memory]

SM Warp Scheduling
- SM hardware implements zero-overhead warp scheduling
  - Warps whose next instruction has its operands ready for consumption are eligible for execution
  - Eligible warps are selected for execution based on a prioritized scheduling policy
  - All threads in a warp execute the same instruction when selected
- 4 clock cycles are needed to dispatch the same instruction for all threads in a warp on the C1060
- How is this relevant? Suppose your code has one global memory access every six simple instructions. Then a minimum of 17 warps is needed to fully tolerate a 400-cycle memory latency: 400/(6*4) = 16.67, which rounds up to 17 warps
[Figure: the SM multithreaded warp scheduler issuing, over time: warp 8 instruction 11; warp 1 instruction 42; warp 3 instruction 35; warp 8 instruction 12; ...; warp 3 instruction 36]
HK-UIUC

SM Instruction Buffer – Warp Scheduling
- Fetch one warp instruction per cycle
  - from the instruction L1 cache
  - into any instruction buffer slot
- Issue one "ready-to-go" warp instruction per 4 cycles
  - from any warp / instruction buffer slot
  - operand scoreboarding is used to prevent hazards
  - issue selection is based on round-robin / age of warp
- The SM broadcasts the same instruction to the 32 threads of a warp
[Figure: instruction L1, multithreaded instruction buffer, register file, constant L1, shared memory, operand select, MAD and SFU units]
HK-UIUC

Scoreboarding
- Used to determine whether a thread is ready to execute
- A scoreboard is a table in hardware that tracks:
  - instructions being fetched, issued, and executed
  - resources (functional units and operands) needed by instructions
  - which instructions modify which registers
- Old concept from the CDC 6600 (1960s), used to separate memory and computation
Mary Hall, U-Utah

Scoreboarding Example
- Consider three separate instruction streams: warp 1, warp 3, and warp 8
- Issue order over time: t=k: warp 8, instruction 11; t=k+1: warp 1, instruction 42; t=k+2: warp 3, instruction 95; ...; t=l (l > k): warp 8, instruction 12; t=l+1: warp 3, instruction 96

Schedule at time k:
  Warp     Current Instruction    Instruction State
  Warp 1   42                     Computing
  Warp 3   95                     Computing
  Warp 8   11                     Operands ready to go
Mary Hall, U-Utah

Schedule at time k+1:
  Warp     Current Instruction    Instruction State
  Warp 1   42                     Ready to write result
  Warp 3   95                     Computing
  Warp 8   11                     Computing
Mary Hall, U-Utah

Scoreboarding
- All register operands of all instructions in the instruction buffer are scoreboarded
  - Status becomes "ready" after the needed values are deposited
  - This prevents hazards
  - Cleared instructions are eligible for issue
- Decoupled memory/processor pipelines
  - Any thread can continue issuing instructions until scoreboarding prevents issue
HK-UIUC

Granularity Considerations [NOTE: specific to the Tesla C1060]
- For matrix multiplication, should I use 8x8, 16x16, or 32x32 tiles?
  - For 8x8, we have 64 threads per block. Since each Tesla C1060 SM can manage up to 1024 threads, it could take up to 16 blocks. However, each SM can only take up to 8 blocks, so only 512 threads will go into each SM!
  - For 16x16, we have 256 threads per block. Since each SM can take up to 1024 threads, it can take up to 4 blocks, unless other resource considerations overrule.
  - For 32x32, we have 1024 threads per block. This is not an option anyway, since a block is limited to at most 512 threads.
- NOTE: this type of thinking should be invoked for your target hardware (hence the need for auto-tuning software...)

ILP vs. TLP Example
- Assume that a kernel has 256-thread blocks (8 warps per block), 4 independent instructions for each global memory load in the thread program, and each thread uses 20 registers
- Also assume that global loads have an associated latency of 400 cycles
  - 3 blocks can run on each SM
  - With 4 independent instructions per load, tolerating the 400-cycle latency takes 400/(4*4) = 25 warps, but 3 blocks supply only 3*8 = 24 warps
- If the compiler can use two more registers to change the dependence pattern so that 8 independent instructions exist for each global memory load (instead of 4):
  - Only two blocks can now run on each SM
  - However, only 400/(8*4) = 12.5, i.e. 13 warps, are now needed to tolerate the memory latency
  - Two blocks have 2*8 = 16 warps, enough to cover it; the performance can actually be higher!

Summing It Up...
- When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to SMs with available execution capacity
- The threads of a block execute concurrently on one SM, and multiple blocks (up to 8) can execute concurrently on one SM
- When a thread block finishes, a new block is launched on the vacated SM

A Word on HTT [Detour: slide 1/2]
- The traditional host processor (CPU) may stall due to a cache miss, branch misprediction, or data dependency