CSE 591: GPU Programming
Performance Considerations
Klaus Mueller
Computer Science Department
Stony Brook University

Slides © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign

Optimizing Performance
A balancing act:
• global memory bandwidth
• local memory size
• dynamic partitioning of resources
• data prefetching
• instruction mix
• thread scheduling
• thread granularity

Quick terminology review
• Thread: concurrent code and associated state executed on the CUDA device (in parallel with other threads)
  – The unit of parallelism in CUDA
• Warp: a group of threads executed physically in parallel on G80 → SIMT (Single Instruction, Multiple Threads)
• Block: a group of threads that are executed together and form the unit of resource assignment
• Grid: a group of thread blocks that must all complete before the next kernel call of the program can take effect

How thread blocks are partitioned
• Thread blocks are partitioned into warps
  – Thread IDs within a warp are consecutive and increasing
  – Warp 0 starts with thread ID 0
• Partitioning is always the same
  – Thus you can use this knowledge in control flow
  – However, the exact size of warps may change from generation to generation
• However, DO NOT rely on any ordering between warps
  – If there are any dependencies between threads, you must use __syncthreads() to get correct results

Control Flow Instructions
• Main performance concern with branching is divergence
  – Threads within a single warp take different paths
  – Different execution paths are serialized on G80
• The control paths taken by the threads in a warp are traversed one at a time until there are no more
• A common case: avoid divergence when the branch condition is a function of the thread ID
  – Example with divergence:
    • if (threadIdx.x > 2) { }
    • This creates two different control paths for threads in a block
    • Branch granularity < warp size; threads 0, 1, and 2 follow a different path than the rest of the threads in the first warp
  – Example without divergence:
    • if (threadIdx.x / WARP_SIZE > 2) { }
    • Also creates two different control paths for threads in a block
    • Branch granularity is a whole multiple of warp size; all threads in any given warp follow the same path

Parallel Reduction
• Given an array of values, "reduce" them to a single value in parallel
• Examples:
  – Sum reduction: sum of all values in the array
  – Max reduction: maximum of all values in the array
• Typical parallel implementation:
  – Recursively halve the number of threads, adding two values per thread
  – Takes log(n) steps for n elements, requires n/2 threads

A Vector Reduction Example
• Assume an in-place reduction using shared memory
  – The original vector is in device global memory
  – The shared memory is used to hold a partial sum vector
  – Each iteration brings the partial sum vector closer to the final sum
  – The final solution will be in element 0
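The kernels on the following slides assume that each block's slice of the input vector has already been loaded from global memory into the shared-memory array partialSum. A minimal sketch of that loading step, assuming one element per thread and an illustrative input pointer named g_input (the kernel name, parameter names, and the write-back of element 0 are not from the slides), might look like this:

  // Hypothetical loading step assumed by the reduction kernels on the next slides.
  // Each thread copies one element of this block's slice of the global input into
  // shared memory; the barrier guarantees the whole partial-sum vector is in place
  // before the in-place reduction begins.
  __global__ void reduceKernel(const float *g_input, float *g_output)
  {
      extern __shared__ float partialSum[];       // blockDim.x floats, sized at launch

      unsigned int t     = threadIdx.x;
      unsigned int start = blockIdx.x * blockDim.x;

      partialSum[t] = g_input[start + t];         // global -> shared, one element per thread
      __syncthreads();                            // wait until every element is loaded

      // ... reduction loop from the following slides operates on partialSum[] ...

      if (t == 0)
          g_output[blockIdx.x] = partialSum[0];   // element 0 will hold this block's result
  }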
A simple implementation
• Assume we have already loaded the array into
  – __shared__ float partialSum[]

  unsigned int t = threadIdx.x;
  for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
      __syncthreads();
      if (t % (2*stride) == 0)
          partialSum[t] += partialSum[t+stride];
  }

Vector Reduction with Bank Conflicts
[Figure: array elements 0–15 reduced in place over iterations 1, 2, 3 — iteration 1 forms the pairwise sums 0+1, 2+3, 4+5, 6+7, 8+9, 10+11, ...; iteration 2 forms the sums of 0..3, 4..7, 8..11, ...; iteration 3 forms the sums of 0..7 and 8..15]

Vector Reduction with Branch Divergence
[Figure: the same reduction tree, annotated with the threads active in iteration 1 — Thread 0, Thread 2, Thread 4, Thread 6, Thread 8, Thread 10 — i.e., only the even-indexed threads perform additions]

Some Observations
• In each iteration, two control flow paths will be sequentially traversed for each warp
  – Threads that perform addition and threads that do not
  – Threads that do not perform addition may still cost extra cycles, depending on the implementation of divergence
• No more than half of the threads will be executing at any time
  – All odd-index threads are disabled right from the beginning!
  – On average, less than ¼ of the threads will be active across all warps over time
  – After the 5th iteration, entire warps in each block will be disabled: poor resource utilization, but no divergence
• This can go on for a while, up to 4 more iterations (512/32 = 16 = 2^4), where each iteration has only one thread activated until all warps retire

Shortcomings of the implementation
• Assume we have already loaded the array into
  – __shared__ float partialSum[]

  unsigned int t = threadIdx.x;
  for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
      __syncthreads();
      if (t % (2*stride) == 0)
          partialSum[t] += partialSum[t+stride];
  }

  BAD: Divergence due to interleaved branch decisions

A better implementation
• Assume we have already loaded the array into
  – __shared__ float partialSum[]

  unsigned int t = threadIdx.x;
  for (unsigned int stride = blockDim.x >> 1; stride > 0; stride >>= 1) {
      __syncthreads();
      if (t < stride)
          partialSum[t] += partialSum[t+stride];
  }

No Divergence until < 16 sub-sums
[Figure: reduction tree for the improved kernel — the first iteration adds elements 16–31 onto elements 0–15 (0+16, ..., 15+31), so active threads stay contiguous; divergence appears only once fewer than 16 sub-sums remain]

Some Observations About the New Implementation
• Only the last 5 iterations will have divergence
• Entire warps will be shut down as iterations progress
  – For a 512-thread block, it takes 4 iterations to shut down all but one warp in each block
  – Better resource utilization; warps will likely retire much earlier than in the previous implementation
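Putting the pieces together, a complete sketch of the improved reduction, together with a minimal host-side launch, might look like the following. The kernel body combines the assumed loading step with the reversed-stride loop from the "A better implementation" slide; the kernel name sumReduceKernel, the blockSums buffer, the 512-thread block size, and the host-side final sum across blocks are illustrative assumptions rather than part of the original slides.

  #include <cstdio>
  #include <cuda_runtime.h>

  // Sketch of the improved in-place reduction: reversed strides keep the active
  // threads contiguous, so whole warps retire early and only the last iterations
  // (stride below the warp size) diverge. Each block writes one partial sum back
  // to global memory; the host finishes the short final sum across blocks.
  __global__ void sumReduceKernel(const float *g_input, float *g_blockSums)
  {
      extern __shared__ float partialSum[];            // blockDim.x floats, set at launch

      unsigned int t     = threadIdx.x;
      unsigned int start = blockIdx.x * blockDim.x;

      partialSum[t] = g_input[start + t];              // load this block's slice
      __syncthreads();

      for (unsigned int stride = blockDim.x >> 1; stride > 0; stride >>= 1) {
          __syncthreads();
          if (t < stride)                              // active threads stay contiguous
              partialSum[t] += partialSum[t + stride];
      }

      if (t == 0)
          g_blockSums[blockIdx.x] = partialSum[0];     // element 0 holds the block's sum
  }

  int main()
  {
      const int threads = 512;                         // block size used in the slides' analysis
      const int blocks  = 64;
      const int n       = threads * blocks;

      float *d_input, *d_blockSums;
      cudaMalloc((void**)&d_input, n * sizeof(float));
      cudaMalloc((void**)&d_blockSums, blocks * sizeof(float));

      // Fill the input with ones on the host so the expected sum is simply n.
      float *h_input = new float[n];
      for (int i = 0; i < n; ++i) h_input[i] = 1.0f;
      cudaMemcpy(d_input, h_input, n * sizeof(float), cudaMemcpyHostToDevice);

      // Third launch parameter sizes the dynamic shared-memory array partialSum[].
      sumReduceKernel<<<blocks, threads, threads * sizeof(float)>>>(d_input, d_blockSums);

      float h_blockSums[blocks];
      cudaMemcpy(h_blockSums, d_blockSums, blocks * sizeof(float), cudaMemcpyDeviceToHost);

      float total = 0.0f;
      for (int i = 0; i < blocks; ++i) total += h_blockSums[i];   // final sum across blocks
      printf("sum = %f (expected %d)\n", total, n);

      delete[] h_input;
      cudaFree(d_input);
      cudaFree(d_blockSums);
      return 0;
  }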

