UW-Madison ME 964 - High Performance Computing for Engineering Applications


ME964: High Performance Computing for Engineering Applications
© Dan Negrut, 2012, ME964 UW-Madison

CUDA, Further Details: Execution Scheduling, Global Memory, Shared Memory
March 6, 2012

“A computer will do what you tell it to do, but that may be much different from what you had in mind” -- Joseph Weizenbaum

Before We Get Started…
- Last time:
  - Profiling CUDA code to improve performance
  - Examples: debugging and profiling of the 1D stencil code
  - Lessons learned:
    - Always spend a bit of time profiling your code
    - Most likely, what dictates the performance of your code is the efficiency of the memory transactions
- Today:
  - Scheduling issues, advanced topics
  - Device memory, advanced topics
  - Shared memory, advanced topics
- Other issues:
  - HW6 due on Th at 11:59 PM
  - Half-page proposal for your Midterm Project due on Th
  - For the default project (solving a dense banded linear system): no need to submit anything

Acknowledgement
- Several slides in today’s lecture include material provided by James Balfour of NVIDIA
- A sign such as the one at the bottom of this slide acknowledges his contribution
- Any inadvertence in these slides belongs to me

Thread Blocks Are Executed as Warps
- Each thread block is split into one or more warps
  - When the thread block size is not a multiple of the warp size, unused threads within the last warp are disabled automatically
- The hardware schedules each warp independently
  - Warps within a thread block can execute independently

[Figure: a block of 128 threads is split into four warps of 32 threads each]
NVIDIA [J. Balfour]

Organizing Threads into Warps
- Thread IDs within a warp are consecutive and increasing
  - This goes back to the 1D projection from thread index to thread ID
  - Remember: in multidimensional blocks, the x thread index runs first, followed by the y thread index, and finally by the z thread index
  - Threads with IDs 0 through 31 make up Warp 0, 32 through 63 make up Warp 1, etc. (a short sketch of this mapping follows the Control Flow Divergence [1/4] slide below)
- Partitioning of threads into warps is always the same
  - You can use this knowledge in control flow
  - So far, the warp size of 32 has been kept constant from device to device and from CUDA version to CUDA version
- While you can rely on ordering among threads, DO NOT rely on any ordering among warps, since there is no such thing
  - Warp scheduling is not something you control through CUDA

Thread and Warp Scheduling
- An SM can switch between warps with no apparent overhead
- Warps with an instruction whose inputs are ready are eligible to execute and will be considered when scheduling
- When a warp is selected for execution, all [active] threads execute the same instruction in lockstep fashion

[Figure: warp Wn is executing while warps W1 through W4 are either waiting for data or ready to execute]
NVIDIA [J. Balfour]

Filling Warps
- Prefer thread block sizes that result in mostly full warps:

      Bad:    kernel<<<N, 1>>>( ... )
      Okay:   kernel<<<(N+31) / 32, 32>>>( ... )
      Better: kernel<<<(N+127) / 128, 128>>>( ... )

- Prefer to have enough threads per block to provide the hardware with many warps to switch between
  - This is how the GPU hides memory access latency
- Resources like __shared__ may constrain the number of threads per block
  - The algorithm and its decomposition will establish some preferred amount of shared data and __shared__ allocation

NVIDIA [J. Balfour]

Control Flow Divergence [1/4]
- Consider the following code:

      __global__ void odd_even(int n, int* x)
      {
          int i = threadIdx.x + blockDim.x * blockIdx.x;
          if ((i & 0x01) == 0) {
              x[i] = x[i] + 1;
          }
          else {
              x[i] = x[i] + 2;
          }
      }

- Half the threads in the warp execute the if clause, the other half the else clause

NVIDIA [J. Balfour]
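As referenced above, here is a minimal sketch (not from the slides) of the thread-index-to-warp mapping described in the Organizing Threads into Warps slide. The kernel name warp_layout_demo and the output array warp_of_thread are illustrative; the WARP_SIZE constant reflects the value of 32 mentioned in the slides.

      #define WARP_SIZE 32   // has stayed at 32 across devices and CUDA versions so far

      // Illustrative kernel: compute the linear thread ID inside the block
      // (x runs fastest, then y, then z) and derive the warp and lane indices.
      __global__ void warp_layout_demo(int* warp_of_thread)
      {
          int threads_per_block = blockDim.x * blockDim.y * blockDim.z;

          // 1D projection from the (x, y, z) thread index to the thread ID
          int tid = threadIdx.x
                  + threadIdx.y * blockDim.x
                  + threadIdx.z * blockDim.x * blockDim.y;

          int warp_id = tid / WARP_SIZE;   // threads 0-31 form warp 0, 32-63 form warp 1, ...
          int lane_id = tid % WARP_SIZE;   // position of this thread within its warp

          // record which warp each thread landed in (1D grid assumed)
          int global_id = tid + blockIdx.x * threads_per_block;
          warp_of_thread[global_id] = warp_id;
          (void)lane_id;                   // lane_id shown only for illustration
      }

A hypothetical launch such as warp_layout_demo<<<4, dim3(8, 4, 4)>>>(d_warps) uses blocks of 128 threads, i.e., four full warps per block, in line with the block-size advice from the Filling Warps slide.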
Control Flow Divergence [2/4]
- The system automatically handles control flow divergence: conditions in which threads within a warp execute different paths through the kernel
- Often, this requires that the hardware execute multiple paths through the kernel for a single warp
  - For example, both the if clause and the corresponding else clause

NVIDIA [J. Balfour]

Control Flow Divergence [3/4]

      __global__ void kv(int* x, int* y)
      {
          int i = threadIdx.x + blockDim.x * blockIdx.x;
          int t;
          bool b = f(x[i]);
          if (b) {            // g(x)
              t = g(x[i]);
          }
          else {              // h(x)
              t = h(x[i]);
          }
          y[i] = t;
      }

NVIDIA [J. Balfour]

Control Flow Divergence [4/4]
- Nested branches are handled similarly
  - Deeper nesting results in more threads being temporarily disabled
- In general, one does not need to consider divergence when reasoning about the correctness of a program
  - Certain code constructs, such as those in which threads within a warp spin-wait on a lock, can cause deadlock
- In general, one does need to consider divergence when reasoning about the performance of a program

NVIDIA [J. Balfour]

Performance of Divergent Code [1/2]
- Performance decreases with the degree of divergence within warps
- Here’s an extreme example…

      __global__ void dv(int* x)
      {
          int i = threadIdx.x + blockDim.x * blockIdx.x;
          switch (i % 32) {
              case 0 : x[i] = a(x[i]); break;
              case 1 : x[i] = b(x[i]); break;
              ...
              case 31: x[i] = v(x[i]); break;
          }
      }

NVIDIA [J. Balfour]

Performance of Divergent Code [2/2]
- The compiler and hardware can detect when all threads in a warp branch in the same direction
  - For example, all take the if clause, or all take the else clause
  - The hardware is optimized to handle these cases without loss of performance
- In other words, use of if or switch does not automatically translate into disaster:

      if (threadIdx.x / WARP_SIZE >= 2) { }

  - This creates two different control paths for the threads in a block
  - The branch granularity is a whole multiple of the warp size; all threads in any given warp follow the same path, so there is no warp divergence
- The compiler can also compile short conditional clauses to use predicates (bits that conditionally convert instructions into null ops)
  - This avoids some branch divergence overhead and is more efficient
  - Performance is often acceptable with short conditional clauses

NVIDIA [J. Balfour]

Data Access “Divergence”
- The concept is similar to control divergence, and the two are often conflated
- The hardware is optimized for accessing contiguous blocks of global memory when performing loads and stores
- If a warp doesn’t access a contiguous block of global memory, the effective bandwidth is reduced (a short sketch contrasting contiguous and strided access follows at the end of this preview)
- Remember this: when you look at a kernel, you see what a collection of threads, i.e., a warp, is supposed to do in lockstep fashion

NVIDIA [J. Balfour]

The GPU Memory Ecosystem [Quick Review]
- The significant …
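As referenced in the Data Access “Divergence” slide, here is a minimal sketch (not from the slides) contrasting a contiguous and a strided global memory access pattern. The kernel names copy_contiguous and copy_strided and the stride parameter are illustrative.

      // Contiguous: consecutive threads of a warp touch consecutive addresses,
      // so the warp's loads and stores fall into contiguous blocks of global memory.
      __global__ void copy_contiguous(const float* in, float* out, int n)
      {
          int i = threadIdx.x + blockDim.x * blockIdx.x;
          if (i < n)
              out[i] = in[i];
      }

      // Strided: for stride > 1 the warp's accesses are scattered across memory,
      // so the effective bandwidth of the transaction is reduced.
      __global__ void copy_strided(const float* in, float* out, int n, int stride)
      {
          int i = threadIdx.x + blockDim.x * blockIdx.x;
          if (i < n)
              out[i] = in[i * stride];
      }

Both kernels issue one load and one store per thread; only the addressing pattern differs, which is what determines how many memory transactions the hardware needs per warp.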

