ME964 High Performance Computing for Engineering Applications
© Dan Negrut, 2012, ME964 UW-Madison

CUDA, Further Details: Shared Memory, Synchronization in CUDA, Atomic Operations, CUDA Optimization
March 8, 2012

"There are three rules to follow when parallelizing large codes. Unfortunately, no one knows what these rules are."
-- W. Somerset Maugham and Gary Montry

Before We Get Started...

Last time:
- Scheduling issues
- Thread divergence
- Device memory
- Coalesced memory accesses
- L1 & L2 caches

Today:
- Shared memory, further considerations
- Synchronization issues
- Atomic operations
- CUDA optimization / best-practices issues

Other issues:
- HW6 due at 11:59 PM
- Half-page proposal for your Midterm Project due today
- For the default project (solving a dense banded linear system): no need to submit anything

Shared Memory: Syntax & Semantics

You can statically declare shared memory as in the code snippet below:

    __global__ void coalescedMultiply(float *a, float *b, float *c, int N)
    {
        __shared__ float aTile[TILE_DIM][TILE_DIM];

        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        float sum = 0.0f;
        // Each thread loads one element of the tile; no __syncthreads() follows
        // because each thread reads only data written by its own warp
        // (this assumes TILE_DIM == blockDim.x == warpSize)
        aTile[threadIdx.y][threadIdx.x] = a[row*TILE_DIM + threadIdx.x];
        for (int i = 0; i < TILE_DIM; i++) {
            sum += aTile[threadIdx.y][i] * b[i*N + col];
        }
        c[row*N + col] = sum;
    }

- The variable aTile is visible to all threads in each block, and only to those threads
- The thread that executes the kernel above sees the aTile declaration and understands that all its sibling threads in the block are going to see it too
- They share this variable collectively
- The same thread, when it sees the variable row, understands that it has sole ownership of this variable (the variable is stored in a register)

3 Ways to Set Aside Shared Memory

First way: statically, declared inside a kernel
- See previous slide...

Second way: through the execution configuration
- Not that common
- Ns below indicates the size (in bytes) to be allocated in shared memory

    __global__ void MyFunc(float*)  // __device__ or __global__ function
    {
        extern __shared__ float shMemArray[];
        // Size of shMemArray determined through the execution configuration
        // You can use shMemArray as you wish here...
    }

    // invoke like this
    MyFunc<<< Dg, Db, Ns >>>(parameter);

Third way: dynamically, through the CUDA Driver API
- Advanced feature; uses the API function cuFuncSetSharedSize(), not discussed here

Shared Memory Architecture [1/2]

- Common-sense observation: in a parallel machine, many threads access memory at the same time
  - To service more than one thread, memory is divided into independent banks
  - This layout is essential to achieving high bandwidth
- Each SM has its ShMem organized in 32 memory banks
- Recall that shared memory and the L1 cache draw on the same physical memory inside an SM; i.e., they combine for 64 KB
  - This physical memory can be partitioned as 48 KB of ShMem and 16 KB of L1 cache, or the other way around
- Note: shared memory can store less data than the registers (48 KB vs. 128 KB)

[Figure: diagram of shared memory banks, Bank 0 through Bank 31]

Shared Memory Architecture [2/2]

- The 32 banks of the shared memory are organized like benches in a movie theater:
  - You have multiple rows of benches
  - Each row has 32 benches
  - In each bench you can "seat" a family of four bytes (32 bits total)
  - Note that a bank represents a column of benches in the movie theater, perpendicular to the screen
- Each bank has a bandwidth of 32 bits per two clock cycles

[Figure: SM block diagram - I$ L1, multithreaded instruction buffer, register file, C$ L1, shared memory, operand select, MAD/SFU units; banks 0-31]

Shared Memory: Transaction Rules & Bank Conflicts

- When reading four-byte words, the 32 threads in a warp attempt to access shared memory simultaneously
- Bank conflict: the scenario where two different threads access *different* words in the same bank
  - Note that there is no conflict if different threads access any bytes within the same word
- Bank conflicts force the hardware to serialize your ShMem accesses, which adversely impacts bandwidth

Shared Memory Bank Conflicts

- If there are no bank conflicts, shared memory access is fast, but not as fast as register access
  - On the bright side, latency is roughly 100x lower than global memory latency
- Shared memory access, the fast case:
  - If all threads of a warp access different banks, there is no bank conflict
  - If all threads of a warp access an identical address for a fetch operation, there is no bank conflict (broadcast)
- Shared memory access, the slow case:
  - Worst case: 32 threads access 32 different words in the same bank
  - All the accesses must be serialized
  - In general, cost = max # of simultaneous accesses to a single bank

How Addresses Map to Banks on Fermi

- Successive 32-bit word addresses are assigned to successive banks
- Bank you work with = (word offset) % 32
  - This is because Fermi has 32 banks
- Example: 1D shared memory array myShMem of 1024 floats
  - myShMem[4]: accesses bank #4 (physically, the fifth one - first row)
  - myShMem[31]: accesses bank #31 (physically, the last one - first row)
  - myShMem[50]: accesses bank #18 (physically, the 19th one - second row)
  - myShMem[128]: accesses bank #0 (physically, the first one - fifth row)
  - myShMem[178]: accesses bank #18 (physically, the 19th one - sixth row)
- NOTE: If, for instance, the third thread in a warp accesses myShMem[50] and the eighth thread in the warp accesses myShMem[178], you have a two-way bank conflict and the two transactions get serialized
- IMPORTANT: there is no such thing as a "bank conflict" between threads belonging to different warps

Bank Addressing Examples: Transactions Involving 4-Byte Words

- No bank conflicts: linear addressing, stride == 1 (thread i accesses bank i)
- No bank conflicts: random 1:1 permutation (each thread accesses a different bank)

[Figure: thread-to-bank mapping diagrams, Thread 0-31 mapped onto Bank 0-31]

Other Examples

Two "no conflict" scenarios:
- Broadcast: all threads in a warp access the same word in a bank
- Multicast: several threads in a warp access the same word in the same bank

Linear Addressing

Given:

    __shared__ float sharedM[256];
    float foo = sharedM[baseIndex + s * threadIdx.x];

This is bank-conflict-free if s shares no common factors with the number of banks; since Fermi has 32 banks, s must be odd.