UW-Madison ME 964 - Lecture - High Performance Computing for Engineering Applications


ME964 - High Performance Computing for Engineering Applications
© Dan Negrut, 2012, ME964 UW-Madison

CUDA, Further Details:
- Shared Memory
- Synchronization in CUDA
- Atomic Operations
- CUDA Optimization

March 8, 2012

"There are three rules to follow when parallelizing large codes. Unfortunately, no one knows what these rules are."
-- W. Somerset Maugham and Gary Montry

Before We Get Started...

Last time:
- Scheduling issues
- Thread divergence
- Device memory
- Coalesced memory accesses
- L1 & L2 caches

Today:
- Shared memory, further considerations
- Synchronization issues
- Atomic operations
- CUDA optimization/best-practices issues
- Other issues

Administrative:
- HW6 due at 11:59 PM
- Half-page proposal for your Midterm Project due today
- For the default project (solving a dense banded linear system): no need to submit anything

Shared Memory: Syntax & Semantics

You can statically declare shared memory as in the code snippet below:

    __global__ void coalescedMultiply(float *a, float *b, float *c, int N)
    {
        __shared__ float aTile[TILE_DIM][TILE_DIM];  // TILE_DIM assumed defined elsewhere

        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        float sum = 0.0f;

        // Each thread stages one element of a; consecutive threads read
        // consecutive addresses, so the load is coalesced.
        aTile[threadIdx.y][threadIdx.x] = a[row*TILE_DIM + threadIdx.x];
        // Make sure the tile is fully populated before it is read
        // (strictly required unless TILE_DIM equals the warp size).
        __syncthreads();

        for (int i = 0; i < TILE_DIM; i++) {
            sum += aTile[threadIdx.y][i] * b[i*N + col];
        }
        c[row*N + col] = sum;
    }

The variable aTile is visible to all threads in each block, and only to those threads. The thread that executes the kernel above sees the aTile declaration and understands that all of its sibling threads in the block are going to see it too.
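To see what the kernel's index arithmetic is doing, here is a host-side serial C sketch (not part of the original slides) that mirrors one block's work: phase 1 stages a tile of `a`, phase 2 computes each thread's dot product from the tile. The function name `blockMultiply` and the small `TILE_DIM` of 4 are illustrative assumptions; the kernel would typically use 16 or 32.

```c
#include <assert.h>

#define TILE_DIM 4  /* small tile for illustration; a real kernel would use 16 or 32 */

/* Serial analogue of one thread block of coalescedMultiply: a "block" of
   TILE_DIM x TILE_DIM "threads" stages a tile of `a` and multiplies it
   against `b`, using the same index arithmetic as the kernel above.
   Like the kernel, it assumes `a` is TILE_DIM columns wide and `b`, `c`
   are N columns wide. */
void blockMultiply(const float *a, const float *b, float *c,
                   int N, int blockRow, int blockCol) {
    float aTile[TILE_DIM][TILE_DIM];

    /* Phase 1: every "thread" (ty, tx) copies one element of `a`.
       Consecutive tx read consecutive addresses -> coalesced on the GPU. */
    for (int ty = 0; ty < TILE_DIM; ty++)
        for (int tx = 0; tx < TILE_DIM; tx++) {
            int row = blockRow * TILE_DIM + ty;
            aTile[ty][tx] = a[row * TILE_DIM + tx];
        }

    /* Phase 2 (after the __syncthreads() point): each "thread" accumulates
       its dot product out of the shared tile. */
    for (int ty = 0; ty < TILE_DIM; ty++)
        for (int tx = 0; tx < TILE_DIM; tx++) {
            int row = blockRow * TILE_DIM + ty;
            int col = blockCol * TILE_DIM + tx;
            float sum = 0.0f;
            for (int i = 0; i < TILE_DIM; i++)
                sum += aTile[ty][i] * b[i * N + col];
            c[row * N + col] = sum;
        }
}
```

Running the two loop nests back to back on the CPU makes the barrier's role visible: nothing in phase 2 may start until every element of aTile has been written in phase 1.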
They share this variable collectively. The same thread, when it sees the variable row, understands that it has sole ownership of that variable (it is stored in a register).

Three Ways to Set Aside Shared Memory

First way: statically, declared inside a kernel
- See the previous slide

Second way: through the execution configuration
- Not that common
- Ns below indicates the size (in bytes) to be allocated in shared memory

    __global__ void MyFunc(float*)  // __device__ or __global__ function
    {
        extern __shared__ float shMemArray[];
        // Size of shMemArray is determined through the execution configuration
        // You can use shMemArray as you wish here…
    }

    // invoke like this
    MyFunc<<< Dg, Db, Ns >>>(parameter);

Third way: dynamically, through the CUDA Driver API
- Advanced feature; uses the API function cuFuncSetSharedSize(), not discussed here

Shared Memory Architecture [1/2]

- Common-sense observation: in a parallel machine, many threads access memory at the same time
- To service more than one thread, memory is divided into independent banks; this layout is essential to achieving high bandwidth
- Each SM has its shared memory organized in 32 memory banks
- Recall that shared memory and the L1 cache draw on the same physical memory inside an SM; i.e., they combine for 64 KB
- This physical memory can be partitioned as 48 KB of shared memory and 16 KB of L1 cache, or the other way around (16 KB of shared memory and 48 KB of L1 cache)
- Note: shared memory can store less data than the registers (48 KB vs. 128 KB)

[Figure: shared memory organized as banks 0 through 31]

Shared Memory Architecture [2/2]

- The 32 banks of shared memory are organized like benches in a movie theater:
  - You have multiple rows of benches, and each row has 32 benches
  - In each bench you can "seat" a family of four bytes (32 bits total)
  - A bank is a column of benches in the movie theater, perpendicular to the screen
- Each bank has a bandwidth of 32 bits per two clock cycles

[Figure: SM diagram showing the shared memory alongside the register file, L1 caches, instruction buffer, and MAD/SFU execution units]

Shared Memory: Transaction Rules & Bank Conflicts

- When reading four-byte words, the 32 threads in a warp attempt to access shared memory simultaneously
- Bank conflict: the scenario where two different threads access *different* words in the same bank
- Note that there is no conflict if different threads access any bytes within the same word
- Bank conflicts force the hardware to serialize your shared memory accesses, which adversely impacts bandwidth

Shared Memory Bank Conflicts

- If there are no bank conflicts, shared memory access is fast, though not as fast as register access
  - On the bright side, its latency is roughly 100x lower than global memory latency
- Shared memory access, the fast cases:
  - If all threads of a warp access different banks, there is no bank conflict
  - If all threads of a warp access an identical address for a fetch operation, there is no bank conflict (broadcast)
- Shared memory access, the slow case:
  - Worst case: 32 threads access 32 different words in the same bank
  - The hardware must serialize all the accesses
  - In general, cost = maximum number of simultaneous accesses to a single bank

How Addresses Map to Banks on Fermi

- Successive 32-bit word addresses are assigned to successive banks
- Bank you work with = (word offset) % 32, because Fermi has 32 banks
- Example: a 1D shared memory array myShMem of 1024 floats
  - myShMem[4]: accesses bank #4 (physically the fifth one, in the first row)
  - myShMem[31]: accesses bank #31 (physically the last one, in the first row)
  - myShMem[50]: accesses bank #18 (physically the 19th one, in the second row)
  - myShMem[128]: accesses bank #0 (physically the first one, in the fifth row)
  - myShMem[178]: accesses bank #18 (physically the 19th one, in the sixth row)
- NOTE: if, for instance, the third thread in a warp accesses myShMem[50] and the eighth thread in the warp accesses myShMem[178], then you have a two-way bank conflict and the two transactions are serialized
- IMPORTANT: there is no such thing as a "bank conflict" between threads belonging to different warps

Bank Addressing Examples: Transactions Involving 4-Byte Words

- No bank conflicts: linear addressing with stride == 1 (thread i accesses bank i)
- No bank conflicts: random 1:1 permutation (each thread accesses a different bank)

[Figure: thread-to-bank mappings for the stride-1 and random-permutation cases]

Other Examples

- Two "no conflict" scenarios:
  - Broadcast: all threads in a warp access the same word in a bank
  - Multicast: several threads in a warp access the same word in the same bank

Linear Addressing

Given:

    __shared__ float sharedM[256];
    float foo = sharedM[baseIndex + s * threadIdx.x];

This is bank-conflict-free if s shares no common factors with the number of banks; since there are 32 banks, that means s must be odd.
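The bank-mapping rule and the cost model above can be checked with a small host-side C sketch (not part of the original slides; the helper names `bankOfWord` and `warpCost` are made up for illustration). `bankOfWord` applies the (word offset) % 32 rule, and `warpCost` counts, for a warp doing a strided read, the worst-case number of distinct words that land in one bank, counting identical words once because they broadcast.

```c
#include <assert.h>

#define NUM_BANKS 32  /* Fermi: 32 banks, each serving 4-byte words */

/* Bank serving word (4-byte element) i of a __shared__ float array. */
static int bankOfWord(int i) { return i % NUM_BANKS; }

/* Cost (number of serialized transactions) of one warp access in which
   thread t reads word t*s.  Cost = max number of DISTINCT words that map
   to a single bank; threads hitting the same word broadcast, so an
   identical word is counted only once. */
static int warpCost(int s) {
    int cost = 1;
    for (int bank = 0; bank < NUM_BANKS; bank++) {
        int words[32], nWords = 0;
        for (int t = 0; t < 32; t++) {
            int w = t * s;
            if (bankOfWord(w) != bank) continue;
            int seen = 0;
            for (int k = 0; k < nWords; k++)
                if (words[k] == w) seen = 1;
            if (!seen) words[nWords++] = w;  /* new distinct word in this bank */
        }
        if (nWords > cost) cost = nWords;
    }
    return cost;
}
```

For example, `warpCost(1)` is 1 (stride-1 is conflict-free), `warpCost(2)` is 2 (a two-way conflict), and any odd stride gives cost 1, matching the "s shares no common factors with 32" rule.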

