ME964
High Performance Computing for Engineering Applications

Memory Layout in CUDA
Execution Scheduling in CUDA

February 15, 2011

"They have computers, and they may have other weapons of mass destruction." Janet Reno, former Attorney General of the United States

© Dan Negrut, 2011, ME964, UW-Madison


Before We Get Started...

Last time:
- Wrapped up the short overview of the CUDA API
- Started the discussion of the memory ecosystem on the GPU card
- Started the tiled matrix-matrix multiplication example, our vehicle for introducing the concepts of shared memory and thread synchronization

Today:
- Wrap up tiled matrix-matrix multiplication
- Discuss thread scheduling for execution on the GPU

HW:
- HW4 has been posted.
- Due date: 02/17, 11:59 PM
- Please indicate your preference for the midterm project on the forum


Here's Euler, in Diapers...

- Andrew and Hammad are the delivery doctors on duty
- 32 Fermi GPUs
- Eight compute nodes, each with two quad-core Intel Xeon 5520 processors
- Hopefully operational upon your return from Spring break
- Hopefully you'll be able to use your authentication credentials from Newton to log into Euler


Multiply Using Several Blocks

- One block computes one square sub-matrix Csub of C, of size Block_Size
- One thread computes one element of Csub
- Assume that A and B are square and that their dimensions are multiples of Block_Size
  - It doesn't have to be like this, but it keeps the example simpler and focused on the concepts of interest

[Figure: C tiled into Block_Size x Block_Size sub-matrices Csub; A is hA x wA, B is wA x wB; thread (tx, ty) of a block computes one element of its Csub]

NOTE: A similar example is provided in the CUDA Programming Guide 3.2, available on the class website.

// Thread block size
#define BLOCK_SIZE 16

// Forward declaration of the device multiplication function
__global__ void Muld(float*, float*, int, int, float*);

// Host multiplication function
// Compute C = A * B
//   hA is the height of A
//   wA is the width of A
//   wB is the width of B
void Mul(const float* A, const float* B, int hA, int wA, int wB, float* C)
{
    int size;

    // Load A and B to the device
    float* Ad;
    size = hA * wA * sizeof(float);
    cudaMalloc((void**)&Ad, size);
    cudaMemcpy(Ad, A, size, cudaMemcpyHostToDevice);

    float* Bd;
    size = wA * wB * sizeof(float);
    cudaMalloc((void**)&Bd, size);
    cudaMemcpy(Bd, B, size, cudaMemcpyHostToDevice);

    // Allocate C on the device
    float* Cd;
    size = hA * wB * sizeof(float);
    cudaMalloc((void**)&Cd, size);

    // Compute the execution configuration assuming
    // the matrix dimensions are multiples of BLOCK_SIZE
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(wB / dimBlock.x, hA / dimBlock.y);

    // Launch the device computation
    Muld<<<dimGrid, dimBlock>>>(Ad, Bd, wA, wB, Cd);

    // Read C from the device
    cudaMemcpy(C, Cd, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(Ad);
    cudaFree(Bd);
    cudaFree(Cd);
}

// Device multiplication function called by Mul()
// Compute C = A * B
//   wA is the width of A
//   wB is the width of B
__global__ void Muld(float* A, float* B, int wA, int wB, float* C)
{
    // Block index
    int bx = blockIdx.x;  // the B (and C) matrix sub-block column index
    int by = blockIdx.y;  // the A (and C) matrix sub-block row index

    // Thread index
    int tx = threadIdx.x; // the column index in the sub-block
    int ty = threadIdx.y; // the row index in the sub-block

    // Index of the first sub-matrix of A processed by the block
    int aBegin = wA * BLOCK_SIZE * by;

    // Index of the last sub-matrix of A processed by the block
    int aEnd = aBegin + wA - 1;

    // Step size used to iterate through the sub-matrices of A
    int aStep = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * bx;

    // Step size used to iterate through the sub-matrices of B
    int bStep = BLOCK_SIZE * wB;

    // The element of the block sub-matrix that is computed by the thread
    float Csub = 0;

    // Loop over all the sub-matrices of A and B required to
    // compute the block sub-matrix
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {

        // Shared memory for the sub-matrix of A
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

        // Shared memory for the sub-matrix of B
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load the matrices from global memory to shared memory;
        // each thread loads one element of each matrix
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];

        // Synchronize to make sure the matrices are loaded
        __syncthreads();

        // Multiply the two matrices together;
        // each thread computes one element of the block sub-matrix
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ty][k] * Bs[k][tx];

        // Synchronize to make sure that the preceding computation is done
        // before loading two new sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    // Write the block sub-matrix to global memory;
    // each thread writes one element
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;
}


Synchronization Function

- void __syncthreads(); is a lightweight device runtime API function
- Synchronizes all threads in a block (acts as a barrier for all threads of the block)
- Once all threads have reached this point, execution resumes normally
- Used to avoid RAW/WAR/WAW hazards when accessing shared or global memory
- Allowed in conditional constructs only if the conditional is uniform across the entire thread block


The Shared Memory in the Context of the SM Memory Architecture [NVIDIA G80]

Threads in a block:
- Cooperate through data accessible to all of them, both in global memory and in shared memory
- Synchronize at barrier instructions

Shared memory is very good:
- Keeps data close to the processor (low latency)
- Minimizes trips to global memory
- Dynamically allocated at the SM level to each block
- One of the limiting resources

[Figure: threads t0, t1, t2, … of a block sharing data through the SM's shared memory]