UW-Madison ME 964 - High Performance Computing for Engineering Applications

Contents

- Slide 1
- Before We Get Started…
- Here's Euler, in Diapers…
- Multiply Using Several Blocks
- Slide 5
- Slide 6
- Synchronization Function
- Slide 8
- The Three Most Important Parallel Memory Spaces
- SM Register File (RF) [Tesla C1060]
- Programmer View of Register File
- Matrix Multiplication Example [Tesla C1060]
- More on Dynamic Partitioning
- Constant Memory
- Slide 15
- Memory Issues Not Addressed Yet…
- Thread Execution Scheduling
- CUDA Thread Block [We already know this…]
- GeForce-8 Series HW Overview
- Scheduling on the HW
- Thread Scheduling/Execution

ME964: High Performance Computing for Engineering Applications
Memory Layout in CUDA
Execution Scheduling in CUDA
February 15, 2011

"They have computers, and they may have other weapons of mass destruction." (Janet Reno, former Attorney General of the United States)

© Dan Negrut, 2011, ME964 UW-Madison

Before We Get Started…

Last time:
- Wrapped up CUDA API short overview
- Started discussion on the memory ecosystem on the GPU card
- Started example of tiled matrix-matrix multiplication
  - Vehicle for introducing the concepts of shared memory and thread synchronization

Today:
- Wrap up tiled matrix-matrix multiplication
- Discuss thread scheduling for execution on the GPU

HW:
- HW4 has been posted. Due date: 02/17, 11:59 PM
- Please indicate your preference for the midterm project on the forum

Here's Euler, in Diapers…

- Andrew and Hammad: the delivery doctors on duty
- 32 Fermi GPUs
- Eight compute nodes, each with two quad-core Intel Xeon 5520
- Hopefully operational upon your return from Spring break
- Hopefully you'll be able to use authentication credentials from Newton to log into Euler

Multiply Using Several Blocks

- One block computes one square sub-matrix Csub of size Block_Size
- One thread computes one element of Csub
- Assume that the dimensions of A and B are multiples of Block_Size and square shape
  - Doesn't have to be like this, but keeps example simpler and focused on the concepts of interest
- [Figure: C is tiled into Block_Size x Block_Size sub-matrices Csub; A is hA x wA, B is wA x wB; thread (tx, ty) of a block computes one element of that block's Csub]
- NOTE: Similar example provided in the CUDA Programming Guide 3.2
  - Available on the class website

Host-side code:

// Thread block size
#define BLOCK_SIZE 16

// Forward declaration of the device multiplication function
__global__ void Muld(float*, float*, int, int, float*);

// Host multiplication function
// Compute C = A * B
//   hA is the height of A
//   wA is the width of A
//   wB is the width of B
void Mul(const float* A, const float* B, int hA, int wA, int wB, float* C)
{
    int size;

    // Load A and B to the device
    float* Ad;
    size = hA * wA * sizeof(float);
    cudaMalloc((void**)&Ad, size);
    cudaMemcpy(Ad, A, size, cudaMemcpyHostToDevice);
    float* Bd;
    size = wA * wB * sizeof(float);
    cudaMalloc((void**)&Bd, size);
    cudaMemcpy(Bd, B, size, cudaMemcpyHostToDevice);

    // Allocate C on the device
    float* Cd;
    size = hA * wB * sizeof(float);
    cudaMalloc((void**)&Cd, size);

    // Compute the execution configuration assuming
    // the matrix dimensions are multiples of BLOCK_SIZE
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(wB / dimBlock.x, hA / dimBlock.y);

    // Launch the device computation
    Muld<<<dimGrid, dimBlock>>>(Ad, Bd, wA, wB, Cd);

    // Read C from the device
    cudaMemcpy(C, Cd, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(Ad);
    cudaFree(Bd);
    cudaFree(Cd);
}
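Before turning to the kernel, here is a quick way to exercise Mul() end to end. The sketch below is not part of the original slides: it assumes it is appended to the same .cu file as Mul() and Muld(), so that BLOCK_SIZE and Mul() are already in scope, and it checks the GPU result against a plain CPU triple loop. The 1e-3 tolerance is an arbitrary illustrative choice for single precision.

// Minimal test driver (illustrative sketch, not from the slides).
// Assumes it is placed in the same .cu file, after the definitions of
// Mul() and Muld(), so BLOCK_SIZE and Mul() are visible here.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <vector>

int main()
{
    // Dimensions are multiples of BLOCK_SIZE, as the execution configuration
    // in Mul() requires; they need not be equal to each other.
    const int hA = 4 * BLOCK_SIZE;
    const int wA = 6 * BLOCK_SIZE;
    const int wB = 8 * BLOCK_SIZE;

    std::vector<float> A(hA * wA), B(wA * wB), C(hA * wB), Cref(hA * wB, 0.0f);

    // Fill A and B with pseudo-random values in [0, 1)
    for (size_t i = 0; i < A.size(); ++i) A[i] = (rand() % 1000) / 1000.0f;
    for (size_t i = 0; i < B.size(); ++i) B[i] = (rand() % 1000) / 1000.0f;

    // GPU result
    Mul(A.data(), B.data(), hA, wA, wB, C.data());

    // CPU reference: Cref = A * B
    for (int i = 0; i < hA; ++i)
        for (int k = 0; k < wA; ++k)
            for (int j = 0; j < wB; ++j)
                Cref[i * wB + j] += A[i * wA + k] * B[k * wB + j];

    // Report the largest absolute difference between the two results
    float maxErr = 0.0f;
    for (int i = 0; i < hA * wB; ++i)
        maxErr = std::max(maxErr, std::fabs(C[i] - Cref[i]));
    printf("max abs error: %g\n", maxErr);

    return (maxErr < 1e-3f) ? 0 : 1;
}

Checking a freshly ported kernel against a slow but obviously correct CPU reference like this is a common sanity test before moving on to performance tuning.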
Device-side code:

// Device multiplication function called by Mul()
// Compute C = A * B
//   wA is the width of A
//   wB is the width of B
__global__ void Muld(float* A, float* B, int wA, int wB, float* C)
{
    // Block index
    int bx = blockIdx.x;  // the B (and C) matrix sub-block column index
    int by = blockIdx.y;  // the A (and C) matrix sub-block row index

    // Thread index
    int tx = threadIdx.x; // the column index in the sub-block
    int ty = threadIdx.y; // the row index in the sub-block

    // Index of the first sub-matrix of A processed by the block
    int aBegin = wA * BLOCK_SIZE * by;

    // Index of the last sub-matrix of A processed by the block
    int aEnd = aBegin + wA - 1;

    // Step size used to iterate through the sub-matrices of A
    int aStep = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * bx;

    // Step size used to iterate through the sub-matrices of B
    int bStep = BLOCK_SIZE * wB;

    // The element of the block sub-matrix that is computed by the thread
    float Csub = 0;

    // Loop over all the sub-matrices of A and B required to
    // compute the block sub-matrix
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {

        // Shared memory for the sub-matrix of A
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

        // Shared memory for the sub-matrix of B
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load the matrices from global memory to shared memory;
        // each thread loads one element of each matrix
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];

        // Synchronize to make sure the matrices are loaded
        __syncthreads();

        // Multiply the two matrices together;
        // each thread computes one element of the block sub-matrix
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ty][k] * Bs[k][tx];

        // Synchronize to make sure that the preceding computation is done
        // before loading two new sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    // Write the block sub-matrix to global memory;
    // each thread writes one element
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;
}

Synchronization Function

- It's a lightweight device runtime API function:
  void __syncthreads();
- Synchronizes all threads in a block (acts as a barrier for all threads of the block)
- Once all threads have reached this point, execution resumes normally
- Used to avoid RAW/WAR/WAW hazards when accessing shared or global memory
- Allowed in conditional constructs only if the conditional is uniform across the entire thread block

The Shared Memory in the Context of the SM Memory Architecture [NVIDIA G80]

- Threads in a Block:
  - Cooperate through data accessible to all of them, both in Global Memory and in Shared Memory
  - Synchronize at barrier instructions
- Shared Memory is very good:
  - Keeps data close to the processor (low latency)
  - Minimizes trips to global memory
- Dynamically allocated at the SM level to each Block
- One of the limiting resources
- [Figure: threads t0, t1, t2, … of a block sharing the SM's Shared Memory]
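To make the barrier rules above concrete, here is a minimal sketch of a block-wide sum in shared memory; it is not taken from the slides, and blockSum and BLOCK are illustrative names. The __syncthreads() calls sit outside the non-uniform conditionals, so every thread of the block reaches each barrier; the conditionals only guard the work done between barriers, and the first barrier removes the RAW hazard on the shared buffer.

#define BLOCK 256  // threads per block; assumed to be a power of two

__global__ void blockSum(const float* in, float* blockSums, int n)
{
    __shared__ float buf[BLOCK];

    int gid = blockIdx.x * BLOCK + threadIdx.x;

    // Guarded load into shared memory; out-of-range threads contribute 0.0f
    buf[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;

    // Barrier: every element of buf must be written before any thread reads it (RAW)
    __syncthreads();

    // Tree reduction. The "threadIdx.x < stride" test is not uniform across the
    // block, but __syncthreads() itself sits outside the if, so all threads of
    // the block reach every barrier.
    for (int stride = BLOCK / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            buf[threadIdx.x] += buf[threadIdx.x + stride];
        __syncthreads();
    }

    // Thread 0 writes this block's partial sum to global memory
    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = buf[0];
}

A hypothetical launch would use one thread per input element, e.g. blockSum<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(d_in, d_sums, n), with the per-block partial sums then combined on the host or in a second kernel.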

