GT CS 4803 - CUDA Device Memory Space Review

Spring 2011, Prof. Hyesoon Kim

• Each thread can:
– R/W per-thread registers
– R/W per-thread local memory
– R/W per-block shared memory
– R/W per-grid global memory
– Read only per-grid constant memory
– Read only per-grid texture memory
• The host can R/W global, constant, and texture memories

[Figure: the CUDA device memory model. A grid contains blocks, each with its own shared memory; each thread has its own registers and local memory; global, constant, and texture memory are shared across the grid and accessible from the host.]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, UIUC

• Register – dedicated HW - single cycle
• Shared Memory – dedicated HW - single cycle
• Local Memory – DRAM, no cache - *slow*
• Global Memory – DRAM, no cache - *slow*
• Constant Memory – DRAM, cached, 1 to 10s to 100s of cycles, depending on cache locality
• Texture Memory – DRAM, cached, 1 to 10s to 100s of cycles, depending on cache locality
• Instruction Memory (invisible) – DRAM, cached

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, UIUC
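To make the qualifiers behind these spaces concrete, here is a minimal CUDA sketch; all names (spaces, gdata, coeff, tile, scratch) are made up for illustration, and texture memory is omitted since it is accessed through texture references rather than a variable qualifier:

// Minimal sketch of where CUDA variables live; names are hypothetical.
// Assumes it is launched with blockDim.x <= 256.

__constant__ float coeff[16];          // per-grid constant memory (read-only in kernels)

__global__ void spaces(float* gdata)   // gdata points into per-grid global memory
{
    int idx = threadIdx.x;             // scalar locals normally live in registers
    float scratch[64];                 // large per-thread arrays may spill to local memory
    __shared__ float tile[256];        // per-block shared memory

    tile[idx] = gdata[idx] + coeff[idx % 16];
    __syncthreads();                   // tile is visible to the whole block after this
    scratch[idx % 64] = tile[idx];
    gdata[idx] = scratch[idx % 64];
}

The host side would fill coeff with cudaMemcpyToSymbol and gdata with cudaMemcpy, matching the point above that the host reads and writes the global, constant, and texture memories.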
• All threads access global memory for their input matrix elements
– Two memory accesses (8 bytes) per floating-point multiply-add, i.e., 4 bytes of memory traffic per FLOP
– At 86.4 GB/s, this limits the code to 86.4 / 4 = 21.6 GFLOPS
• The actual code should run at about 15 GFLOPS
• Need to drastically cut down memory accesses to get closer to the peak 346.5 GFLOPS

[Figure: device hardware. N multiprocessors, each with M processors and their registers, shared memory, an instruction unit, a constant cache, and a texture cache; device memory holds the global, constant, and texture memories.]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, UIUC

• Each input element is read by WIDTH threads.
• If we load each element into shared memory and have several threads use the local version, we can drastically reduce the memory bandwidth
– Load the whole matrix? No – tiled algorithms
• Pattern
– Copy data from global to shared memory
– Synchronization
– Computation (iteration)
– Synchronization
– Copy data from shared to global memory

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, UIUC

Consider A, B, C to be N by N matrices of b by b subblocks, where b = n / N is called the block size (n is the full matrix dimension, N the number of blocks per dimension):

for i = 1 to N
  for j = 1 to N
    {read block C(i,j) into shared memory}
    for k = 1 to N
      {read block A(i,k) into shared memory}
      {read block B(k,j) into shared memory}
      C(i,j) = C(i,j) + A(i,k) * B(k,j)   {do a matrix multiply on blocks}
    {write block C(i,j) back to global memory}

[Figure sequence: C(i,j) = C(i,j) + A(i,k) * B(k,j); the animation steps through C(1,1) += A(1,1)*B(1,1), then A(1,2)*B(2,1), then A(1,3)*B(3,1), followed by C(1,2) += A(1,1)*B(1,2), A(1,2)*B(2,2), A(1,3)*B(3,2).]

www.sdsc.edu/~allans/cs260/lectures/matmul.ppt

• One block computes one square sub-matrix Psub of size BLOCK_SIZE
• One thread computes one element of Psub
• Assume that the dimensions of M and N are multiples of BLOCK_SIZE and square

[Figure: P is partitioned into BLOCK_SIZE x BLOCK_SIZE tiles of the WIDTH x WIDTH matrix; (bx, by) select the tile Psub and (tx, ty), each in 0..bsize-1, select one element within it.]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, UIUC

• Each SM has 16KB shared memory
– Each Thread Block uses 2*256*4B = 2KB of shared memory [2: two matrices; 256 = 16*16 threads; 4B per single-precision float]
– Can potentially have up to 8 Thread Blocks actively executing
– Initial load: for BLOCK_SIZE = 16, this allows up to 8*512 = 4,096 pending loads (8 blocks, 2 loads * 256 threads)
– In practice, the count will probably reach only about half of this, because of scheduling to keep the SPs busy
– The next BLOCK_SIZE, 32, would lead to 2*32*32*4B = 8KB of shared memory per Thread Block, allowing only up to two Thread Blocks to be active at the same time

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, UIUC

// Setup the execution configuration
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(N.width / dimBlock.x, M.height / dimBlock.y);

For very large N and M dimensions, one will need to add another level of blocking and execute the second-level blocks sequentially.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, UIUC

// Block index
int bx = blockIdx.x;
int by = blockIdx.y;
// Thread index
int tx = threadIdx.x;
int ty = threadIdx.y;

// Pvalue stores the element of the block sub-matrix
// that is computed by the thread
float Pvalue = 0;

// Loop over all the sub-matrices of M and N
// required to compute the block sub-matrix
for (int m = 0; m < M.width / BLOCK_SIZE; ++m) {
    // code from the next few slides
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, UIUC

// Get a pointer to the current sub-matrix Msub of M
Matrix Msub = GetSubMatrix(M, m, by);
// Get a pointer to the current sub-matrix Nsub of N
Matrix Nsub = GetSubMatrix(N, bx, m);

__shared__ float Ms[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float Ns[BLOCK_SIZE][BLOCK_SIZE];

// Each thread loads one element of each sub-matrix
Ms[ty][tx] = GetMatrixElement(Msub, tx, ty);
Ns[ty][tx] = GetMatrixElement(Nsub, tx, ty);

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, UIUC

// Get the BLOCK_SIZE x BLOCK_SIZE sub-matrix Asub of A that is
// located col sub-matrices to the right and row sub-matrices down
// from the upper-left corner of A
__device__ Matrix GetSubMatrix(Matrix A, const int col, const int row)
{
    Matrix Asub;
    Asub.width = BLOCK_SIZE;
    Asub.height = BLOCK_SIZE;
    Asub.stride = A.stride;
    Asub.elements = &A.elements[A.stride * BLOCK_SIZE * row
                                + BLOCK_SIZE * col];
    return Asub;
}

// Synchronize to make sure the sub-matrices are loaded
// before starting the computation
__syncthreads();

// Each thread computes one element of the block sub-matrix
for (int k = 0; k < BLOCK_SIZE; ++k)
    Pvalue += Ms[ty][k] * Ns[k][tx];

// Synchronize to make sure that the preceding
// computation is done before loading two new
// sub-matrices of M and N in the next iteration
__syncthreads();

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, UIUC

// Get a pointer to the block sub-matrix of P
Matrix Psub = GetSubMatrix(P, bx, by);
// Write the block sub-matrix to device memory;
// each thread writes one element
SetMatrixElement(Psub, tx, ty, Pvalue);

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, UIUC

• Macro functions will be provided.
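The code above relies on a Matrix type and the GetMatrixElement / SetMatrixElement helpers that the slides leave to the provided macros. A minimal sketch of definitions consistent with GetSubMatrix above, assuming row-major storage with a stride field; the macros actually distributed with the assignment may differ:

// Hypothetical definitions, consistent with the (col, row) argument
// order used above; the real provided macros may differ.
typedef struct {
    int width;        // columns in this (sub-)matrix
    int height;       // rows in this (sub-)matrix
    int stride;       // row stride of the underlying full matrix
    float* elements;  // pointer into device memory
} Matrix;

__device__ float GetMatrixElement(Matrix A, int col, int row)
{
    return A.elements[row * A.stride + col];
}

__device__ void SetMatrixElement(Matrix A, int col, int row, float value)
{
    A.elements[row * A.stride + col] = value;
}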
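For reference, the fragments assemble into one kernel roughly as follows. This is a sketch under the assumptions already stated (square matrices, dimensions that are exact multiples of BLOCK_SIZE, and the hypothetical Matrix helpers above), not the exact code distributed with the assignment:

#define BLOCK_SIZE 16

__global__ void MatrixMulKernel(Matrix M, Matrix N, Matrix P)
{
    // Block and thread indices
    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Pvalue accumulates the one element of P computed by this thread
    float Pvalue = 0;

    // Walk the tiles of M (along its width) and N (along its height)
    for (int m = 0; m < M.width / BLOCK_SIZE; ++m) {
        Matrix Msub = GetSubMatrix(M, m, by);   // tile (col m, row by) of M
        Matrix Nsub = GetSubMatrix(N, bx, m);   // tile (col bx, row m) of N

        __shared__ float Ms[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Ns[BLOCK_SIZE][BLOCK_SIZE];

        // Each thread loads one element of each tile
        Ms[ty][tx] = GetMatrixElement(Msub, tx, ty);
        Ns[ty][tx] = GetMatrixElement(Nsub, tx, ty);

        __syncthreads();   // tiles fully loaded before any thread reads them

        for (int k = 0; k < BLOCK_SIZE; ++k)
            Pvalue += Ms[ty][k] * Ns[k][tx];

        __syncthreads();   // all reads done before the tiles are overwritten
    }

    // Each thread writes its element of the output tile
    Matrix Psub = GetSubMatrix(P, bx, by);
    SetMatrixElement(Psub, tx, ty, Pvalue);
}

Launched with the execution configuration shown earlier (dimGrid of N.width / BLOCK_SIZE by M.height / BLOCK_SIZE, dimBlock of BLOCK_SIZE by BLOCK_SIZE), each block of the grid produces one BLOCK_SIZE x BLOCK_SIZE tile of P.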

