Duke CPS 296.1 - Survey of General-Purpose Computation on GPU


Survey of General-Purpose Computation on GPU and Introduction to CUDA
Jie Xu and Cong Wu, adapted from Wen-mei W. Hwu, Taiwan.
Slides © David Kirk/NVIDIA and Wen-mei W. Hwu, Taiwan, June 30 to July 2, 2008.

Introduction to GPGPU
• Powerful and inexpensive
  – Semiconductor capability, driven by advances in fabrication technology and by the market
• Flexible and programmable
• Limitations and difficulties

Applications
• Physically based simulation
• Signal and image processing
  – Image segmentation
  – Computer vision
  – Image processing
• Geometric computing

Design Philosophies Are Different
• The GPU is specialized for compute-intensive, massively data-parallel computation (exactly what graphics rendering is about)
  – So, more transistors can be devoted to data processing rather than to data caching and flow control
[Figure: CPU vs. GPU chip layout – the CPU spends area on control logic and cache in front of a few ALUs; the GPU spends it on many ALUs; both sit on top of DRAM.]

• Significant application-level speedup over uni-processor execution
  – No more "killer micros"
• Easy entrance
  – An initial, naïve code typically gets at least a 2-3X speedup
• Wide availability to end users
  – Available on laptops, desktops, clusters, and super-computers

GPGPU Movement
• General-purpose computation using the GPU in applications other than 3D graphics
  – The GPU accelerates the critical path of the application
• Data-parallel algorithms leverage GPU attributes
  – Large data arrays, streaming throughput
  – Fine-grain SIMD parallelism
  – Low-latency floating-point (FP) computation
• Applications – see GPGPU.org
  – Game effects (FX) physics, image processing
  – Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting

GPGPU Constraints
• Dealing with the graphics API
  – Working with the corner cases of the graphics API
• Addressing modes
  – Limited texture size/dimension
• Shader capabilities
  – Limited outputs
• Instruction sets
  – Lack of integer & bit ops
[Figure: the legacy fragment-program model – Input Registers feed a Fragment Program, which reads Texture and writes Output Registers and FB Memory.]
• These have all changed with CUDA!

CUDA – C with No Shader Limitations!
• Integrated host+device app C program
  – Serial or modestly parallel parts in host C code
  – Highly parallel parts in device SPMD kernel C code
[Figure: execution alternates – Serial Code (host), then Parallel Kernel KernelA<<<nBlk, nTid>>>(args) on the device, then Serial Code (host), then Parallel Kernel KernelB<<<nBlk, nTid>>>(args).]

CUDA Extends C
• Declspecs
  – global, device, shared, local, constant
• Keywords
  – threadIdx, blockIdx
• Intrinsics
  – __syncthreads
• Runtime API
  – Memory, symbol, execution management
• Function launch

  __device__ float filter[N];

  __global__ void convolve(float *image) {
      __shared__ float region[M];
      ...
      region[threadIdx.x] = image[i];
      __syncthreads();
      ...
      image[j] = result;
  }

  // Allocate GPU memory
  float *myimage;
  cudaMalloc((void**)&myimage, bytes);

  // 100 blocks, 10 threads per block
  convolve<<<100, 10>>>(myimage);

Arrays of Parallel Threads
• A CUDA kernel is executed by an array of threads
  – All threads run the same code (SPMD)
  – Each thread has an ID that it uses to compute memory addresses and make control decisions
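The deck does not show the ID-to-address computation by itself, so here is a minimal sketch of the idea (the kernel name scaleArray, the length parameter n, and the 256-thread launch configuration are assumptions, not from the slides): each thread folds its block and thread IDs into one unique global index and guards against running past the end of the data.

  // Sketch only: illustrates SPMD indexing, not taken from the slides
  __global__ void scaleArray(float *data, float factor, int n)
  {
      // Each thread computes a unique global index from its IDs
      int i = blockIdx.x * blockDim.x + threadIdx.x;

      // The last block may extend past the array, so bounds-check
      if (i < n)
          data[i] *= factor;
  }

  // Example launch: enough 256-thread blocks to cover n elements
  // scaleArray<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);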
A Simple Running Example: Matrix Multiplication
• P = M * N of size WIDTH x WIDTH
• Without tiling:
  – One thread calculates one element of P
  – M and N are loaded WIDTH times from global memory
[Figure: M, N, and P drawn as WIDTH x WIDTH matrices; a row of M and a column of N combine into one element of P.]

Step 1: Matrix Multiplication – A Simple Host Version in C

  // Matrix multiplication on the (CPU) host in double precision
  void MatrixMulOnHost(float* M, float* N, float* P, int Width)
  {
      for (int i = 0; i < Width; ++i)
          for (int j = 0; j < Width; ++j) {
              double sum = 0;
              for (int k = 0; k < Width; ++k) {
                  double a = M[i * Width + k];
                  double b = N[k * Width + j];
                  sum += a * b;
              }
              P[i * Width + j] = sum;
          }
  }

Step 2: Input Matrix Data Transfer (Host-side Code)

  void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
  {
      int size = Width * Width * sizeof(float);
      float *Md, *Nd, *Pd;
      ...

      // 1. Allocate and load M, N to device memory
      cudaMalloc((void**)&Md, size);
      cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
      cudaMalloc((void**)&Nd, size);
      cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

      // Allocate P on the device
      cudaMalloc((void**)&Pd, size);

Step 3: Output Matrix Data Transfer (Host-side Code)

      // 2. Kernel invocation code – to be shown later
      ...

      // 3. Read P from the device
      cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

      // Free device matrices
      cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
  }

Step 4: Kernel Function

  // Matrix multiplication kernel – per-thread code
  __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
  {
      // 2D thread ID
      int tx = threadIdx.x;
      int ty = threadIdx.y;

      // Pvalue is used to store the element of the matrix
      // that is computed by the thread
      float Pvalue = 0;

      for (int k = 0; k < Width; ++k) {
          float Melement = Md[ty * Width + k];
          float Nelement = Nd[k * Width + tx];
          Pvalue += Melement * Nelement;
      }

      // Write the matrix to device memory;
      // each thread writes one element
      Pd[ty * Width + tx] = Pvalue;
  }

[Figure: thread (tx, ty) walks row ty of Md and column tx of Nd to produce one element of Pd.]

Step 5: Kernel Invocation (Host-side Code)

      // Setup the execution configuration
      dim3 dimBlock(Width, Width);
      dim3 dimGrid(1, 1);

      // Launch the device computation threads!
      MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

Only One Thread Block Used
• One block of threads computes matrix Pd
  – Each thread computes one element of Pd
• Each thread
  – Loads a row of matrix Md
  – Loads a column of matrix Nd
  – Performs one multiply and one addition for each pair of Md and Nd elements
  – Compute to off-chip memory access ratio close to 1:1 (not very high)
• Size of matrix limited by the number of threads allowed in a thread block; see the sketch below for one way past this limit
[Figure: Grid 1 contains a single Block 1; Thread (2, 2) reads row 2 of Md and column 2 of Nd to produce one element of Pd.]
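The deck stops at the single-block version, so the following is only a sketch of the standard way to lift that limit (the kernel name MatrixMulKernelMultiBlock, the helper LaunchMatrixMul, and the tile size TILE are assumptions, not from the slides): tile Pd with a 2D grid of blocks and let each thread derive its row and column from both blockIdx and threadIdx.

  #define TILE 16   // assumed block edge; any value within hardware limits works

  __global__ void MatrixMulKernelMultiBlock(float* Md, float* Nd, float* Pd, int Width)
  {
      // Row/column now combine the block index and the thread index
      int row = blockIdx.y * blockDim.y + threadIdx.y;
      int col = blockIdx.x * blockDim.x + threadIdx.x;

      if (row < Width && col < Width) {   // guard partial blocks at the edges
          float Pvalue = 0;
          for (int k = 0; k < Width; ++k)
              Pvalue += Md[row * Width + k] * Nd[k * Width + col];
          Pd[row * Width + col] = Pvalue;
      }
  }

  // Host side: a 2D grid of TILE x TILE blocks covering the whole matrix
  void LaunchMatrixMul(float* Md, float* Nd, float* Pd, int Width)
  {
      dim3 dimBlock(TILE, TILE);
      dim3 dimGrid((Width + TILE - 1) / TILE, (Width + TILE - 1) / TILE);
      MatrixMulKernelMultiBlock<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
  }

With this layout the matrix size is bounded by the grid dimensions rather than by the threads-per-block ceiling; the compute to off-chip memory access ratio is still roughly 1:1, which is what the tiled, shared-memory version of this example later improves.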

