Contents
• Nvidia CUDA Programming Basics
• Overview
• CUDA Programming Model
• CUDA Memory Model
• CUDA API Basics
• Example: Scalar Product
• A CUDA implementation of the Gravit
• Basic Implementation
• CPU/GPU Comparison
• Spatial Subdivision
• Conclusion / Future Work
• Questions?

Nvidia CUDA Programming Basics
Xiaoming Li
Department of Electrical and Computer Engineering
University of Delaware

Overview
• The programming model
• The memory model
• CUDA API basics
• A simple example for a kernel function
• Optimization of Gravit

CUDA Programming Model
• The GPU is seen as a compute device that executes a portion of an application which
  – has to be executed many times,
  – can be isolated as a function,
  – works independently on different data.
• Such a function can be compiled to run on the device. The resulting program is called a kernel.

CUDA Programming Model
• The batch of threads that executes a kernel is organized as a grid of thread blocks.

CUDA Programming Model
• Thread block
  – Batch of threads that can cooperate together:
    • fast shared memory,
    • synchronizable,
    • thread ID.
  – A block can be a one-, two- or three-dimensional array of threads.

CUDA Programming Model
• Grid of thread blocks
  – A block holds only a limited number of threads.
  – A grid allows a larger number of threads to execute the same kernel with one invocation.
  – Blocks are identifiable via their block ID.
  – Leads to a reduction in thread cooperation (threads in different blocks cannot cooperate directly).
  – The blocks of a grid can be arranged as a one- or two-dimensional array.

CUDA Programming Model
(figure)

CUDA Memory Model
(figure)

CUDA Memory Model
• Shared memory
  – Is on-chip:
    • much faster than local and global memory,
    • as fast as a register when there are no bank conflicts,
    • divided into equally-sized memory banks.
  – Successive 32-bit words are assigned to successive banks.
  – Each bank has a bandwidth of 32 bits per clock cycle.

CUDA Memory Model
• Shared memory
  – Reminder: the warp size is 32 and the number of banks is 16.
  – A shared-memory request therefore takes two cycles for a warp:
    • one for the first half-warp, one for the second half-warp,
    • with no conflicts between threads of the first and the second half.

CUDA Memory Model
• Shared memory
(figure)

CUDA API Basics
• An extension to the C programming language:
  – function type qualifiers to specify execution on the host or the device,
  – variable type qualifiers to specify the memory location on the device,
  – a new directive to specify how to execute a kernel on the device,
  – four built-in variables that specify the grid and block dimensions and the block and thread indices.

CUDA API Basics
• Function type qualifiers
  __device__
    • Executed on the device,
    • Callable from the device only.
  __global__
    • Executed on the device,
    • Callable from the host only.
  __host__
    • Executed on the host,
    • Callable from the host only.

CUDA API Basics
• Variable type qualifiers
  __device__
    • Resides in global memory space,
    • Has the lifetime of an application,
    • Is accessible from all the threads within the grid and from the host through the runtime library.
  __constant__ (optionally used together with __device__)
    • Resides in constant memory space,
    • Has the lifetime of an application,
    • Is accessible from all the threads within the grid and from the host through the runtime library.
  __shared__ (optionally used together with __device__)
    • Resides in the shared memory space of a thread block,
    • Has the lifetime of the block,
    • Is accessible only from the threads within the block.

CUDA API Basics
• Execution Configuration (EC)
  – Must be specified for any call to a __global__ function.
  – Defines the dimensions of the grid and the blocks.
  – Specified by inserting an expression between the function name and the argument list: a function declared as
      __global__ void Func(float* parameter);
    must be called like this:
      Func<<< Dg, Db, Ns >>>(parameter);

CUDA API Basics
• Execution Configuration (EC)
  where Dg, Db, Ns are:
  – Dg, of type dim3: dimension and size of the grid,
    • Dg.x * Dg.y = number of blocks being launched;
  – Db, of type dim3: dimension and size of each block,
    • Db.x * Db.y * Db.z = number of threads per block;
  – Ns, of type size_t: number of bytes of shared memory dynamically allocated per block, in addition to the statically allocated memory,
    • Ns is an optional argument which defaults to 0.

CUDA API Basics
• Built-in variables
  – gridDim, of type dim3: dimensions of the grid.
  – blockIdx, of type uint3: block index within the grid.
  – blockDim, of type dim3: dimensions of the block.
  – threadIdx, of type uint3: thread index within the block.
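To see how these pieces fit together, here is a minimal sketch (not from the slides: the kernel ScaleKernel, the helper scale(), and the launch sizes are made up for illustration) combining the function type qualifiers, the execution configuration, and the built-in variables:

#include <cuda_runtime.h>

// Illustrative __device__ helper: runs on the device, callable from device code only.
__device__ float scale(float x, float factor)
{
    return x * factor;
}

// Illustrative __global__ kernel: runs on the device, launched from the host.
__global__ void ScaleKernel(float *out, const float *in, float factor, int n)
{
    // Global element index built from the built-in variables.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = scale(in[i], factor);
}

// Unqualified functions are __host__ by default: they run on the CPU.
int main(void)
{
    const int n = 4096;
    float *d_in = 0, *d_out = 0;
    cudaMalloc((void **)&d_in,  n * sizeof(float));
    cudaMalloc((void **)&d_out, n * sizeof(float));

    // Execution configuration: Dg = 16 blocks, Db = 256 threads per block,
    // so 16 * 256 = 4096 threads, one per element; Ns is omitted and defaults to 0.
    ScaleKernel<<<16, 256>>>(d_out, d_in, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}

Here Dg and Db are passed as plain integers; they are converted to dim3 values with the unused dimensions set to 1, which is the common shorthand for one-dimensional grids and blocks.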
Example: Scalar Product
• Calculate the scalar product of
  – 32 vector pairs
  – with 4096 elements each.
• An efficient way to run that on the device is to organize the calculation as
  – a grid of 32 blocks
  – with 256 threads per block.
• This gives us 4096/256 = 16 slices per vector.

Example: Scalar Product
• The data will be handed to the device as two data arrays, and the results will be saved in a result array.
• Each product of a vector pair An, Bn will be calculated in slices, which are then added up to obtain the final result.
(figure: vectors A0 … AN-1 and B0 … BN-1 map to results 0 to N-1; each vector pair, e.g. A0 and B0, is split into slices 0 to S-1, which yield partial results 0 to S-1)

Example: Scalar Product
The host program:

int main(int argc, char *argv[]){
    CUT_CHECK_DEVICE();
    …
    h_A = (float *)malloc(DATA_SZ);
    …
    cudaMalloc((void **)&d_A, DATA_SZ);
    …
    cudaMemcpy(d_A, h_A, DATA_SZ, cudaMemcpyHostToDevice);
    …
    ProdGPU<<<BLOCK_N, THREAD_N>>>(d_C, d_A, d_B);
    …
    cudaMemcpy(h_C_GPU, d_C, RESULT_SZ, cudaMemcpyDeviceToHost);
    …
    CUDA_SAFE_CALL( cudaFree(d_A) );
    free(h_A);
    …
    CUT_EXIT(argc, argv);
}

Example: Scalar Product
The kernel function
• Parameters:
  – d_C: pointer to the result array,
  – d_A, d_B: pointers to the input data.
• Local data arrays:
  – t[]: results of the single threads,
  – r[]: slice cache.
• I: thread ID within the block.

__global__ void ProdGPU(float *d_C, float *d_A, float *d_B){
    __shared__ float t[THREAD_N];
    __shared__ float r[SLICE_N];
    const int I = threadIdx.x;

    for(int vec_n = blockIdx.x; vec_n < VECTOR_N; vec_n += gridDim.x){
        int base = ELEMENT_N * vec_n;

        for(int slice = 0; slice < SLICE_N; slice++, base += THREAD_N){
            t[I] = d_A[base + I] * d_B[base + I];
            __syncthreads();

            for(int stride = THREAD_N / 2; stride > 0; stride /= 2){
                if(I < stride) t[I] += t[stride + I];
                __syncthreads();
            }
            if(I == 0) r[slice] = t[0];
        }
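        // NOTE (added, hedged): the text preview cuts off here, in the middle of
        // "for(int stride = SLICE_N / 2; stride > ...". The lines below are a
        // plausible completion assuming the standard pattern of the CUDA SDK
        // scalarProd sample: the SLICE_N partial sums in r[] are reduced the same
        // way as t[], and thread 0 then stores the dot product of vector pair
        // vec_n. The original slides may differ in detail.
        for(int stride = SLICE_N / 2; stride > 0; stride /= 2){
            __syncthreads();
            if(I < stride) r[I] += r[stride + I];
        }
        if(I == 0) d_C[vec_n] = r[0];
    }
}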


