ME964 – High Performance Computing for Engineering Applications
The CUDA API
February 08, 2011
"Most of the time I don't have much fun. The rest of the time I don't have any fun at all." – Woody Allen
© Dan Negrut, 2011, ME964 UW-Madison

Before We Get Started...
- Last time:
  - Andrew: wrap-up of building CUDA apps in Visual Studio 2008
  - Andrew: running apps through the HPC scheduler on Newton
  - Very high-level overview of the CUDA programming model
  - Discussed indexing issues in the context of the "execution configuration" and how the index of a thread translates into an ID of a thread
  - Brief discussion of the memory spaces in relation to GPU computing
- Today:
  - Discussion of the CUDA API
  - One-on-one with Andrew if you have compile/build issues in CUDA: 3-5 PM in room 2042ME
- HW:
  - HW2: due date was 02/08, now 02/10
  - HW3 has been posted; due date: 02/15
    - Small matrix-vector multiplication
    - Matrix addition – requires use of multiple blocks

Putting Things in Perspective...
- CUDA programming model and execution configuration
  - Basic concepts and data types – just finished this
- CUDA application programming interface
  - Working on it next
- Simple example to illustrate basic concepts and functionality
  - Coming up shortly
- Performance features will be covered later

The CUDA API

What is an API?
- Application Programming Interface (API)
  - A set of functions, procedures, or classes that an operating system, library, or service provides to support requests made by computer programs (from Wikipedia)
  - Example: OpenGL, a graphics library, has its own API that allows one to draw a line, rotate it, resize it, etc.
- A cooked-up analogy (for the mechanical engineer):
  - Think of a car: you could say it has a certain Device Operating Interface (DOI) – a series of pedals, gauges, a steering wheel, etc.
    This would be its DOI.
- In this context, CUDA provides an API that enables you to tap into the computational resources of NVIDIA's GPUs
  - This is what replaced the old GPGPU way of programming the hardware

On the CUDA API
- Reading the CUDA Programming Guide, you'll run into numerous references to the CUDA Runtime API and the CUDA Driver API
  - Many times they simply say "CUDA runtime" and "CUDA driver"; what they mean is the CUDA Runtime API and the CUDA Driver API
- CUDA Runtime API – the friendly face that you can choose to see when interacting with the GPU; this is what gets identified with "C CUDA"
  - Needs the nvcc compiler to generate an executable
- CUDA Driver API – more like how it was back in the day: a low-level way of interacting with the GPU
  - You have significantly more control over the host-device interaction
  - A significantly clunkier way to dialogue with the GPU; typically needs only a C compiler
- I don't anticipate any reason to use the CUDA Driver API

Talking about the API: The C CUDA Software Stack
- [Diagram: the C CUDA software stack – the application sits on top of the CUDA libraries, the CUDA runtime, and the CUDA driver; each API layer is indicated by a thick red line]
- NOTE: any CUDA runtime function has a name that starts with "cuda"
  - Examples: cudaMalloc, cudaFree, cudaMemcpy, etc.
- Examples of CUDA libraries: CUFFT, CUBLAS, CUSP, Thrust, etc.

CUDA Function Declarations

                                      Executed on the:    Only callable from the:
    __device__ float DeviceFunc()     device              device
    __global__ void  KernelFunc()     device              host
    __host__   float HostFunc()       host                host

- __global__ defines a kernel function
  - Must return void
- __device__ and __host__ can be used together

CUDA Function Declarations (cont.)
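The three declaration qualifiers can be sketched in a small toy example; the function names and the contents of the kernel here are made up for illustration only:

```cuda
// __device__: executed on the device, callable only from device code
__device__ float deviceSquare(float x) { return x * x; }

// __global__: a kernel - executed on the device, launched from the host,
// and it must return void
__global__ void squareKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = deviceSquare(data[i]);
}

// __device__ and __host__ used together: compiled for both CPU and GPU
__host__ __device__ float cube(float x) { return x * x * x; }
```

A __global__ kernel such as squareKernel is then launched from host code with an execution configuration, e.g. squareKernel<<<numBlocks, threadsPerBlock>>>(dData, n);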
- __device__ functions can't have their address taken
- For functions executed on the device:
  - No recursion
  - No static variable declarations inside the function
  - No variable number of arguments
  - Something like printf would not work…
[HK-UIUC]

Compiling CUDA
- Any source file containing CUDA language extensions must be compiled with nvcc
  - You spot such a file by its .cu suffix
- nvcc is a compile driver
  - Works by invoking all the necessary tools and compilers, like cudacc, g++, cl, ...
- nvcc can output:
  - C code
    - Must then be compiled with the rest of the application using another tool
  - PTX code (CUDA's ISA)
  - Or directly object code (cubin)

Compiling CUDA (cont.)
- nvcc: compile driver; invokes cudacc, gcc, cl, etc.
- PTX: Parallel Thread eXecution
  - Like assembly language; NVIDIA's ISA
- [Diagram: nvcc splits a C/C++ CUDA application into CPU code and PTX code; a PTX-to-target compiler then produces GPU target code (e.g., for the G80)]
- Sample PTX:

      ld.global.v4.f32 {$f1,$f3,$f5,$f7}, [$r9+0];
      mad.f32 $f1, $f5, $f3, $f1;

- More detail: http://sbel.wisc.edu/Courses/ME964/2008/Documents/nvccCompilerInfo.pdf

The nvcc Compiler – Suffix Info

File suffix       How the nvcc compiler interprets the file
.cu               CUDA source file, containing host and device code
.cup              Preprocessed CUDA source file, containing host code and device functions
.c                C source file
.cc, .cxx, .cpp   C++ source file
.gpu              GPU intermediate file (device code only)
.ptx              PTX intermediate assembly file (device code only)
.cubin            CUDA device-only binary file

CUDA API: Device Memory Allocation
[Diagram: device memory spaces – per-thread registers and local memory, per-block shared memory, and grid-wide global, constant, and texture memory, with the host shown alongside]
- cudaMalloc()
  - Allocates an object in device global memory
  - Requires two parameters:
    - Address of a pointer to the allocated object
    - Size of the allocated object
- cudaFree()
  - Frees an object from device global memory
  - Takes a pointer to the freed object
[HK-UIUC]

Example Use: A Matrix Data Type
- NOT part of the CUDA API
- It will be frequently used in
  many code examples
- 2-D matrix
- Single-precision float elements
- width * height elements
- Matrix entries attached to the pointer-to-float member called "elements"
- Matrix is stored row-wise

      typedef struct {
          int    width;
          int    height;
          float* elements;
      } Matrix;

CUDA Device Memory Allocation (cont.)
- Code example:
  - Allocate a 64 * 64 single-precision float array
  - Attach the allocated storage to Md.elements
  - "d" is often used to indicate a device data structure

      const int BLOCK_SIZE = 64;
      Matrix Md;
      int size = BLOCK_SIZE * BLOCK_SIZE * sizeof(float);

      cudaMalloc((void**)&Md.elements, size);
      // ... use it for what you need, then free the device memory
      cudaFree(Md.elements);
[HK-UIUC]

All the details are spelled out in the CUDA Programming Guide 3.2 (see the resources section of the class website).

CUDA Host-Device Data Transfer
- cudaMemcpy(): memory data transfer
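The slide above is cut off, but the allocate / copy-in / compute / copy-out / free pattern it introduces can be sketched end to end; the array size and variable names below are arbitrary, and the kernel launch is elided:

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const int N = 64 * 64;
    const int size = N * sizeof(float);

    // Host-side buffer
    float* hData = (float*)malloc(size);
    for (int i = 0; i < N; ++i) hData[i] = 1.0f;

    // Allocate device global memory
    float* dData = NULL;
    cudaMalloc((void**)&dData, size);

    // Host-to-device transfer
    cudaMemcpy(dData, hData, size, cudaMemcpyHostToDevice);

    // ... launch kernels that operate on dData here ...

    // Device-to-host transfer
    cudaMemcpy(hData, dData, size, cudaMemcpyDeviceToHost);

    // Release device and host memory
    cudaFree(dData);
    free(hData);
    return 0;
}
```

Note that cudaMemcpy takes the destination pointer first, the source second, and a direction flag (cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, etc.) last.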