ME964 – High Performance Computing for Engineering Applications
The CUDA API
February 08, 2011
© Dan Negrut, 2011, ME964 UW-Madison

"Most of the time I don't have much fun. The rest of the time I don't have any fun at all." – Woody Allen

Before We Get Started…
- Last time:
  - Andrew: wrap-up of building CUDA apps in Visual Studio 2008
  - Andrew: running apps through the HPC scheduler on Newton
  - Very high-level overview of the CUDA programming model
  - Discussed index issues in the context of the "execution configuration" and how the index of a thread translates into the ID of a thread
  - Brief discussion of the memory spaces in relation to GPU computing
- Today:
  - Discussion of the CUDA API
  - One-on-one with Andrew if you have compile/build issues in CUDA: 3-5 PM in room 2042ME
- HW:
  - HW2: due date was 02/08; now 02/10
  - HW3 has been posted.
  - Due date: 02/15
    - Small matrix-vector multiplication
    - Matrix addition – requires use of multiple blocks

Putting Things in Perspective…
- CUDA programming model and execution configuration: basic concepts and data types – just finished this…
- CUDA application programming interface – working on it next
- Simple example to illustrate basic concepts and functionality – coming up shortly
- Performance features will be covered later

The CUDA API

What is an API?
- Application Programming Interface (API): a set of functions, procedures, or classes that an operating system, library, or service provides to support requests made by computer programs (from Wikipedia)
- Example: OpenGL, a graphics library, has its own API that allows one to draw a line, rotate it, resize it, etc.
- A cooked-up analogy (for the mechanical engineer): think of a car as having a certain "Device Operating Interface" (DOI) – a series of pedals, gauges, a steering wheel, etc.
- In this context, CUDA provides an API that enables you to tap into the computational resources of NVIDIA's GPUs
- This is what replaced the old GPGPU way of programming the hardware

On the CUDA API
- Reading the CUDA Programming Guide, you'll run into numerous references to the CUDA Runtime API and the CUDA Driver API
  - Many times they talk about the "CUDA runtime" and the "CUDA driver"; what they mean is the CUDA Runtime API and the CUDA Driver API
- CUDA Runtime API – the friendly face that you can choose to see when interacting with the GPU.
  - This is what gets identified with "C CUDA"
  - Needs the nvcc compiler to generate an executable
- CUDA Driver API – this is more like how it was back in the day: a low-level way of interacting with the GPU
  - You have significantly more control over the host-device interaction
  - A significantly clunkier way to dialogue with the GPU; typically needs only a C compiler
  - I don't anticipate any reason to use the CUDA Driver API

Talking about the API: The C CUDA Software Stack
- [Figure: the C CUDA software stack – CUDA libraries, the CUDA runtime, and the CUDA driver layered between the application and the GPU; each API layer is indicated by a thick red line]
- NOTE: any CUDA runtime function has a name that starts with "cuda"
  - Examples: cudaMalloc, cudaFree, cudaMemcpy, etc.
- Examples of CUDA libraries: CUFFT, CUBLAS, CUSP, Thrust, etc.

CUDA Function Declarations
  Declaration                      | Executed on | Only callable from
  __device__ float DeviceFunc()    | device      | device
  __global__ void  KernelFunc()    | device      | host
  __host__   float HostFunc()      | host        | host
- __global__ defines a kernel function; it must return void
- __device__ and __host__ can be used together

CUDA Function Declarations (cont.)
- __device__ functions can't have their address taken
- For functions executed on the device:
  - No recursion
  - No static variable declarations inside the function
  - No variable number of arguments
  - Something like printf would not work…
[HK-UIUC]

Compiling CUDA
- Any source file containing CUDA language extensions must be compiled with nvcc
  - You spot such a file by its .cu suffix
- nvcc is a compile driver
  - Works by invoking all the necessary tools and compilers, such as cudacc, g++, cl, ...
- nvcc can output:
  - C code (must then be compiled with the rest of the application using another tool)
  - PTX code (CUDA's ISA)
  - Or directly object code (cubin)

Compiling CUDA (cont.)
- nvcc: the compile driver; invokes cudacc, gcc, cl, etc.
- PTX (Parallel Thread eXecution): NVIDIA's ISA, akin to an assembly language
- [Figure: compilation flow – a C/C++ CUDA application goes through NVCC, which emits CPU code plus PTX code; a PTX-to-target compiler then produces target code for a specific GPU (G80, …)]
- Sample PTX:
    ld.global.v4.f32 {$f1,$f3,$f5,$f7}, [$r9+0];
    mad.f32 $f1, $f5, $f3, $f1;

Compiling CUDA extended C
- http://sbel.wisc.edu/Courses/ME964/2008/Documents/nvccCompilerInfo.pdf

The nvcc Compiler – Suffix Info
  File suffix     | How the nvcc compiler interprets the file
  .cu             | CUDA source file, containing host and device code
  .cup            | Preprocessed CUDA source file, containing host code and device functions
  .c              | C source file
  .cc, .cxx, .cpp | C++ source file
  .gpu            | GPU intermediate file (device code only)
  .ptx            | PTX intermediate assembly file (device code only)
  .cubin          | CUDA device-only binary file

CUDA API: Device Memory Allocation
- [Figure: the CUDA device memory model – each thread has registers and local memory, each block has shared memory, and the grid as a whole has global, constant, and texture memory, the last three also accessible from the host]
- cudaMalloc()
  - Allocates an object in the device global memory
  - Requires two parameters: the address of a pointer to the allocated object, and the size of the allocated object
- cudaFree()
  - Frees an object from device global memory
  - Takes a pointer to the freed object
[HK-UIUC]

Example Use: A Matrix Data Type
- NOT part of the CUDA API
- It will be frequently used in many code
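The cudaMalloc()/cudaFree() pattern above can be sketched in a minimal host program. This is only an illustration, not code from the slides: the matrix width and the array name Md are made up. The point to notice is that cudaMalloc() takes the ADDRESS of a pointer plus a size in bytes, while cudaFree() takes the device pointer itself.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    const int Width = 16;                         /* hypothetical matrix dimension */
    size_t size = Width * Width * sizeof(float);  /* size of the object in bytes  */

    /* Parameter 1: address of a pointer to the allocated object.
       Parameter 2: size of the allocated object. */
    float *Md = NULL;
    cudaError_t err = cudaMalloc((void **)&Md, size);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    /* ... kernel launches that read/write Md would go here ... */

    /* cudaFree() just takes the pointer to the object being freed */
    cudaFree(Md);
    return 0;
}
```

Checking the cudaError_t return value, as done here, is good practice for every CUDA runtime call, since a failed allocation otherwise surfaces only later as a crash in a kernel.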
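To tie the function-qualifier table together, here is a minimal sketch (the function names, launch configuration, and array sizes are made up for illustration) showing a __device__ helper called from a __global__ kernel, which is in turn launched from __host__ code:

```cuda
#include <cuda_runtime.h>

/* __device__: executed on the device, callable only from the device.
   Per the restrictions above: no recursion, no static locals,
   no varargs, and its address can't be taken. */
__device__ float Scale(float x) { return 2.0f * x; }

/* __global__: a kernel - executed on the device, callable from
   the host, and it must return void. */
__global__ void ScaleKernel(float *data)
{
    /* the execution configuration turns thread/block indices into a thread ID */
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    data[id] = Scale(data[id]);
}

/* __host__ is the default; using __device__ and __host__ together
   makes nvcc compile one version for the CPU and one for the GPU. */
__host__ __device__ float Shift(float x) { return x + 1.0f; }

int main(void)
{
    float *d_data;
    cudaMalloc((void **)&d_data, 8 * sizeof(float));
    ScaleKernel<<<2, 4>>>(d_data);   /* 2 blocks of 4 threads each */
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```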