UW-Madison ME 964 - High Performance Computing for Engineering Applications

Contents:
- Slide 1 (title)
- Before We Get Started…
- Putting Things in Perspective…
- Slide 4 (section divider: The CUDA API)
- What is an API?
- On the CUDA API
- Talking about the API: The C CUDA Software Stack
- CUDA Function Declarations
- CUDA Function Declarations (cont.)
- Compiling CUDA (two slides)
- Compiling CUDA extended C
- The nvcc Compiler – Suffix Info
- CUDA API: Device Memory Allocation
- Example Use: A Matrix Data Type
- CUDA Device Memory Allocation (cont.)
- CUDA Host-Device Data Transfer (two slides)
- Slide 19
- Simple Example: Matrix Multiplication
- Square Matrix Multiplication Example
- Multiply Using One Thread Block
- Step 1: Matrix Multiplication – A Simple Host Code in C
- Step 2: Matrix Multiplication, Host-Side Main Program Code
- Step 3: Matrix Multiplication, Host-Side Code
- Step 4: Matrix Multiplication – Device-Side Kernel Function
- Step 5: Some Loose Ends
- The Common Pattern to CUDA Programming
- Timing Your Application
- Timing Example: Timing a query of device 0 properties

ME964 – High Performance Computing for Engineering Applications
The CUDA API
February 08, 2011
© Dan Negrut, 2011, ME964 UW-Madison

"Most of the time I don't have much fun. The rest of the time I don't have any fun at all." – Woody Allen

Before We Get Started…
- Last time:
  - Andrew: wrap-up of building CUDA apps in Visual Studio 2008
  - Andrew: running apps through the HPC scheduler on Newton
  - Very high-level overview of the CUDA programming model
  - Discussed index issues in the context of the "execution configuration" and how the index of a thread translates into the ID of a thread
  - Brief discussion of the memory spaces in relation to GPU computing
- Today:
  - Discussion of the CUDA API
  - One-on-one with Andrew if you have compile/build issues in CUDA: 3-5 PM in room 2042ME
- HW:
  - HW2: due date was 02/08, now 02/10
  - HW3 has been posted; due date 02/15
    - Small matrix-vector multiplication
    - Matrix addition – requires use of multiple blocks

Putting Things in Perspective…
- CUDA programming model and execution configuration: basic concepts and data types – just finished this
- CUDA application programming interface – working on it next
- Simple example to illustrate basic concepts and functionality – coming up shortly
- Performance features will be covered later

The CUDA API

What is an API?
- Application Programming Interface (API): a set of functions, procedures, or classes that an operating system, library, or service provides to support requests made by computer programs (from Wikipedia)
- Example: OpenGL, a graphics library, has its own API that allows one to draw a line, rotate it, resize it, etc.
- A cooked-up analogy (for the mechanical engineer): think of a car; you could say it has a certain Device Operating Interface (DOI) – a series of pedals, gauges, a steering wheel, etc. That would be its DOI.
- In this context, CUDA provides an API that enables you to tap into the computational resources of NVIDIA's GPUs
- This is what replaced the old GPGPU way of programming the hardware

On the CUDA API
- Reading the CUDA Programming Guide, you'll run into numerous references to the CUDA Runtime API and the CUDA Driver API. Many times the guide says "CUDA runtime" or "CUDA driver"; what it means is the CUDA Runtime API and the CUDA Driver API.
- CUDA Runtime API: the friendly face that you can choose to see when interacting with the GPU. This is what gets identified with "C CUDA". Needs the nvcc compiler to generate an executable.
- CUDA Driver API: this is more like how it was back in the day – a low-level way of interacting with the GPU. You have significantly more control over the host-device interaction, but it is a significantly clunkier way to dialogue with the GPU. It typically needs only a C compiler.
- I don't anticipate any reason to use the CUDA Driver API.
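To make the Runtime API concrete, here is a minimal host-only sketch (my own, not from the slides) that queries the properties of device 0 – in the spirit of the "Timing a query of device 0 properties" slide listed in the outline. Every call is a Runtime API function; note the "cuda" prefix. Build it with nvcc from a .cu file.

    // query.cu -- minimal Runtime API sketch (illustrative, not from the slides)
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        int count = 0;
        cudaGetDeviceCount(&count);          // how many CUDA-capable GPUs are present?
        printf("found %d CUDA device(s)\n", count);

        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // query the properties of device 0
        printf("device 0: %s, compute capability %d.%d\n",
               prop.name, prop.major, prop.minor);
        return 0;
    }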
Talking about the API: The C CUDA Software Stack
- [Figure: the C CUDA software stack – the application sits on top of the CUDA libraries, the runtime, and the driver; each API layer is indicated by a thick red line]
- NOTE: any CUDA runtime function has a name that starts with "cuda". Examples: cudaMalloc, cudaFree, cudaMemcpy, etc.
- Examples of CUDA libraries: CUFFT, CUBLAS, CUSP, thrust, etc.

CUDA Function Declarations

    Declaration                      Executed on the:   Only callable from the:
    __device__ float DeviceFunc()    device             device
    __global__ void  KernelFunc()    device             host
    __host__   float HostFunc()      host               host

- __global__ defines a kernel function; it must return void
- __device__ and __host__ can be used together (a short sketch tying these qualifiers to code appears at the end of these notes)

CUDA Function Declarations (cont.)
- __device__ functions can't have their address taken
- For functions executed on the device:
  - No recursion
  - No static variable declarations inside the function
  - No variable number of arguments
- Something like printf would not work… (HK-UIUC)

Compiling CUDA
- Any source file containing CUDA language extensions must be compiled with nvcc; you spot such a file by its .cu suffix
- nvcc is a compile driver: it works by invoking all the necessary tools and compilers, such as cudacc, g++, cl, …
- nvcc can output:
  - C code, which must then be compiled with the rest of the application using another tool
  - PTX code (CUDA's ISA)
  - or directly object code (cubin)

Compiling CUDA (cont.)
- nvcc: compile driver; invokes cudacc, gcc, cl, etc.
- PTX (Parallel Thread eXecution): like assembly language; NVIDIA's ISA
- [Figure: compilation flow – a C/C++ CUDA application goes through nvcc, which emits CPU code plus PTX code; a PTX-to-target compiler then generates target code for a specific GPU (G80, …)]
- Sample PTX:

    ld.global.v4.f32  {$f1,$f3,$f5,$f7}, [$r9+0];
    mad.f32           $f1, $f5, $f3, $f1;

Compiling CUDA extended C
- See http://sbel.wisc.edu/Courses/ME964/2008/Documents/nvccCompilerInfo.pdf

The nvcc Compiler – Suffix Info

    File suffix       How the nvcc compiler interprets the file
    .cu               CUDA source file, containing host and device code
    .cup              Preprocessed CUDA source file, containing host code and device functions
    .c                C source file
    .cc, .cxx, .cpp   C++ source file
    .gpu              GPU intermediate file (device code only)
    .ptx              PTX intermediate assembly file (device code only)
    .cubin            CUDA device-only binary file

CUDA API: Device Memory Allocation
- [Figure: the CUDA memory model – the host next to the device grid; each block has its own shared memory; each thread has registers and local memory; the whole grid shares global, constant, and texture memory]
- cudaMalloc(): allocates an object in device global memory; requires two parameters: the address of a pointer to the allocated object, and the size of the allocated object
- cudaFree(): frees an object from device global memory; takes a pointer to the freed object (see the allocate/copy/free sketch below) (HK-UIUC)

Example Use: A Matrix Data Type
- NOT part of the CUDA API
- It will be frequently used in many code examples
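The slide's actual definition of the matrix type is cut off in this preview; based on the matrix-multiplication example it feeds into, a plausible sketch looks something like the following. The field names here are hypothetical, not taken from the deck.

    /* Hypothetical sketch -- the slide's real definition is truncated in this
       preview; this only illustrates the kind of struct being referred to. */
    typedef struct {
        int    width;      /* number of columns                 */
        int    height;     /* number of rows                    */
        float* elements;   /* width * height entries, row-major */
    } Matrix;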

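Putting cudaMalloc and cudaFree from the slides above into one minimal sketch, together with cudaMemcpy (which the outline's Host-Device Data Transfer slides cover): the function and variable names here are my own, and error checking is omitted for brevity.

    // Allocate / copy / free sketch (illustrative, not from the slides).
    #include <cuda_runtime.h>

    void roundTrip(float* hA, int n)    // hA: array of n floats on the host
    {
        size_t bytes = n * sizeof(float);
        float* dA = NULL;

        cudaMalloc((void**)&dA, bytes);   // param 1: address of the pointer; param 2: size
        cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);   // host -> device

        /* ... launch kernels that operate on dA here ... */

        cudaMemcpy(hA, dA, bytes, cudaMemcpyDeviceToHost);   // device -> host
        cudaFree(dA);                     // release the device global memory
    }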

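Looking back at the CUDA Function Declarations table, here is a minimal sketch tying the qualifiers to code (names are made up for illustration):

    // qualifiers.cu -- illustrative sketch, not from the slides.
    __host__ __device__ float plusOne(float x)   // compiled for both host and device
    {
        return x + 1.0f;
    }

    __global__ void incrementKernel(float* data, int n)   // kernel: must return void
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;    // global thread ID
        if (i < n)                   // guard threads past the end of the array
            data[i] = plusOne(data[i]);
    }

A __global__ function is callable only from the host, via an execution configuration, e.g. incrementKernel<<<(n + 255) / 256, 256>>>(dA, n);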