ME964: High Performance Computing for Engineering Applications
The CUDA API
February 08, 2011
© Dan Negrut, 2011, ME964 UW-Madison

"Most of the time I don't have much fun. The rest of the time I don't have any fun at all." – Woody Allen

Before We Get Started...

Last time:
- Andrew: wrap-up of building CUDA apps in Visual Studio 2008
- Andrew: running apps through the HPC scheduler on Newton
- Very high-level overview of the CUDA programming model
- Discussed index issues in the context of the "execution configuration" and how the index of a thread translates into an ID of a thread
- Brief discussion of the memory spaces in relation to GPU computing

Today:
- Discussion of the CUDA API
- One-on-one with Andrew if you have compile/build issues in CUDA: 3-5 PM in room 2042ME

Homework:
- HW2: due date was 02/08; now 02/10
- HW3 has been posted; due date: 02/15
  - Small matrix-vector multiplication
  - Matrix addition – requires the use of multiple blocks

Putting Things in Perspective...

- CUDA programming model and execution configuration: basic concepts and data types – just finished this
- CUDA application programming interface – working on it next
- Simple example to illustrate basic concepts and functionality – coming up shortly
- Performance features – will be covered later

The CUDA API

What is an API?
- An Application Programming Interface (API) is a set of functions, procedures, or classes that an operating system, library, or service provides to support requests made by computer programs (from Wikipedia)
- Example: OpenGL, a graphics library, has its own API that allows one to draw a line, rotate it, resize it, etc.
- A cooked-up analogy (for the mechanical engineer): think of a car as having a certain "Device Operating Interface" (DOI) – the pedals, gauges, steering wheel, etc. through which you operate it
- In this context, CUDA provides an API that enables you to tap into the computational resources of NVIDIA's GPUs
- This is what replaced the old GPGPU way of programming the hardware

On the CUDA API

- Reading the CUDA Programming Guide, you'll run into numerous references to the CUDA Runtime API and the CUDA Driver API; many times the guide says "CUDA runtime" or "CUDA driver", but what it means is the CUDA Runtime API and the CUDA Driver API
- CUDA Runtime API: the friendly face that you can choose to see when interacting with the GPU
  - This is what gets identified with "C CUDA"
  - Needs the nvcc compiler to generate an executable
  - A first taste of it is sketched right after this list
- CUDA Driver API: more like how it was back in the day – a low-level way of interacting with the GPU
  - You have significantly more control over the host-device interaction
  - A significantly clunkier way to dialogue with the GPU; typically needs only a C compiler
- I don't anticipate any reason to use the CUDA Driver API
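To give a concrete first taste of the Runtime API, here is a minimal device-query sketch. It is not from the original slides; it uses only documented runtime calls (cudaGetDeviceCount, cudaGetDeviceProperties, cudaGetErrorString), and the file name dev_query.cu is made up.

    // dev_query.cu – list the GPUs the CUDA runtime can see.
    // Build (assuming nvcc is on the path): nvcc dev_query.cu -o dev_query
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaError_t err = cudaGetDeviceCount(&count);  // how many devices are visible?
        if (err != cudaSuccess) {
            std::printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
            return 1;
        }
        for (int i = 0; i < count; ++i) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);         // fill in the properties struct
            std::printf("Device %d: %s, compute capability %d.%d\n",
                        i, prop.name, prop.major, prop.minor);
        }
        return 0;
    }

Note how every runtime function starts with the "cuda" prefix and reports an error code – a pattern that holds for cudaMalloc, cudaMemcpy, and friends below.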
Talking About the API: The C CUDA Software Stack

[Figure omitted: the C CUDA software stack, showing where the API fits in the picture; the API layer is indicated by a thick red line]

- NOTE: any CUDA runtime function has a name that starts with "cuda"
  - Examples: cudaMalloc, cudaFree, cudaMemcpy, etc.
- Examples of CUDA libraries: CUFFT, CUBLAS, CUSP, Thrust, etc.

CUDA Function Declarations

    Qualifier and example              Executed on:   Only callable from:
    __device__ float DeviceFunc()      device         device
    __global__ void  KernelFunc()      device         host
    __host__   float HostFunc()        host           host

- __global__ defines a kernel function; it must return void
- __device__ and __host__ can be used together

CUDA Function Declarations (cont.)

- __device__ functions can't have their address taken
- For functions executed on the device:
  - No recursion
  - No static variable declarations inside the function
  - No variable number of arguments (something like printf would not work...)
- A short sketch showing the three qualifiers in use follows
[HK-UIUC]
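The sketch below makes the declaration table concrete. It is not from the original slides; the function names (square, twice, SquareKernel) are made up for illustration, but the qualifiers and launch syntax are standard CUDA.

    #include <cuda_runtime.h>

    // __device__: runs on the GPU, callable only from device code
    __device__ float square(float x) { return x * x; }

    // __host__ __device__: the two qualifiers used together – compiled for both CPU and GPU
    __host__ __device__ float twice(float x) { return 2.0f * x; }

    // __global__: a kernel – runs on the device, launched from the host, must return void
    __global__ void SquareKernel(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // thread index -> global thread ID
        if (i < n)
            data[i] = twice(square(data[i]));
    }

    // From host code, the kernel is launched with an execution configuration, e.g.:
    //     SquareKernel<<<(n + 255) / 256, 256>>>(d_data, n);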
Compiling CUDA

- Any source file containing CUDA language extensions must be compiled with nvcc
  - You spot such a file by its .cu suffix
- nvcc is a compile driver
  - It works by invoking all the necessary tools and compilers, like cudacc, g++, cl, ...
- nvcc can output:
  - C code, which must then be compiled with the rest of the application using another tool
  - PTX code (CUDA's ISA)
  - Or object code directly (cubin)

Compiling CUDA (cont.)

- nvcc: the compile driver; invokes cudacc, gcc, cl, etc.
- PTX (Parallel Thread eXecution): NVIDIA's ISA; like an assembly language

[Figure omitted: the nvcc compilation flow – a C/C++ CUDA application goes through NVCC, which emits CPU code plus PTX code; a PTX-to-target compiler then generates target code for a specific GPU (G80, ...). Sample PTX from the figure:]

    ld.global.v4.f32  {$f1,$f3,$f5,$f7}, [$r9+0];
    mad.f32           $f1, $f5, $f3, $f1;

More on compiling CUDA extended C:
http://sbel.wisc.edu/Courses/ME964/2008/Documents/nvccCompilerInfo.pdf

The nvcc Compiler – Suffix Info

    File suffix        How the nvcc compiler interprets the file
    .cu                CUDA source file, containing host and device code
    .cup               Preprocessed CUDA source file, containing host code and device functions
    .c                 C source file
    .cc, .cxx, .cpp    C++ source file
    .gpu               GPU intermediate file (device code only)
    .ptx               PTX intermediate assembly file (device code only)
    .cubin             CUDA device-only binary file

CUDA API: Device Memory Allocation

[Figure omitted: the CUDA memory spaces – per-thread registers and local memory, per-block shared memory, and the grid-wide global, constant, and texture memories that the host can access]

- cudaMalloc()
  - Allocates an object in device global memory
  - Requires two parameters: the address of a pointer to the allocated object, and the size of the allocated object
- cudaFree()
  - Frees an object from device global memory; takes a pointer to the freed object
[HK-UIUC]

Example Use: A Matrix Data Type

- NOT part of the CUDA API, but it will be frequently used in many code examples
- A 2D matrix with single-precision float elements, width * height elements in all
- The matrix entries are attached to the pointer-to-float member called "elements"; the matrix is stored row-wise

    typedef struct {
        int    width;
        int    height;
        float* elements;
    } Matrix;

CUDA Device Memory Allocation (cont.)

- Code example: allocate a 64 * 64 single-precision float array and attach the allocated storage to Md.elements
- "d" is often used in a name to indicate a device data structure

    const int BLOCK_SIZE = 64;
    Matrix Md;
    int size = BLOCK_SIZE * BLOCK_SIZE * sizeof(float);
    cudaMalloc((void**)&Md.elements, size);
    // ... use it for what you need, then free the device memory
    cudaFree(Md.elements);
[HK-UIUC]

All the details are spelled out in the CUDA Programming Guide 3.2 (see the resources section of the class website).

CUDA Host-Device Data Transfer

- cudaMemcpy(): memory data transfer between host and device
  - Takes four parameters: a pointer to the destination, a pointer to the source, the number of bytes to copy, and the type of transfer (host to host, host to device, device to host, or device to device); a round-trip sketch follows
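To round out the allocation example above, here is a minimal host-device round trip. It is not from the original slides; the buffer names h_data and d_data are made up, but cudaMalloc, cudaMemcpy, and cudaFree are used with their documented signatures.

    #include <cstdlib>
    #include <cuda_runtime.h>

    int main() {
        const int N = 64 * 64;
        size_t size = N * sizeof(float);

        float* h_data = (float*)std::malloc(size);   // "h" for host
        for (int i = 0; i < N; ++i) h_data[i] = 1.0f;

        float* d_data = NULL;                        // "d" for device
        cudaMalloc((void**)&d_data, size);

        // Host -> device: the fourth parameter gives the direction of the copy
        cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);

        // ... launch kernels that read/write d_data here ...

        // Device -> host: bring the results back
        cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);

        cudaFree(d_data);
        std::free(h_data);
        return 0;
    }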

