UW-Madison ME 964 - ME 964 Lecture Notes

ME964: High Performance Computing for Engineering Applications
"Software is like entropy: It is difficult to grasp, weighs nothing, and obeys the Second Law of Thermodynamics; i.e., it always increases." – Norman Augustine
© Dan Negrut, 2011, ME964 UW-Madison

Asynchronous Concurrent Execution in CUDA
Handling Multiple Streams in CUDA
March 10, 2011

Before We Get Started...
- Last time: CUDA optimization rules of thumb; two parallel implementations of the prefix scan operation
- Today: asynchronous concurrent execution in CUDA; using a CUDA stream and an asynchronous memory copy to decouple CPU and GPU execution; handling multiple streams in CUDA as a means to enable task parallelism
- Other issues: the syllabus is firmed up; we'll have three guest lecturers later in the semester

Concurrent Execution between Host and Device
- To facilitate concurrent execution between host and device, some function calls are asynchronous: control is returned to the host thread before the device has completed the requested task
- Examples of asynchronous calls:
  - Kernel launches
  - Device ↔ device memory copies
  - Host ↔ device memory copies of a memory block of 64 KB or less
  - Memory copies performed by functions suffixed with Async
- NOTE: when an application is run via a CUDA debugger or profiler (cuda-gdb, CUDA Visual Profiler, Parallel Nsight), all launches are synchronous

Concurrent Kernel Execution [CUDA 3.2]
- This feature allows up to 16 kernels to run on the device at the same time
- When is this useful? Devices of compute capability 2.x are pretty wide (a large number of SMs), and sometimes you launch kernels whose execution configuration is smaller than the GPU's "width"; two or three independent kernels can then be "squeezed" onto the GPU at the same time
- This represents one of the GPU's attempts to look like a MIMD architecture

Host-Device Data Transfer Issues
- In general, host ↔ device data transfers using cudaMemcpy() are blocking: control is returned to the host thread only after the data transfer is complete
- There is a non-blocking variant, cudaMemcpyAsync(): the host does not wait for the device to finish the memory copy or the kernel call before it begins executing the cpuFunction() call
- The launch of "kernel" only happens after the memory copy call finishes
- NOTE: the asynchronous transfer version requires pinned host memory (allocated with cudaHostAlloc()) and takes an additional argument (a stream ID)

Overlapping Host ↔ Device Data Transfer with Device Execution
- When is this overlap useful? Imagine a kernel that executes on the device and works only with the lower half of device memory; you could then copy data from host to device into the upper half of device memory, and these two operations could take place simultaneously
- There is an issue with this idea: the device execution queue is FIFO, so a function call on the device is not serviced until all previous device function calls have completed; this would prevent overlapping execution with data transfer
- This issue was addressed by the introduction of CUDA "streams"

CUDA Streams: Overview
- A programmer can manage concurrency through streams
- A stream is a sequence of CUDA commands that execute in order; look at a stream as a queue of GPU operations
- The execution order in a stream is identical to the order in which the GPU operations are added to the stream
- One host thread can define multiple CUDA streams; think of a stream as a task that gets executed by the GPU: you can have multiple tasks, and sometimes the GPU can execute parts of these tasks simultaneously
- Typical operations in a stream: invoking a data transfer, invoking a kernel execution, handling events

CUDA Streams: Overview [Cntd.]
- With respect to each other, different CUDA streams execute their commands as they see fit; inter-stream relative behavior is not guaranteed and should therefore not be relied upon for correctness (e.g. inter-kernel communication between kernels assigned to different streams is undefined)
- Another way to look at it: streams can be synchronized at barrier points, but correlation of execution order across different streams is not supported
- When thinking about the typical GPU operations in a stream (see the previous slide), remember that the GPU hardware has two types of engines: one or more engines for copy operations, and one engine to execute kernels
- The fact that there are two hardware engines becomes relevant to how you organize the queuing of GPU operations in a stream: for improved efficiency you want these two engines to work simultaneously

CUDA Streams: Creation
- A stream is defined by creating a stream object and specifying it as the stream parameter in a sequence of kernel launches and host ↔ device memory copies
- The slide's code sample creates two streams and allocates an array hostPtr of float in page-locked memory; hostPtr will be used in asynchronous host ↔ device memory transfers

CUDA Streams: Making Use of Them
- In the slide's code sample, each of the two streams is defined as a sequence of: one memory copy from host to device, one kernel launch, and one memory copy from device to host
- There are some wrinkles to it; we'll revisit them shortly

CUDA Streams: Clean-Up Phase
- Streams are released by calling cudaStreamDestroy()
- Note that cudaStreamDestroy() waits for all preceding commands in the given stream to complete before destroying the stream and returning control to the host thread

CUDA Streams: Caveats
- Two commands from different streams cannot run concurrently if either one of the following operations is issued in between them by the host thread:
  - A page-locked host memory allocation
  - A device memory allocation
  - A device memory set
  - A device ↔ device memory copy
  - Any CUDA command to stream 0 (including kernel launches and host ↔ device memory copies that do not specify a stream parameter)
  - A switch between the L1/shared memory configurations

CUDA Streams: Synchronization Aspects
- cudaThreadSynchronize() waits until all preceding commands in all streams have completed
- cudaStreamSynchronize() takes a stream as a parameter and waits until all preceding commands in the given stream have completed; it can be used to synchronize the host with a specific stream, allowing other streams to continue executing on the device
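The cudaMemcpyAsync() decoupling described on the data-transfer slides can be sketched as follows. The preview does not include the slide's code, so this is a minimal sketch: "kernel", "cpuFunction", and the sizes are hypothetical stand-ins for the names used in the slide text.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel standing in for "kernel" on the slide.
__global__ void kernel(float *d) { d[threadIdx.x] *= 2.0f; }

// Hypothetical host-side work that overlaps with the GPU activity.
void cpuFunction() { /* ... */ }

int main() {
    const size_t size = 256 * sizeof(float);
    float *hostPtr, *devPtr;

    // Pinned (page-locked) host memory: required by cudaMemcpyAsync().
    cudaHostAlloc((void **)&hostPtr, size, cudaHostAllocDefault);
    cudaMalloc((void **)&devPtr, size);

    // Non-blocking: control returns to the host thread immediately.
    cudaMemcpyAsync(devPtr, hostPtr, size, cudaMemcpyHostToDevice, 0);

    // Queued behind the copy in stream 0: the kernel launch only happens
    // after the memory copy finishes, but the host waits for neither.
    kernel<<<1, 256>>>(devPtr);

    // Executes on the CPU while the copy and the kernel proceed on the GPU.
    cpuFunction();

    cudaThreadSynchronize();  // block until all queued GPU work is done
    cudaFreeHost(hostPtr);
    cudaFree(devPtr);
    return 0;
}
```

Without the Async suffix and the pinned allocation, the copy would block and cpuFunction() could not start until the transfer completed.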
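The create/use/clean-up sequence described on the streams slides can also be sketched end to end, since the slide's own code sample is missing from this preview. MyKernel, the element count N, and the launch configuration are assumptions, not taken from the slides; the structure (two streams, each queuing a host-to-device copy, a kernel launch, and a device-to-host copy) follows the slide text.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel; the slides do not show its body.
__global__ void MyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int N = 1 << 20;                  // elements per stream (assumed)
    const size_t size = N * sizeof(float);  // bytes per stream

    // Creation: two stream objects (slide "CUDA Streams: Creation").
    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i)
        cudaStreamCreate(&stream[i]);

    // hostPtr in page-locked memory, as required for async transfers.
    float *hostPtr, *devPtr;
    cudaHostAlloc((void **)&hostPtr, 2 * size, cudaHostAllocDefault);
    cudaMalloc((void **)&devPtr, 2 * size);

    // Use: each stream queues copy-in, kernel, copy-out
    // (slide "CUDA Streams: Making Use of Them").
    for (int i = 0; i < 2; ++i) {
        cudaMemcpyAsync(devPtr + i * N, hostPtr + i * N, size,
                        cudaMemcpyHostToDevice, stream[i]);
        MyKernel<<<(N + 511) / 512, 512, 0, stream[i]>>>(devPtr + i * N, N);
        cudaMemcpyAsync(hostPtr + i * N, devPtr + i * N, size,
                        cudaMemcpyDeviceToHost, stream[i]);
    }

    // Synchronization and clean-up: wait for all streams, then release them
    // (slides "Synchronization Aspects" and "Clean-Up Phase").
    cudaThreadSynchronize();
    for (int i = 0; i < 2; ++i)
        cudaStreamDestroy(stream[i]);

    cudaFreeHost(hostPtr);
    cudaFree(devPtr);
    return 0;
}
```

Because the two streams are independent, the copy engine can service one stream's transfer while the kernel engine runs the other stream's kernel, which is exactly the two-engine overlap the overview slide recommends exploiting.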

