UW-Madison ME 964 - Thrust - A Productivity-Oriented Library for CUDA


Thrust: A Productivity-Oriented Library for CUDA

Contents:
  Motivation
  Diving In
    Iterators and Memory Spaces
    Interoperability
  Generic Programming
  Benefits of Abstraction
    Programmer Productivity
    Robustness
    Real-World Performance
  Best Practices
    Fusion
    Structure of Arrays
    Implicit Ranges
  References

HWU 2011 Ch26-9780123859631 2011/8/22 15:33 Page 359

CHAPTER 26
Thrust: A Productivity-Oriented Library for CUDA
Nathan Bell and Jared Hoberock

This chapter demonstrates how to leverage the Thrust parallel template library to implement high-performance applications with minimal programming effort. Based on the C++ Standard Template Library (STL), Thrust brings a familiar high-level interface to the realm of GPU computing while remaining fully interoperable with the rest of the CUDA software ecosystem. Applications written with Thrust are concise, readable, and efficient.

26.1 MOTIVATION

With the introduction of CUDA C/C++, developers can harness the massive parallelism of the GPU through a standard programming language. CUDA allows developers to make fine-grained decisions about how computations are decomposed into parallel threads and executed on the device. The level of control offered by CUDA C/C++ (henceforth CUDA C) is an important feature: it facilitates the development of high-performance algorithms for a variety of computationally demanding tasks that (1) merit significant optimization and (2) profit from low-level control of the mapping onto hardware. For this class of computational tasks CUDA C is an excellent solution.

Thrust [1] solves a complementary set of problems, namely those that (1) can be implemented efficiently without a detailed mapping of work onto the target architecture or that (2) do not merit or simply will not receive significant optimization effort by the user. With Thrust, developers describe their computation using a collection of high-level algorithms and completely delegate the decision of how to implement the computation to the library.
This abstract interface allows programmers to describe what to compute without placing any additional restrictions on how to carry out the computation. By capturing the programmer's intent at a high level, Thrust has the discretion to make informed decisions on behalf of the programmer and select the most efficient implementation.

The value of high-level libraries is broadly recognized in high-performance computing. For example, the widely used BLAS standard provides an abstract interface to common linear algebra operations. First conceived more than three decades ago, BLAS remains relevant today in large part because it allows valuable, platform-specific optimizations to be introduced behind a uniform interface. Whereas BLAS is focused on numerical linear algebra, Thrust provides an abstract interface to fundamental parallel algorithms such as scan, sort, and reduction. Thrust leverages the power of C++ templates to make these algorithms generic, enabling them to be used with arbitrary user-defined types and operators. Thrust establishes a durable interface for parallel computing with an eye towards generality, programmer productivity, and real-world performance.

GPU Computing Gems, (c) 2012 Elsevier Inc.
All rights reserved. Appears in GPU Computing Gems: Jade Edition, published 2011 by Morgan Kaufmann Publishers (http://mkp.com/news/3405).

26.2 DIVING IN

Before going into greater detail, let us consider the program in Listing 26.1, which illustrates the salient features of Thrust.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>

int main(void)
{
    // generate 16M random numbers on the host
    thrust::host_vector<int> h_vec(1 << 24);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);

    // transfer data to the device
    thrust::device_vector<int> d_vec = h_vec;

    // sort data on the device
    thrust::sort(d_vec.begin(), d_vec.end());

    // transfer data back to host
    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());

    return 0;
}

Listing 26.1. A complete Thrust program which sorts data on the GPU.

Thrust provides two vector containers: host_vector and device_vector. As the names suggest, host_vector is stored in host memory while device_vector lives in device memory on the GPU. Like the vector container in the C++ STL, host_vector and device_vector are generic containers (i.e., they are able to store any data type) that can be resized dynamically. As the example shows, containers automate the allocation and deallocation of memory and simplify the process of exchanging data between the host and device.

The program acts on the vector containers using the generate, sort, and copy algorithms. Here, we adopt the STL convention of specifying ranges using pairs of iterators.
In this example, the iterators h_vec.begin() and h_vec.end() can be thought of as a pair of int pointers, where the former points to the first element in the array and the latter to the element one past the end of the array. Together the pair defines a range of integers of size h_vec.end() - h_vec.begin().

Note that even though the computation implied by the call to the sort algorithm suggests one or more CUDA kernel launches, the programmer has not specified a launch configuration. Thrust's interface abstracts these details. The choice of performance-sensitive variables such as grid and block size, the details of memory management, and even the choice of sorting algorithm are left to the discretion of the library implementor.

26.2.1 Iterators and Memory Spaces

Although vector iterators are similar to pointers, they carry additional information. Notice that we did not have to instruct the sort algorithm that it was operating on the elements of a device_vector or hint that the copy was from device memory to host memory. In Thrust the memory spaces of each range are automatically inferred from the iterator arguments and used to dispatch the appropriate implementation.

In addition to memory space, Thrust's iterators implicitly encode a wealth of information which can guide the dispatch process. For instance, our sort example above operates on ints, a primitive data type with a fundamental comparison operation. In this case, Thrust dispatches a highly tuned radix sort algorithm [2], which is considerably faster than alternative comparison-based sorting algorithms.

