UT CS 378 - Scan Primitives for GPU Computing - D2894423

Home> Schools> University of Texas at Austin> Computer Science (CS) > CS 378> Scan Primitives for GPU Computing

DOC PREVIEW

UT CS 378 - Scan Primitives for GPU Computing

School name University of Texas at Austin

Course Cs 378- Declarative Programming

Pages 11

This preview shows page 1-2-3-4 out of 11 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 11 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 11 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 11 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 11 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 11 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

Graphics Hardware (2007)Timo Aila and Mark Segal (Editors)Scan Primitives for GPU ComputingShubhabrata Sengupta, Mark Harris, Yao Zhang, and John D. OwensUniversity of California, DavisNVIDIA CorporationAbstractThe scan primitives are powerful, general-purpose data-parallel primitives that are building blocks for a broadrange of applications. We describe GPU implementations of these primitives, specifically an efficient formulationand implementation of segmented scan,onNVIDIAGPUsusingtheCUDAAPI.Usingthescanprimitives,weshow novel GPU implementations of quicksort and sparse matrix-vector multiply, and analyze the performanceof the scan primitives, several sort algorithms that use the scan primitives, and a graphical shallow-water fluidsimulation using the scan framework for a tridiagonal matrix solver.1. Introduction and MotivationBy the end of 2007, the most advanced processors will sur-pass one billion transistors on a single chip. This wealth ofcomputational resources has yielded GPUs that can sustainhundreds of GFLOPS on not just graphics applications butalso a wide variety of computationally demanding general-purpose problems.The primary reason that GPUs deliver such high per-formance is that the GPU is a highly parallel machine.NVIDIA’s latest flagship GPU, for instance, boasts 128 pro-cessors. GPUs keep these processors busy by juggling thou-sands of parallel computational threads. Graphics workloadsare perfectly suited to delivering large amounts of fine-grained parallel work. The programmable parts of the graph-ics pipeline operate on primitives (vertices and fragments),with hundreds of thousands to millions of each type of prim-itive in a typical frame. These primitive programs spawn athread for each primitive to keep the parallel processors full.It is instructive to look at the graphics programmingmodel as the GPU becomes more general purpose. Let usconsider fragment processing as a representative example.In both the OpenGL and DirectX APIs, fragment processingis completely data-parallel. The fragment processors iterateover each input fragment, producing a fixed number of out-puts per fragment that depend only on the incoming frag-ment (Lefohn et al. call this access pattern a “single-accessiterator” [LKS∗06]). It is this explicit data parallelism thatenables the effective use of so many parallel processors inrecent GPUs.This explicit parallelism, in turn, is well suited for astream programming model in which a single program (akernel) operates in parallel on each input element, pro-ducing one output element for each input element. Theprogrammable fragment and vertex parts of the graphicspipeline precisely match this strict stream programmingmodel. The stream model extends nicely to problems whereeach output element is a function not just of a single in-put but of a small, bounded neighborhood of inputs (a“neighborhood-access iterator”). Most general-purpose ap-plications that have been mapped efficiently to GPUs fitnicely into this more general stream programming model(for instance, particle systems, image processing, grid-basedfluid simulations, and dense matrix algebra).1.1. More Complex Operations are NeededConsider a fragment program that operates on n fragments.With a single-access iterator, each output fragment must ac-cess a single input fragment; with a neighborhood-access it-erator, each output fragment must access a small boundednumber of input fragments. The total number of accessesnecessary to compute all fragments is O(n).Many interesting problems, however, have more complexaccess requirements than we can support with a single- orneighborhood-access iterator. It is these problems that weaddress in this paper. A common algorithmic pattern thatarises in many parallel applications with complex access re-quirements is the prefix-sum algorithm. The input to prefix-sum is an array of values. The output is an equally sized arrayin which each element is the sum of all values that precededit in the input array:Copyrightc 2007 by the Association for Computing Machinery, Inc.Permission to make digital or hard copies of part or all of this work for personal or class-room use is granted without fee provided that copies are not made or distributed for com-mercial advantage and that copies bear this notice and the full citation on the first page.Copyrights for components of this work owned by others than ACM must be honored. Ab-stracting with credit is permitted. To copy otherwise, to republish, to post on servers, orto redistribute to lists, requires prior specific permission and/or a fee. Request permissionsfrom Permissions Dept, ACM Inc., fax +1 (212) 869-0481 or e-mail [email protected] 2007, San Diego, California, August 04 - 05, 2007c 2007 ACM 978-1-59593-625-7/07/0008 $ 5.00Sengupta, Harris, Zhang, and Owens / Scan Primitives for GPU Computingin:31704163out:0341111141622How do we compute the value of the last element in theoutput (22)? That value is a function of every value in theinput stream. With a serial processor, such a computation istrivial, but with a parallel processor, it is more difficult. Thenaive way to compute each output in parallel is for every el-ement in the output stream to sum all preceding values fromthe input stream. This approach requires O(n2) total memoryaccesses and addition operations. This cost is too high.1.2. Scan: An Efficient Parallel PrimitiveWe are interested in finding efficient solutions to parallelproblems in which each output requires global knowledge ofthe inputs. We attack these problems using a family of algo-rithms called the scan primitives. Our scan implementationhas a serial work complexity of O(n). While the standardscan primitive was introduced to the GPU by Horn [Hor05],in this paper we introduce the segmented scan primitive tothe GPU, and present new approaches to implementing sev-eral classic applications using the scan primitives.2. PrimitivesWe chose NVIDIA’s CUDA GPU Computing environ-ment [NVI07] for our implementation. CUDA provides adirect, general-purpose C language interface to the pro-grammable processors on NVIDIA’s 8-series GPUs, elim-inating much of the complexity of writing GPGPU appli-cations using graphics APIs such as OpenGL. Furthermore,CUDA exposes some important new hardware features thathave large benefits to the performance of data-parallel com-putations:General Load-Store Memory Architecture CUDA al-lows arbitrary gather and scatter memory access fromGPU

View Full Document