Brook for GPUs: Stream Computing on Graphics Hardware

Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, Pat Hanrahan
Stanford University

Abstract

In this paper, we present Brook for GPUs, a system for general-purpose computation on programmable graphics hardware. Brook extends C to include simple data-parallel constructs, enabling the use of the GPU as a streaming coprocessor. We present a compiler and runtime system that abstracts and virtualizes many aspects of graphics hardware. In addition, we present an analysis of the effectiveness of the GPU as a compute engine compared to the CPU, to determine when the GPU can outperform the CPU for a particular algorithm. We evaluate our system with five applications, the SAXPY and SGEMV BLAS operators, image segmentation, FFT, and ray tracing.
For these applications, we demonstrate that our Brook implementations perform comparably to hand-written GPU code and up to seven times faster than their CPU counterparts.

CR Categories: I.3.1 [Computer Graphics]: Hardware Architecture—Graphics processors; D.3.2 [Programming Languages]: Language Classifications—Parallel Languages

Keywords: Programmable Graphics Hardware, Data Parallel Computing, Stream Computing, GPU Computing, Brook

1 Introduction

In recent years, commodity graphics hardware has rapidly evolved from being a fixed-function pipeline into having programmable vertex and fragment processors. While this new programmability was introduced for real-time shading, it has been observed that these processors feature instruction sets general enough to perform computation beyond the domain of rendering. Applications such as linear algebra [Krüger and Westermann 2003], physical simulation [Harris et al. 2003], and a complete ray tracer [Purcell et al. 2002; Carr et al. 2002] have been demonstrated to run on GPUs.

Originally, GPUs could only be programmed using assembly languages. Microsoft’s HLSL, NVIDIA’s Cg, and OpenGL’s GLslang allow shaders to be written in a high-level, C-like programming language [Microsoft 2003; Mark et al. 2003; Kessenich et al. 2003]. However, these languages do not assist the programmer in controlling other aspects of the graphics pipeline, such as allocating texture memory, loading shader programs, or constructing graphics primitives. As a result, the implementation of applications requires extensive knowledge of the latest graphics APIs as well as an understanding of the features and limitations of modern hardware. In addition, the user is forced to express their algorithm in terms of graphics primitives, such as textures and triangles.
As a result, general-purpose GPU computing is limited to only the most advanced graphics developers.

This paper presents Brook, a programming environment that provides developers with a view of the GPU as a streaming coprocessor. The main contributions of this paper are:

• The presentation of the Brook stream programming model for general-purpose GPU computing. Through the use of streams, kernels and reduction operators, Brook abstracts the GPU as a streaming processor.

• The demonstration of how various GPU hardware limitations can be virtualized or extended using our compiler and runtime system; specifically, the GPU memory system, the number of supported shader outputs, and support for user-defined data structures.

• The presentation of a cost model for comparing GPU vs. CPU performance tradeoffs to better understand under what circumstances the GPU outperforms the CPU.

2 Background

2.1 Evolution of Streaming Hardware

Programmable graphics hardware dates back to the original programmable framebuffer architectures [England 1986]. One of the most influential programmable graphics systems was the UNC PixelPlanes series [Fuchs et al. 1989] culminating in the PixelFlow machine [Molnar et al. 1992]. These systems embedded pixel processors, running as a SIMD processor, on the same chip as framebuffer memory. Peercy et al. [2000] demonstrated how the OpenGL architecture [Woo et al. 1999] can be abstracted as a SIMD processor. Each rendering pass implements a SIMD instruction that performs a basic arithmetic operation and updates the framebuffer atomically. Using this abstraction, they were able to compile RenderMan to OpenGL 1.2 with imaging extensions. Thompson et al.
[2002] explored the use of GPUs as a general-purpose vector processor by implementing a software layer on top of the graphics library that performed arithmetic computation on arrays of floating point numbers.

SIMD and vector processing operators involve a read, an execution of a single instruction, and a write to off-chip memory [Russell 1978; Kozyrakis 1999]. This results in significant memory bandwidth use. Today’s graphics hardware executes small programs where instructions load and store data to local temporary registers rather than to memory. This is a major difference between the vector and stream processor abstraction [Khailany et al. 2001].

[Figure 1 diagram labels: Input Registers, Output Registers, Constants, Temp Registers, Textures, Shader Program]

Figure 1: Programming model for current programmable graphics hardware. A shader program operates on a single input element (vertex or fragment) stored in the input registers and writes the execution result into the output registers.

© 2004 ACM 0730-0301/04/0800-0777 $5.00

The stream programming model captures computational locality not present in the SIMD or vector models through the use of streams and kernels. A stream is a collection of records requiring similar computation while kernels are functions applied to each element of a stream. A streaming processor executes a kernel over all elements of an input stream, placing the results into an output stream.
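The stream and kernel terminology above can be illustrated with a small CPU-side sketch. This is plain Python, not Brook syntax, and it says nothing about how Brook maps work onto graphics hardware. The names `run_kernel`, `saxpy_kernel`, and `reduce_stream` are hypothetical and chosen only for this illustration; SAXPY is used because it is one of the applications named in the abstract.

```python
from functools import reduce


def run_kernel(kernel, *streams):
    """Apply `kernel` to corresponding records of the input streams.

    Each output record depends only on the input records at the same
    position, which is the property that makes the computation data-parallel.
    (Hypothetical helper for illustration; not part of Brook.)
    """
    lengths = {len(s) for s in streams}
    if len(lengths) != 1:
        raise ValueError("all input streams must have the same length")
    return [kernel(*records) for records in zip(*streams)]


def saxpy_kernel(alpha):
    """Build a per-element SAXPY kernel computing alpha * x + y."""
    def kernel(x, y):
        return alpha * x + y
    return kernel


def reduce_stream(op, stream, initial):
    """Combine all records of a stream into one value with a binary operator."""
    return reduce(op, stream, initial)


# Two input streams of equal length.
x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]

# Kernel execution: one output record per input position.
result = run_kernel(saxpy_kernel(2.0), x, y)

# Reduction: sum the output stream to a single value.
total = reduce_stream(lambda a, b: a + b, result, 0.0)
```

In this sketch the kernel never reads neighbouring records or shared state, so the per-element calls could in principle run in any order or in parallel. The reduction is kept as a separate step because it combines records rather than mapping over them.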