MIT 6 375 - Implementing the Graphics Pipeline on a Heterogeneous Multicore - D2256778


School name Massachusetts Institute of Technology

Course 6 375- Complex Digital Systems

Pages 16

This preview shows page 1-2-3-4-5 out of 16 pages.


Implementing the Graphics Pipeline on a Heterogeneous Multicore

Jiawen Chen, Jonathan Ragan-Kelley

1 Introduction

General-purpose microprocessors (CPUs) are undergoing a radical change of direction, from traditionally narrow designs optimized for single-threaded ILP, towards exploiting far greater thread-level and data parallelism.

Meanwhile, more than a decade ago, real-time graphics emerged among the first successful commodity applications of data-parallel processors. In contrast to microprocessors, graphics processors (GPUs) leveraged highly specialized, application-specific architectures. More recently, GPUs have become programmable, dominated by general-purpose processing units which look increasingly similar to highly parallel CPUs.

As the two architectures become similar, there are strong efficiency and economic advantages to convergence. Shared design, verification, and fabrication efforts produce massive economies of scale. At the same time, shared physical hardware, buses, and memories increase static hardware utilization across all applications: since there is no specialized hardware going unused, in principle, adding general-purpose transistors and bandwidth for one application improves the performance of all applications. Finally, unified resources smooth out dynamic load imbalance to dramatically increase total throughput in intensive, variable data-rate applications like graphics.

Nevertheless, key parts of the graphics pipeline still benefit significantly from fixed-function logic implementations because they are massively arithmetically parallel, can be deeply pipelined due to a lack of inter-computation dependencies, and require minimal control overhead.

We explore the convergence of graphics processor architectures with conventional microprocessor architectures. A key point in this design space is the heterogeneous manycore processor, with graphics running as a hybrid software/hardware workload across different types of cores.
We implement a basic programmable graphics pipeline on a simple, scalable multicore CPU architecture aiming towards GPU-competitive performance in graphics applications. We use graphics as a motivating workload for effectively combining general-purpose processors with specialized and fixed-function elements, and understanding the more general issues in efficiently implementing complex parallel applications on such an architecture.

2 The Graphics Pipeline

We first introduce a simplified, logical version of the modern graphics pipeline (Fig. 1). We consider the pipeline to consist of four stages: vertex processing, rasterization, pixel processing, and raster operations. The application submits vertices into the pipeline, which typically contain several attributes, including position, normal, color, and other interpolants. Vertices are processed in a 1-in-1-out fashion by a user-programmable vertex shader. Rasterization interprets triplets of the processed vertices as triangles and interpolates vertex attributes at each pixel to output a (variable) number of fragments. Each fragment contains a screen space position and the value of each vertex attribute linearly interpolated across the triangle at that point. These fragments are fed to the fragment shader, which executes a user-supplied program that computes color and depth based on the interpolated information. Finally, the fragment color and depth are output to the raster operations unit (raster ops or ROPs), which tests depth against the framebuffer to determine visibility, and conditionally updates depth and blends color into the framebuffer.

Semantically, the pipeline executes in-order. The incoming vertex stream has obvious ordering requirements: triples of vertices are interpreted as a triangle by the rasterizer.
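The four logical stages described in this section can be sketched in software as follows. This is a minimal, sequential illustration and not the implementation discussed in this paper: the names `Vertex`, `Fragment`, `edge`, `rasterize`, and `render` are hypothetical, vertex positions are assumed to already be in screen space, the only attribute carried is color, and the raster operation is simplified to an opaque less-than depth test with no blending.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Tuple

Color = Tuple[float, float, float]


@dataclass
class Vertex:
    position: Tuple[float, float, float]  # assumed screen-space x, y and depth z
    color: Color


@dataclass
class Fragment:
    x: int
    y: int
    z: float
    color: Color


def edge(a, b, p) -> float:
    """Twice the signed area of the triangle (a, b, p), using x and y only."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def rasterize(v0: Vertex, v1: Vertex, v2: Vertex,
              width: int, height: int) -> Iterator[Fragment]:
    """Emit one fragment per covered pixel, interpolating attributes linearly."""
    p0, p1, p2 = v0.position, v1.position, v2.position
    area = edge(p0, p1, p2)
    if area == 0:
        return  # degenerate triangle covers no pixels
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)  # sample at the pixel center
            # Barycentric weights; dividing by area handles both windings.
            w0 = edge(p1, p2, p) / area
            w1 = edge(p2, p0, p) / area
            w2 = edge(p0, p1, p) / area
            if w0 < 0 or w1 < 0 or w2 < 0:
                continue  # pixel center lies outside the triangle
            z = w0 * p0[2] + w1 * p1[2] + w2 * p2[2]
            color = tuple(w0 * a + w1 * b + w2 * c
                          for a, b, c in zip(v0.color, v1.color, v2.color))
            yield Fragment(x, y, z, color)


def render(vertices: List[Vertex],
           vertex_shader: Callable[[Vertex], Vertex],
           fragment_shader: Callable[[Fragment], Tuple[Color, float]],
           width: int, height: int):
    """Run all four stages in order and return (color buffer, depth buffer)."""
    depth = [[float("inf")] * width for _ in range(height)]
    color = [[(0.0, 0.0, 0.0)] * width for _ in range(height)]
    # Vertex processing: one output vertex per input vertex.
    shaded = [vertex_shader(v) for v in vertices]
    # Rasterization: consecutive triples of vertices form triangles.
    for i in range(0, len(shaded) - 2, 3):
        for frag in rasterize(shaded[i], shaded[i + 1], shaded[i + 2],
                              width, height):
            # Pixel processing: compute final color and depth.
            c, z = fragment_shader(frag)
            # Raster operations: depth test, then conditional update
            # (simplified here to an opaque write instead of blending).
            if z < depth[frag.y][frag.x]:
                depth[frag.y][frag.x] = z
                color[frag.y][frag.x] = c
    return color, depth
```

Calling `render` with an identity vertex shader and a fragment shader that returns `(frag.color, frag.z)` draws the submitted triangles with depth-tested visibility. Because this sketch is strictly sequential, the in-order semantics hold trivially; the ordering concerns discussed in this section only arise once stages run in parallel.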
If multiple vertex shaders execute in parallel and out of order, the rasterizer must reorder its inputs. Similarly, fragments must commit in order to the raster ops to ensure correct blending (e.g. for transparency).

[Figure 1 diagram labels: vertices, Vertex Shader, transformed vertices, Rasterizer, fragment interpolants, Fragment Shader, shaded fragments, Raster Operations, pixel color and z, Frame Buffer, image; inputs: vertex shader program, fragment shader program]

Figure 1: A simple logical graphics pipeline. The application specifies a stream of vertices describing the 3D scene geometry (left), along with vertex and fragment shader programs and parameters, to ultimately generate an image (right). The vertex and fragment shader stages (orange) are the only programmable elements. They execute general-purpose user-defined programs, but can only read from memory. The raster operations perform read-modify-write operations against the framebuffer memory. They are configurable (brown), but not general-purpose programmable, and are the only stage which writes memory. Rasterization is not programmable (green), performs no memory access, and has extremely high arithmetic intensity.

3 High-level Design

We start out with a scalable, throughput-oriented manycore architecture in the style of the MIT RAW processor [Taylor et al. 2002] (Fig. 2, left).

[Figure 2 diagram labels: left panel "Classical Multicore" with four SMIPS tiles; right panel "Heterogeneous Multicore (executing graphics)" with tiles SMIPS (shader), SMIPS (shader), SMIPS (ROP), and Rasterizer, taking vertices in and producing an image]

Figure 2: Tiled chip-multiprocessors. Left: a classical multicore CPU, with many identical general-purpose processors (orange). Right: a heterogeneous multicore with one general-purpose processor replaced by a fixed-function rasterizer block (green) for efficiency, shown executing the graphics pipeline in Fig. 1. The ROP tile executes a configurable fixed-function pipeline stage in software on a general-purpose processor tile (dotted orange/brown).

This design employs a 2D tiled network where tiles are decoupled both electrically in hardware and semantically in the Bluespec definition, through Connectable Get/Put interfaces to their immediate neighbors. This enables electrical scalability by replacing global wires with local, neighbor-to-neighbor network links, as well as transparently allowing tiles to operate in independent clock domains. It further enables tiles to be built and verified independently, and synthesized in new configurations.

Pure software graphics implementations on these architectures are known to be disproportionately dominated by the cost of some traditionally fixed-function stages, particularly rasterization [Chen et al. 2005], preventing them from even approaching GPU-competitive performance in equivalent area. We therefore explore a simple extension to such a
