Princeton ELE 572 - A Coprocessor for Streaming Multimedia Acceleration


PipeRench: A Coprocessor for Streaming Multimedia Acceleration

Seth Copen Goldstein†, Herman Schmit∗, Matthew Moe∗, Mihai Budiu†, Srihari Cadambi∗, R. Reed Taylor∗, Ronald Laufer∗
School of Computer Science† and Department of ECE∗, Carnegie Mellon University, Pittsburgh, PA 15213
†{seth,mihaib}@cs.cmu.edu  ∗{herman,moe,cadambi,rt2i,rel}@ece.cmu.edu

Abstract

Future computing workloads will emphasize an architecture's ability to perform relatively simple calculations on massive quantities of mixed-width data. This paper describes a novel reconfigurable fabric architecture, PipeRench, optimized to accelerate these types of computations. PipeRench enables fast, robust compilers, supports forward compatibility, and virtualizes configurations, thus removing the fixed size constraint present in other fabrics. For the first time we explore how the bit-width of processing elements affects performance and show how the PipeRench architecture has been optimized to balance the needs of the compiler against the realities of silicon. Finally, we demonstrate extreme performance speedup on certain computing kernels (up to 190x versus a modern RISC processor), and analyze how this acceleration translates to application speedup.

1. Introduction

Workloads for computing devices are rapidly changing. On the desktop, the integration of digital media has made real-time media processing the primary challenge for architects [10]. Embedded and wireless computing devices need to process copious data streaming from sensors and receivers. These changes emphasize simple, regular computations on large sets of small data elements. There are two important respects in which this need does not match the processing strengths of conventional processors. First, the size of the data elements underutilizes the processor's wide datapath. Second, the instruction bandwidth is much higher than it needs to be to perform regular, dataflow-dominated computations on large data sets.

Both of these problems are being addressed through processor architecture. Most recent ISAs have multimedia instruction set extensions that allow a wide datapath to be switched into SIMD operation [19]. The instruction bandwidth issue has created renewed interest in vector processing [14, 27].
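To make the sub-word SIMD idea concrete, the minimal sketch below (our own illustration, not from the paper; it uses the later SSE2 x86 intrinsics as a stand-in for the multimedia extensions cited above) adds two streams of 8-bit samples, sixteen at a time, on a 128-bit datapath:

    #include <emmintrin.h>   /* SSE2 intrinsics (illustrative choice, not the paper's) */
    #include <stdint.h>
    #include <stddef.h>

    /* Saturating add of two streams of 8-bit samples.  Each 128-bit SIMD
       instruction handles 16 samples instead of 16 separate scalar adds. */
    void add_saturate_u8(const uint8_t *a, const uint8_t *b, uint8_t *out, size_t n)
    {
        size_t i = 0;
        for (; i + 16 <= n; i += 16) {
            __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
            __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
            _mm_storeu_si128((__m128i *)(out + i), _mm_adds_epu8(va, vb));
        }
        for (; i < n; i++) {      /* scalar tail for the remaining samples */
            unsigned s = (unsigned)a[i] + b[i];
            out[i] = (uint8_t)(s > 255 ? 255 : s);
        }
    }

Even with such extensions, the packed operations and datapath width are fixed by the ISA, and an instruction stream must still be fetched for every loop iteration.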
A fundamentally different way of addressing these problems is to configure connections between programmable logic elements and registers in order to construct an efficient, highly parallel implementation of the processing kernel. This interconnected network of processing elements is called a reconfigurable fabric, and the data set used to program the interconnect and processing elements is a configuration. After a configuration is loaded into a reconfigurable fabric, there is no further instruction bandwidth required to perform the computation. Furthermore, because the operations are composed of small basic elements, the size of the processing elements can closely match the required data size. This approach is called reconfigurable computing.

Despite reports of amazing performance [11], reconfigurable computing has not been accepted as a mainstream computing technology, because most previous efforts were based upon, or inspired by, commercial FPGAs and fail to meet the requirements of the marketplace. The problems inherent in using standard FPGAs include:

1. Logic granularity: FPGAs are designed for logic replacement. The granularity of the functional units is optimized to replace random logic, not to perform multimedia computations.

2. Configuration time: The time it takes to load a configuration into the fabric is called configuration time. In commercial FPGAs, configuration times range from hundreds of microseconds to hundreds of milliseconds. To show a performance improvement, this start-up latency must be amortized over huge data sets, which limits the applicability of the technique.

3. Forward compatibility: FPGAs require redesign or recompilation to gain benefit from future generations of the chip.

4. Hard constraints: FPGAs can implement only kernels of a fixed and relatively small size. This is part of the reason that compilation is difficult: everything must fit. It also causes large and unpredictable discontinuities between kernel size and performance.

5. Compilation time: Currently, the synthesis, placement, and routing phases of design take hundreds of times longer than compiling the same kernel for a general-purpose processor.

This paper describes PipeRench, a reconfigurable fabric designed to increase performance on future computing workloads. PipeRench realizes the performance promises of reconfigurable computing while solving the problems outlined above. PipeRench uses a technique called pipeline reconfiguration to solve the problems of compilability, reconfiguration time, and forward compatibility. The architectural parameters of PipeRench, including the logic block granularity, were selected to optimize the performance of a suite of kernels, balancing the needs of a compiler against design realities in deep-submicron process technology.

PipeRench is currently used as an attached processor. This places significant limitations on the types of applications that can realize a speedup, due to the limited bandwidth between PipeRench, the main memory, and the processor. We believe this represents the initial phase in the evolution of reconfigurable processors. Just as floating-point computation migrated from software emulation, to attached processors, to coprocessors, and finally to full incorporation into processor ISAs, so will reconfigurable computing eventually be integrated into the CPU.

In the next section, we use several examples to illustrate the advantages and architectural requirements of reconfigurable fabrics. We introduce the idea of pipeline reconfiguration in Section 3 and describe how this technique solves the practical problems faced by reconfigurable computing. Section 4 describes a class of architectures that can implement pipelined reconfiguration. We evaluate these architectures in Section 5. We cover related work in Section 6, and in Section 7 we summarize and discuss future research.

2. Reconfigurable Computing

2.1. Attributes of Target Kernels

Functions for which a reconfigurable fabric can provide a significant benefit exhibit one or more of the following features:

1. The function operates on bit-widths that are different from the processor's basic word size.

2. The data dependencies in the function allow multiple function units to operate in parallel.

for (int i=0; i<maxInput;
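The preview cuts off at the beginning of a C loop. As a stand-in only, the following minimal sketch shows a kernel with both attributes above; the FIR-style 3-tap filter and the names fir8 and coeff are hypothetical, and maxInput is assumed to count 8-bit input samples:

    #include <stdint.h>

    enum { NUM_TAPS = 3 };   /* hypothetical 3-tap filter */

    /* Illustrative streaming kernel (not from the paper): a small FIR-style
     * filter over a stream of 8-bit samples.
     *   - The data are 8 bits wide, far narrower than a 32- or 64-bit datapath.
     *   - Each output depends only on a 3-sample window, so the multiplies and
     *     adds for different outputs are independent and could be mapped onto
     *     parallel function units in a reconfigurable fabric.
     * Coefficients are assumed to sum to roughly 256, so shifting the
     * accumulator right by 8 renormalizes the result back to 8 bits. */
    void fir8(const uint8_t *in, uint8_t *out,
              const uint8_t coeff[NUM_TAPS], int maxInput)
    {
        for (int i = 0; i + NUM_TAPS <= maxInput; i++) {
            uint32_t acc = 0;                 /* a few bits wider than the data */
            for (int t = 0; t < NUM_TAPS; t++)
                acc += (uint32_t)in[i + t] * coeff[t];
            out[i] = (uint8_t)(acc >> 8);
        }
    }

On a conventional processor, each iteration of such a loop consumes instruction bandwidth and uses only a fraction of the wide datapath; mapped onto a reconfigurable fabric, the dataflow is encoded once in the configuration and the narrow operators can be sized to the 8-bit data.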

