Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs

Michael I. Gordon, William Thies, and Saman Amarasinghe
Massachusetts Institute of Technology
Computer Science and Artificial Intelligence Laboratory
{mgordon, thies, saman}@mit.edu

Abstract

As multicore architectures enter the mainstream, there is a pressing demand for high-level programming models that can effectively map to them. Stream programming offers an attractive way to expose coarse-grained parallelism, as streaming applications (image, video, DSP, etc.) are naturally represented by independent filters that communicate over explicit data channels.

In this paper, we demonstrate an end-to-end stream compiler that attains robust multicore performance in the face of varying application characteristics. As benchmarks exhibit different amounts of task, data, and pipeline parallelism, we exploit all types of parallelism in a unified manner in order to achieve this generality. Our compiler, which maps from the StreamIt language to the 16-core Raw architecture, attains an 11.2x mean speedup over a single-core baseline, and a 1.84x speedup over our previous work.

Categories and Subject Descriptors D.3.2 [Programming Languages]: Language Classifications–Data-flow languages; D.3.3 [Programming Languages]: Language Constructs and Features–Concurrent programming structures; D.3.4 [Programming Languages]: Processors–Compilers, Optimization

General Terms Design, Languages, Performance

Keywords coarse-grained dataflow, multicore, Raw, software pipelining, StreamIt, streams

1. Introduction

As centralized microprocessors are ceasing to scale effectively, multicore architectures are becoming the industry standard. For example, the IBM/Toshiba/Sony Cell processor has 9 cores [17], the Sun Niagara has 8 cores [21], the RMI XLR732 has 8 cores [1], the IBM/Microsoft Xbox 360 CPU has 3 cores [4], and most vendors are shipping dual-core chips. Cisco has described a next-generation network processor containing 192 Tensilica Xtensa cores [14]. This trend has pushed the performance burden to the compiler, as future application-level performance gains depend on effective parallelization across the cores. Unfortunately, traditional programming models such as C, C++, and FORTRAN are ill-suited to multicore architectures because they assume a single instruction stream and a monolithic memory. Extracting coarse-grained parallelism suitable for multicore execution amounts to a heroic compiler analysis that remains largely intractable.

The stream programming paradigm offers a promising approach for exposing parallelism suitable for multicore architectures. Stream languages such as StreamIt [39], Brook [6], SPUR [42], Cg [27], Baker [9], and Spidle [10] are motivated not only by trends in computer architecture, but also by trends in the application space, as network, image, voice, and multimedia programs are becoming only more prevalent.

In the StreamIt language, a program is represented as a set of autonomous actors that communicate through FIFO data channels (see Figure 1). During program execution, actors fire repeatedly in a periodic schedule. As each actor has a separate program counter and an independent address space, all dependences between actors are made explicit by the communication channels. Compilers can leverage this dependence information to orchestrate parallel execution.
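To make the actor model concrete, the following is a minimal sketch in StreamIt-style syntax; the filter names, rates, and values are illustrative assumptions rather than code taken from the paper.

    // A small pipeline of autonomous actors. Each filter declares the
    // pop/push rates of its work function, so all communication over the
    // FIFO channels between actors is explicit.
    void->void pipeline Minimal {
        add Source();    // produces an integer stream
        add Scale(3);    // stateless actor: no dependence between firings
        add Printer();   // consumes the stream
    }

    void->int filter Source {
        int x;
        init { x = 0; }
        work push 1 {            // pushes one item per firing
            push(x++);
        }
    }

    int->int filter Scale(int factor) {
        work pop 1 push 1 {      // pops one item, pushes one item per firing
            push(factor * pop());
        }
    }

    int->void filter Printer {
        work pop 1 {
            println(pop());
        }
    }

Because every filter declares its I/O rates and Scale carries only the read-only parameter factor, the dependences between actors are visible to the compiler, which is exactly the information the discussion of task, data, and pipeline parallelism below relies on.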
Despite the abundance of parallelism in stream programs, it is nonetheless a challenging problem to obtain an efficient mapping to a multicore architecture. Often the gains from parallel execution can be overshadowed by the costs of communication and synchronization. In addition, not all parallelism has equal benefits, as there is sometimes a critical path that can only be reduced by running certain actors in parallel. Due to these concerns, it is critical to leverage the right combination of task, data, and pipeline parallelism while avoiding the hazards associated with each.

Task parallelism refers to pairs of actors that are on different parallel branches of the original stream graph, as written by the programmer. That is, the output of each actor never reaches the input of the other. In stream programs, task parallelism reflects logical parallelism in the underlying algorithm. It is easy to exploit by mapping each task to an independent processor and splitting or joining the data stream at the endpoints (see Figure 2b). The hazards associated with task parallelism are the communication and synchronization associated with the splits and joins. Also, as the granularity of task parallelism depends on the application (and the programmer), it is not sufficient as the only source of parallelism.

Data parallelism refers to any actor that has no dependences between one execution and the next. Such "stateless" actors (a stateless actor may still have read-only state) offer unlimited data parallelism, as different instances of the actor can be spread across any number of computation units (see Figure 2c). However, while data parallelism is well-suited to vector machines, on coarse-grained multicore architectures it can introduce excessive communication overhead. Previous data-parallel streaming architectures have focused on designing a special memory hierarchy to support this communication [18]. However, data parallelism has the hazard of increasing buffering and latency, and the limitation of being unable to parallelize actors with state.

Pipeline parallelism applies to chains of producers and consumers that are directly connected in the stream graph. In our previous work [15], we exploited pipeline parallelism by mapping clusters of producers and consumers to different cores and using an on-chip network for direct communication between actors (see Figure 2d). Compared to data parallelism, this approach offers reduced latency, reduced buffering, and good locality. It does not introduce any extraneous communication, and it provides the ability to execute any pair of
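As a rough sketch of how these forms of parallelism surface in source code, the fragment below uses illustrative StreamIt-style syntax with hypothetical filter names and is assumed to sit inside a larger pipeline: the two splitjoin branches are task-parallel, each stateless branch filter is also a candidate for data parallelism, and the producer/consumer chains surrounding the splitjoin are what pipeline parallelism targets.

    // Hypothetical splitjoin fragment: the two branches never feed each
    // other, so they are task-parallel as written by the programmer.
    float->float splitjoin TwoBands {
        split duplicate;        // both branches see every input item
        add Scale(0.5);
        add Offset(1.0);
        join roundrobin;        // interleave the branches' outputs
    }

    // Stateless: no dependence between one execution and the next, so the
    // compiler may also replicate it across cores (data parallelism). Its
    // parameter is read-only state, which does not prevent replication.
    float->float filter Scale(float factor) {
        work pop 1 push 1 { push(factor * pop()); }
    }

    float->float filter Offset(float delta) {
        work pop 1 push 1 { push(pop() + delta); }
    }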

