EXOCHI: Architecture and Programming Environment for A Heterogeneous Multi-core Multithreaded System

Perry H. Wang¹, Jamison D. Collins¹, Gautham N. Chinya¹, Hong Jiang², Xinmin Tian³, Milind Girkar³, Nick Y. Yang², Guei-Yuan Lueh², and Hong Wang¹

¹ Microarchitecture Research Lab, Microprocessor Technology Labs, Intel Corporation
² Graphics Architecture, Chipset Group, Intel Corporation
³ Intel Compiler Lab, Software Solutions Group, Intel Corporation
Contact: [email protected]

Abstract

Mainstream microprocessors will likely integrate specialized accelerators, such as GPUs, onto a single die to achieve better performance and power efficiency. However, it remains a keen challenge to program such a heterogeneous multi-core platform, since these specialized accelerators feature ISAs and functionality that are significantly different from the general purpose CPU cores. In this paper, we present EXOCHI: (1) Exoskeleton Sequencer (EXO), an architecture to represent heterogeneous accelerators as ISA-based MIMD architecture resources, and a shared virtual memory heterogeneous multithreaded program execution model that tightly couples specialized accelerator cores with general purpose CPU cores, and (2) C for Heterogeneous Integration (CHI), an integrated C/C++ programming environment that supports accelerator-specific inline assembly and domain-specific languages. The CHI compiler extends the OpenMP pragma for heterogeneous multithreading programming, and produces a single fat binary with code sections corresponding to different instruction sets. The runtime can judiciously spread parallel computation across the heterogeneous cores to optimize performance and power.

We have prototyped the EXO architecture on a physical heterogeneous platform consisting of an Intel® Core™ 2 Duo processor and an 8-core 32-thread Intel® Graphics Media Accelerator X3000. In addition, we have implemented the CHI integrated programming environment with the Intel® C++ Compiler, runtime toolset, and debugger.
On the EXO prototype system, we have enhanced a suite of production-quality media kernels for video and image processing to utilize the accelerator through the CHI programming interface, achieving significant speedup (1.41X to 10.97X) over execution on the IA32 CPU alone.

Categories and Subject Descriptors C.1.4 [Processor Architectures]: Parallel Architectures; D.3.4 [Programming Languages]: Processors - Compilers

General Terms Performance, Design, Languages

Keywords Heterogeneous multi-cores, GPU, OpenMP

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

PLDI'07 June 11–13, 2007, San Diego, California, USA.
Copyright © 2007 ACM 978-1-59593-633-2/07/0006...$5.00

1. Introduction

The relentless pace of Moore's Law will lead to mainstream multi-core microprocessor designs with extensive on-die integration of a large number of cores [14]. Fundamentally, to scale multi-core processor designs to incorporate a large number of cores, ultra low EPI (Energy Per Instruction) cores are essential [10]. For example, to achieve a 20X improvement (e.g., from 5 GOPS to 100 GOPS) in throughput performance while staying below the power envelope of 150 W, the building-block cores must have an average EPI of approximately 1 nJ. The EPI for the Intel® Core™ 2 Duo processor core [31] is approximately 10 nJ while the EPI for the 8-core 32-thread Intel® Graphics Media Accelerator X3000 [15] is only 0.3 nJ. One approach to improving EPI by an order of magnitude is through heterogeneous multi-core design, in which some cores vary in functionality, instruction set (ISA), performance, power, and energy efficiency [2, 17].
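
The EPI figures quoted above can be sanity-checked by dividing power by throughput. The following is a rough back-of-the-envelope sketch, not a calculation taken from the paper: it assumes that the entire 150 W envelope is spent executing instructions and that one operation corresponds to one instruction.

```latex
\[
\mathrm{EPI}_{\max}
  = \frac{P}{\text{throughput}}
  = \frac{150\,\mathrm{W}}{100 \times 10^{9}\,\mathrm{ops/s}}
  = 1.5\,\mathrm{nJ}
\]
\[
P_{10\,\mathrm{nJ}} = 10\,\mathrm{nJ} \times 100 \times 10^{9}\,\mathrm{ops/s} = 1000\,\mathrm{W},
\qquad
P_{0.3\,\mathrm{nJ}} = 0.3\,\mathrm{nJ} \times 100 \times 10^{9}\,\mathrm{ops/s} = 30\,\mathrm{W}
\]
```

Under these assumptions, the 1.5 nJ upper bound is of the same order as the approximately 1 nJ target stated above. Sustaining 100 GOPS at 10 nJ per instruction would need more than six times the 150 W envelope, whereas 0.3 nJ per instruction would need only a fraction of it.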
The key challenge then becomes how to accomplish such heterogeneous integration and achieve high performance while still maintaining the look-and-feel of the classic mainstream IA32-based programming models and software ecosystem.

In this paper, we present EXOCHI: Exoskeleton Sequencer (EXO), an architecture to represent heterogeneous accelerators as ISA-based MIMD architectural resources, and C for Heterogeneous Integration (CHI), a programming environment that supports tightly-coupled integration of heterogeneous cores. The EXO architecture supports the familiar POSIX shared virtual memory multithreaded programming model for heterogeneous cores. Like application-managed sequencers in the Multiple Instruction Stream Processor (MISP) architecture [11], the non-IA32 cores are architecturally exposed to the programmer as a new form of sequencer resource. They can be regarded essentially as application-level MIMD functional units on which user-level threads, or shreds, encoded in the accelerator-specific ISA can execute. Having a shared virtual address space between the IA32 sequencer and accelerator sequencers facilitates code and data sharing and harmonizes cooperation between the concurrent shreds of different ISAs.

The CHI integrated programming environment allows an application developer to inline blocks of accelerator-specific assembly or domain-specific language with traditional C/C++ code. The CHI compiler produces a single fat binary consisting of executable code sections corresponding to the different ISAs. CHI further extends the OpenMP pragmas [25, 26, 30] to allow the programmer to express thread-level parallelism by demarcating parallel regions of code targeting non-IA32 accelerators. The CHI extensions to OpenMP support both fork-join and producer-consumer parallelism among the accelerator shreds and between the IA32 shreds and the accelerator shreds.
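
For readers unfamiliar with the fork-join model that CHI builds on, the sketch below shows it in standard OpenMP for C only. It does not use CHI's accelerator-targeting extensions, whose syntax is not given in this part of the text, and the kernel name `brighten` and its parameters are hypothetical, chosen only to resemble a simple image-processing kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical media kernel: add `delta` to each 8-bit pixel,
 * saturating the result to the range [0, 255]. */
void brighten(const uint8_t *src, uint8_t *dst, size_t n, int delta)
{
    assert(src != NULL && dst != NULL);

    /* Fork: with OpenMP enabled, the loop iterations are divided among a
     * team of threads. Join: an implicit barrier at the end of the loop
     * makes the caller wait until every iteration has finished.
     * Without OpenMP support the pragma is ignored and the loop runs
     * sequentially with the same result. */
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++) {
        int v = (int)src[i] + delta;
        if (v < 0) {
            v = 0;
        } else if (v > 255) {
            v = 255;
        }
        dst[i] = (uint8_t)v;
    }
}
```

Each iteration writes a distinct element of `dst`, so the threads need no synchronization beyond the implicit barrier. In CHI, as described above, a parallel region of this kind would instead be demarcated for a non-IA32 accelerator and executed by accelerator shreds.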
The CHI runtime can judiciously spread the shreds across the heterogeneous sequencers dynamically to maximize throughput performance while minimizing power.

This paper makes the following contributions:

• We describe the EXO architecture and the CHI programming environment that support shared virtual memory multithreaded programs for a heterogeneous multi-core processor.
• We detail a heterogeneous multi-core prototype of the EXO architecture consisting of an Intel® Core™ 2 Duo [31] processor and an 8-core 32-thread Intel® Graphics Media Accelerator (GMA) X3000 [15].
• We present an implementation of the CHI programming

