UH COSC 6385 - The IBM Cell, Intel Larrabee and Nvidia G80 processors


COSC 6385 Computer Architecture - Multi-Processors (II)
The IBM Cell, Intel Larrabee and Nvidia G80 processors
Edgar Gabriel
Fall 2009

References
• Intel Larrabee:
  [1] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, P. Hanrahan: “Larrabee: a many-core x86 architecture for visual computing”, ACM Trans. Graph., Vol. 27, No. 3 (August 2008), pp. 1-15. http://softwarecommunity.intel.com/UserFiles/en-us/File/larrabee_manycore.pdf
• IBM Cell processor:
  [2] C. R. Johns, D. A. Brokenshire: “Introduction to the Cell Broadband Engine Architecture”, IBM Journal of Research and Development, vol. 51, no. 5, pp. 503-519. http://www.research.ibm.com/journal/rd/515/johns.pdf
  [3] M. Kistler, M. Perrone, F. Petrini: “Cell Multiprocessor Communication Network: Built for Speed”, IEEE Micro, vol. 26, no. 3, pp. 10-23. http://hpc.pnl.gov/people/fabrizio/papers/ieeemicro-cell.pdf
• Nvidia G80:
  [4] Scott Wasson: “Nvidia GeForce 8800 graphics processor”. http://techreport.com/articles.x/11211/1
  [5] Peter N. Glaskowsky: “NVIDIA’s Fermi: The First Complete GPU Computing Architecture”. http://www.nvidia.com/content/PDF/fermi_white_papers/P.Glaskowsky_NVIDIA%27s_Fermi-The_First_Complete_GPU_Architecture.pdf

Larrabee Motivation
• Comparison of two architectures with the same number of transistors
  – Half the single-stream performance for the simplified cores
  – 20x increase in vector throughput for multi-stream execution (160 vs. 8 per clock)

                       2 out-of-order cores    10 in-order cores
  Instruction issue    4                       2
  VPU per core         4-wide SSE              16-wide
  L2 cache size        4 MB                    4 MB
  Single stream        4 per clock             2 per clock
  Vector throughput    8 per clock             160 per clock

Larrabee Overview
• Many-core visual computing architecture
• Based on x86 CPU cores
  – Extended version of the regular x86 instruction set
  – Supports subroutines and page faulting
• The number of x86 cores can vary depending on the implementation and processor version
• Fixed-function units for texture filtering
  – Other graphics operations such as rasterization or post-shader blending are done in software

Larrabee Overview (II)
Image Source: [1]

Overview of a Larrabee Core (I)
Image Source: [1]

Overview of a Larrabee Core (I)
• x86 core derived from the Pentium processor
  – No out-of-order execution
• Standard Pentium instruction set with the addition of
  – 64-bit instructions
  – Instructions for pre-fetching data into the L1 and L2 caches
  – Support for 4 simultaneous threads, with separate registers for each thread
• Each core is augmented with a wide vector processing unit (VPU)
• 32 KB L1 instruction cache, 32 KB L1 data cache
• 256 KB ‘local subset’ of the L2 cache
  – Coherent L2 cache across all cores

Vector Processing Unit in Larrabee
• 16-wide VPU executing integer, single- and double-precision floating-point operations
• VPU supports gather-scatter operations
  – The 16 elements can be loaded from or stored to up to 16 different addresses
• Support for predicated instructions using a mask control register (if-then-else statements)

Inter-Processor Ring Network
• Bi-directional ring network
• 512 bits wide per direction
• Routing decisions are made before a message is injected into the network

Larrabee Programming Models
• Most applications can be executed without modification due to the full support of the x86 instruction set
• Support for POSIX threads to create multiple threads
  – API extended by thread affinity parameters
• Recompiling code with Larrabee’s native compiler automatically generates code that uses the VPUs
• Alternative parallel approaches
  – Intel Threading Building Blocks
  – Larrabee-specific OpenMP directives

Larrabee Performance
Image Source: [1]

IBM Cell Overview (I)
• Cell Broadband Engine Architecture (CBEA) defined by a consortium of IBM, Sony, and Toshiba
• Originally targeting the multimedia industry
  – E.g. PlayStation 3, Toshiba HDTV, etc.
• Also sold by IBM as regular compute blades
  – IBM QS20, QS21, QS22
• Main idea: heterogeneous microprocessor consisting of
  – one (or more) general purpose processor elements (PPE) and
  – one (or more) synergistic processor elements (SPEs)

Cell Architecture block diagram
Image Source: [2]

• Two generations available so far:
  – Cell BE:
    • 204.8 GFLOPS single precision peak performance
    • 14.6 GFLOPS double precision peak performance
  – PowerXCell 8i (2008):
    • 204.8 GFLOPS single precision peak performance
    • 102.4 GFLOPS double precision peak performance
  – Both have 1 PPE and 8 SPEs

General Purpose Processor (PPE)
• Based on the IBM PowerPC processor
  – Supports multiple simultaneous operating environments (virtualization)
  – E.g. can execute an instance of a real-time operating system and an instance of a non-real-time operating system at the same time
• Performs management and application control functions

Synergistic Processor Element (SPE)
• SIMD processor used for offloading compute-intensive, data-parallel operations from the PPE
• Each SPE has its own local storage and can access data only from that local storage
  – Current versions of the Cell processor: 256 KB local storage
• The local storage is connected to the main memory through a Memory Flow Controller (MFC)
  – The MFC moves data between main memory and local storage, or between two SPEs

MFC commands
Image Source: [2]

Synergistic Processor Element (SPE) (II)
• Each SPE has 128 registers
• Each register is 128 bits wide and can hold
  – Sixteen 8-bit integers or
  – Eight 16-bit integers or
  – Four 32-bit integers or single precision floating-point numbers or
  – Two 64-bit integers or double precision floating-point numbers
• Most instructions supported by the synergistic
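
The register layouts listed on the SPE (II) slide can be illustrated with a small sketch. It is not Cell code: it only reinterprets the same 16 bytes under each layout using Python's standard struct module. The names REGISTER_BYTES, LANE_FORMATS, lanes and simd_add_f32 are hypothetical helpers introduced for this illustration, and big-endian byte order is assumed for the byte layout.

```python
import struct

REGISTER_BYTES = 16  # one 128-bit register, as stated on the slide

# Hypothetical mapping from each layout on the slide to a struct format code.
LANE_FORMATS = {
    "16 x 8-bit int": "16b",
    "8 x 16-bit int": "8h",
    "4 x 32-bit int": "4i",
    "4 x float32": "4f",
    "2 x 64-bit int": "2q",
    "2 x float64": "2d",
}


def lanes(register: bytes, layout: str) -> tuple:
    """Reinterpret the 16 register bytes as the lanes of the given layout."""
    if len(register) != REGISTER_BYTES:
        raise ValueError("a 128-bit register holds exactly 16 bytes")
    # ">" selects big-endian order with standard sizes (an assumption here).
    return struct.unpack(">" + LANE_FORMATS[layout], register)


def simd_add_f32(a: bytes, b: bytes) -> bytes:
    """Model one SIMD add: all four float32 lanes are added in a single step."""
    sums = [x + y for x, y in zip(lanes(a, "4 x float32"), lanes(b, "4 x float32"))]
    return struct.pack(">4f", *sums)


if __name__ == "__main__":
    reg = struct.pack(">4f", 1.0, 2.0, 3.0, 4.0)
    for layout in LANE_FORMATS:
        # Same bits, different lane interpretation.
        print(f"{layout:>15}: {lanes(reg, layout)}")
    print(lanes(simd_add_f32(reg, reg), "4 x float32"))
```

The point of the sketch is that the register contents never change; only the instruction (here, the chosen layout) decides whether the 128 bits are treated as sixteen, eight, four, or two lanes.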

