When IBM, Sony, and Toshiba launched the Cell project [1] in 2000, the design goal was to improve performance an order of magnitude over that of desktop systems shipping in 2005. To meet that goal, designers had to optimize performance against area, power, volume, and cost, but clearly single-core designs offered diminishing returns on investment [1-3]. If increased efficiency was the overriding concern, legacy architectures, which typically incur a big overhead per data operation, would not suffice.

Thus, the design strategy was to exploit architecture innovation at all levels directed at increasing efficiency to deliver the most performance per area invested, reduce the area per core, and have more cores in a given chip area. In this way, the design would exploit application parallelism while supporting established application models, and thereby ensure good programmability as well as programmer efficiency.

The result was the Cell Broadband Engine Architecture, which is based on heterogeneous chip multiprocessing. Its first implementation is the Cell Broadband Engine (Cell BE). The Cell BE supports scalar and single-instruction, multiple-data (SIMD) execution equally well and provides a high-performance multithreaded execution environment for all applications. The streamlined, data-processing-oriented architecture enabled a design with smaller cores and thus more cores on a chip [4]. This translates to improved performance for all programs with thread-level parallelism, regardless of their ability to exploit data-level parallelism.

One of the key architecture features that enable the Cell BE's processing power is the synergistic processor unit (SPU), a data-parallel processing engine aimed at providing parallelism at all abstraction levels.
Data-parallel instructions support data-level parallelism, whereas having multiple SPUs on a chip supports thread-level parallelism.

The SPU architecture is based on pervasively data parallel computing (PDPC), the aim of which is to architect and exploit wide data paths throughout the system. The processor then performs both scalar and data-parallel SIMD execution on these wide data paths, eliminating the overhead of additional issue slots, separate pipelines, and the control complexity of separate scalar units. The processor also uses wide data paths to deliver instructions from memory to the execution units.

Synergistic Processing in Cell's Multicore Architecture
Michael Gschwind, H. Peter Hofstee, Brian Flachs, Martin Hopkins (IBM); Yukio Watanabe (Toshiba); Takeshi Yamazaki (Sony Computer Entertainment)

Eight synergistic processor units enable the Cell Broadband Engine's breakthrough performance. The SPU architecture implements a novel, pervasively data-parallel architecture combining scalar and SIMD processing on a wide data path. A large number of SPUs per chip provide high thread-level parallelism.

Published by the IEEE Computer Society. 0272-1732/06/$20.00 © 2006 IEEE

Overview of the Cell Broadband Engine architecture

As Figure 1 illustrates, the Cell BE implements a single-chip multiprocessor with nine processors operating on a shared, coherent system memory. The function of the processor elements is specialized into two types: the Power processor element (PPE) is optimized for control tasks, and the eight synergistic processor elements (SPEs) provide an execution environment optimized for data processing. Figure 2 is a die photo of the Cell BE.

The design goals of the SPE and its architectural specification were to optimize for a low-complexity, low-area implementation.

The PPE is built on IBM's 64-bit Power Architecture with 128-bit vector media extensions [5] and a two-level on-chip cache hierarchy.
It is fully compliant with the 64-bit Power Architecture specification and can run 32-bit and 64-bit operating systems and applications.

The SPEs are independent processors, each running an independent application thread. The SPE design is optimized for computation-intensive applications. Each SPE includes a private local store for efficient instruction and data access, but also has full access to the coherent shared memory, including the memory-mapped I/O space.

Both types of processor cores share access to a common address space, which includes main memory and address ranges corresponding to each SPE's local store, control registers, and I/O devices.

[March-April 2006]

Figure 1. Cell system architecture. The Cell Broadband Engine Architecture integrates a Power processor element (PPE) and eight synergistic processor elements (SPEs) in a unified system architecture. The PPE is based on the 64-bit Power Architecture with vector media extensions and provides common system functions, while the SPEs perform data-intensive processing. The element interconnect bus connects the processor elements with a high-performance communication subsystem.

Synergistic processing

The PPE and SPEs are highly integrated. The PPE provides common control functions, runs the operating system, and provides application control, while the SPEs provide the bulk of the application performance.
The PPE and SPEs share address translation and virtual memory architecture, and provide support for virtualization and dynamic system partitioning. They also share system page tables and system functions such as interrupt presentation. Finally, they share data type formats and operation semantics to allow efficient data sharing among them.

Each SPE consists of the SPU and the synergistic memory flow (SMF) controller. The SMF controller moves data and performs synchronization in parallel with SPU processing and implements the interface to the element interconnect bus, which provides the Cell BE with a modular, scalable integration point.

Design drivers

For both the architecture and microarchitecture, our goal was not to build the highest single-core performance execution engine, but to deliver the most performance per area invested, reduce the area per core, and increase the number of cores (thread contexts) available in a given chip area. The design