When IBM, Sony, and Toshiba launched the Cell project [1] in 2000, the design goal was to improve performance an order of magnitude over that of desktop systems shipping in 2005. To meet that goal, designers had to optimize performance against area, power, volume, and cost, but clearly single-core designs offered diminishing returns on investment [1-3]. If increased efficiency was the overriding concern, legacy architectures, which typically incur a big overhead per data operation, would not suffice.

Thus, the design strategy was to exploit architecture innovation at all levels directed at increasing efficiency to deliver the most performance per area invested, reduce the area per core, and have more cores in a given chip area. In this way, the design would exploit application parallelism while supporting established application models, and thereby ensure good programmability as well as programmer efficiency.

The result was the Cell Broadband Engine Architecture, which is based on heterogeneous chip multiprocessing. Its first implementation is the Cell Broadband Engine (Cell BE). The Cell BE supports scalar and single-instruction, multiple-data (SIMD) execution equally well and provides a high-performance multithreaded execution environment for all applications. The streamlined, data-processing-oriented architecture enabled a design with smaller cores and thus more cores on a chip [4]. This translates to improved performance for all programs with thread-level parallelism, regardless of their ability to exploit data-level parallelism.

One of the key architecture features that enable the Cell BE's processing power is the synergistic processor unit (SPU), a data-parallel processing engine aimed at providing parallelism at all abstraction levels.
Data-parallel instructions support data-level parallelism, whereas having multiple SPUs on a chip supports thread-level parallelism.

The SPU architecture is based on pervasively data parallel computing (PDPC), the aim of which is to architect and exploit wide data paths throughout the system. The processor then performs both scalar and data-parallel SIMD execution on these wide data paths, eliminating the overhead of additional issue slots, separate pipelines, and the control complexity of separate scalar units. The processor also uses wide data paths to deliver instructions from memory to the execution units.

Synergistic Processing in Cell's Multicore Architecture
Michael Gschwind, H. Peter Hofstee, Brian Flachs, Martin Hopkins (IBM); Yukio Watanabe (Toshiba); Takeshi Yamazaki (Sony Computer Entertainment)

Eight synergistic processor units enable the Cell Broadband Engine's breakthrough performance. The SPU architecture implements a novel, pervasively data-parallel architecture combining scalar and SIMD processing on a wide data path. A large number of SPUs per chip provide high thread-level parallelism.

Published by the IEEE Computer Society. 0272-1732/06/$20.00 © 2006 IEEE

Overview of the Cell Broadband Engine architecture

As Figure 1 illustrates, the Cell BE implements a single-chip multiprocessor with nine processors operating on a shared, coherent system memory. The function of the processor elements is specialized into two types: the Power processor element (PPE) is optimized for control tasks, and the eight synergistic processor elements (SPEs) provide an execution environment optimized for data processing. Figure 2 is a die photo of the Cell BE.

The design goals of the SPE and its architectural specification were to optimize for a low-complexity, low-area implementation.

The PPE is built on IBM's 64-bit Power Architecture with 128-bit vector media extensions [5] and a two-level on-chip cache hierarchy.
It is fully compliant with the 64-bit Power Architecture specification and can run 32-bit and 64-bit operating systems and applications.

The SPEs are independent processors, each running an independent application thread. The SPE design is optimized for computation-intensive applications. Each SPE includes a private local store for efficient instruction and data access, but also has full access to the coherent shared memory, including the memory-mapped I/O space.

Both types of processor cores share access to a common address space, which includes main memory and address ranges corresponding to each SPE's local store, control registers, and I/O devices.

[March-April 2006]

Figure 1. Cell system architecture. The Cell Broadband Engine Architecture integrates a Power processor element (PPE) and eight synergistic processor elements (SPEs) in a unified system architecture. The PPE is based on the 64-bit Power Architecture with vector media extensions and provides common system functions, while the SPEs perform data-intensive processing. The element interconnect bus connects the processor elements with a high-performance communication subsystem.

Synergistic processing

The PPE and SPEs are highly integrated. The PPE provides common control functions, runs the operating system, and provides application control, while the SPEs provide the bulk of the application performance.
The PPE and SPEs share address translation and virtual memory architecture, and provide support for virtualization and dynamic system partitioning. They also share system page tables and system functions such as interrupt presentation. Finally, they share data type formats and operation semantics to allow efficient data sharing among them.

Each SPE consists of the SPU and the synergistic memory flow (SMF) controller. The SMF controller moves data and performs synchronization in parallel with SPU processing and implements the interface to the element interconnect bus, which provides the Cell BE with a modular, scalable integration point.

Design drivers

For both the architecture and microarchitecture, our goal was not to build the highest single-core performance execution engine, but to deliver the most performance per area invested, reduce the area per core, and increase the number of cores (thread contexts) available in a given chip area. The design