IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006

The Microarchitecture of the Synergistic Processor for a Cell Processor

Brian Flachs, Shigehiro Asano, Member, IEEE, Sang H. Dhong, Fellow, IEEE, H. Peter Hofstee, Member, IEEE, Gilles Gervais, Roy Kim, Tien Le, Peichun Liu, Jens Leenstra, John Liberty, Brad Michael, Hwa-Joon Oh, Silvia Melitta Mueller, Osamu Takahashi, Member, IEEE, A. Hatakeyama, Yukio Watanabe, Naoka Yano, Daniel A. Brokenshire, Mohammad Peyravian, Vandung To, and Eiji Iwata

Abstract: This paper describes an 11 FO4 streaming data processor in the IBM 90-nm SOI low-k process. The dual-issue, four-way SIMD processor emphasizes achievable performance per area and power. Software controls most aspects of data movement and instruction flow to improve memory system performance and core performance density. The design minimizes instruction latency while providing for fine-grain clock control to reduce power.

Index Terms: Cell, DSP, RISC, SIMD, SPE, SPU.

I. INTRODUCTION

Increasing thread-level parallelism, data bandwidth, memory latency, and leakage current are important drivers for new processor designs, such as Cell. Today's media-rich application software is often characterized by multiple lightweight threads and software pipelines. This trend in software design favors processors that utilize these threads to drive the improved data bandwidths over processors designed to accelerate a single thread of execution by taking advantage of instruction-level parallelism.

Memory latency is a key limiter to processor performance. Modern processors can lose up to 4000 instruction slots while they wait for data from main memory. Previous designs emphasize large caches and reorder buffers, first to reduce the average latency and second to maintain instruction throughput while waiting for data from cache misses. However, these hardware structures have difficulty scaling to the sizes required by the large data structures utilized by media-rich software.

Transistor oxides are now a few atomic levels thick
and the channels are extremely narrow. These features are very good for improving transistor performance and increasing transistor density, but tend to increase leakage current. As processor performance becomes power limited, leakage current becomes an important performance issue. Since leakage is proportional to area, processor designs need to extract more performance per transistor.

Manuscript received April 15, 2005; revised August 31, 2005.
B. Flachs, S. H. Dhong, H. P. Hofstee, G. Gervais, R. Kim, T. Le, P. Liu, J. Liberty, B. Michael, H.-J. Oh, O. Takahashi, D. A. Brokenshire, and V. To are with the IBM Systems and Technology Group, Austin, TX 78758 USA (e-mail: flachs@us.ibm.com).
S. Asano and Y. Watanabe are with Toshiba America Electronic Components, Austin, TX 78717 USA.
N. Yano is with the Broadband System LSI Development Center, Semiconductor Company, Toshiba Corporation, Kawasaki, Japan.
J. Leenstra and S. M. Mueller are with the IBM Entwicklung GmbH, Boeblingen 71032, Germany.
A. Hatakeyama and E. Iwata are with Sony Computer Entertainment, Austin, TX 78717 USA.
M. Peyravian is with IBM Microelectronics, Research Triangle Park, NC 27709 USA.
Digital Object Identifier 10.1109/JSSC.2005.859332

II. ARCHITECTURE

The Cell processor is a heterogeneous shared-memory multiprocessor [2]. It features a multi-threaded 64-bit POWER processing element (PPE) and eight synergistic processing elements (SPE). Performance per transistor is the motivation for heterogeneity. Software can be divided into general-purpose computing threads, operating system tasks, and streaming media threads, and targeted to a processing core customized for those tasks. For example, the PPE is responsible for running the operating system and coordinating the flow of the data processing threads through the SPEs. This differentiation allows the architectures and implementations of the PPE and SPE to be optimized for their respective workloads and enables significant improvements in performance per transistor.

The synergistic processor element (SPE) is the first implementation of a new
processor architecture designed to accelerate media and streaming workloads. The architecture aims to improve the effective memory bandwidth achievable by applications by improving the degree to which software can tolerate memory latency. The SPE provides the processing power needed by streaming and media workloads through four-way SIMD operations, dual issue, and high frequency.

Area and power efficiency are important enablers for multicore designs that take advantage of parallelism in applications where performance is power limited. Every design choice must trade off the performance a prospective feature would bring versus the prospect of omitting the feature and devoting the area and power toward higher clock frequency or more SPE cores per Cell processor chip. Power efficiency drives a desire to replace event and status polling performed by software during synchronization with synchronization mechanisms that allow for low-power waiting.

Fig. 1 is a diagram of the SPE architecture's major entities and their relationships. Local store is a private memory for SPE instructions and data. The synergistic processing unit (SPU) core is a processor that runs instructions from the local store and can read or write the local store with its load and store instructions. The direct memory access (DMA) unit transfers data between local store and system memory. The DMA unit is programmable by SPU software via the channel unit. The channel unit is a message-passing interface between the SPU core and the DMA unit and the rest of the Cell processing system. The channel unit is accessed by the SPE software through channel access instructions.

0018-9200/$20.00 © 2006 IEEE

The SPU core is a SIMD RISC-style processor. All instructions are encoded in 32-bit fixed-length instruction formats, and there are no hard-to-pipeline instructions. The SPU features 128 general-purpose registers. These registers are used by both floating-point and integer instructions. The shared register file allows the highest level of performance
for various workloads with the smallest number of registers. 128 registers allow for loop unrolling, which is necessary to fill functional unit pipelines with independent instructions. Most instructions operate on 128-bit-wide data. For example, the floating-point multiply-add instruction operates on vectors of four 32-bit single-precision floating-point values. Some instructions, such as floating-point multiply-add, consume three register operands and produce a register result. SPE includes instructions
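
The four-way multiply-add described above can be pictured with a small lane-wise model. The sketch below is illustrative only: it assumes a 128-bit register can be modeled as 16 packed bytes holding four 32-bit single-precision lanes, and it rounds each lane to single precision after computing in Python floats, which is not necessarily how the SPU hardware rounds. The names pack_register, unpack_register, and vector_multiply_add are hypothetical helpers, not SPU instructions or part of any Cell toolchain.

```python
import struct

LANES = 4  # assumption: one 128-bit register = four 32-bit lanes


def to_f32(x):
    """Round a Python float to the nearest 32-bit single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def pack_register(values):
    """Pack four single-precision values into 16 bytes (128 bits)."""
    if len(values) != LANES:
        raise ValueError("a register holds exactly four 32-bit lanes")
    return struct.pack("<4f", *values)


def unpack_register(reg):
    """Split a 16-byte register image back into its four lanes."""
    return list(struct.unpack("<4f", reg))


def vector_multiply_add(ra, rb, rc):
    """Lane-wise a*b + c: three register operands in, one register result out."""
    a, b, c = unpack_register(ra), unpack_register(rb), unpack_register(rc)
    return pack_register([to_f32(x * y + z) for x, y, z in zip(a, b, c)])


# Hypothetical operand values chosen to be exact in single precision.
ra = pack_register([1.0, 2.0, 3.0, 4.0])
rb = pack_register([0.5, 0.5, 0.5, 0.5])
rc = pack_register([10.0, 20.0, 30.0, 40.0])
rt = vector_multiply_add(ra, rb, rc)
print(unpack_register(rt))
```

One call consumes three register images and produces one, mirroring the three-operand, one-result form noted in the text, and all four lanes are processed by a single operation.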