Vector Vs. Superscalar and VLIW Architectures for Embedded Multimedia Benchmarks

Christoforos Kozyrakis
Electrical Engineering Department, Stanford University
christos@ee.stanford.edu

David Patterson
Computer Science Division, University of California at Berkeley
pattrsn@cs.berkeley.edu

Abstract

Multimedia processing on embedded devices requires an architecture that leads to high performance, low power consumption, reduced design complexity, and small code size. In this paper, we use EEMBC, an industrial benchmark suite, to compare the VIRAM vector architecture to superscalar and VLIW processors for embedded multimedia applications. The comparison covers the VIRAM instruction set, vectorizing compiler, and the prototype chip that integrates a vector processor with DRAM main memory. We demonstrate that executable code for VIRAM is up to 10 times smaller than VLIW code and comparable to x86 CISC code. The simple, cache-less VIRAM chip is 2 times faster than a 4-way superscalar RISC processor that uses a 5 times faster clock frequency and consumes 10 times more power. VIRAM is also 10 times faster than cache-based VLIW processors. Even after manual optimization of the VLIW code and insertion of SIMD and DSP instructions, the single-issue VIRAM processor is 60% faster than 5-way to 8-way VLIW designs.

0-7695-1859-1/02 $17.00 © 2002 IEEE

1 Introduction

The exponentially increasing performance and generality of superscalar processors has led many to believe that vector architectures are doomed to extinction. Even in the supercomputing domain, the traditional application of vector processors, it is widely considered that interconnecting superscalar processors into large-scale MPP systems is the most promising approach [4]. Nevertheless, vector architectures provide us with frequent reminders of their capabilities. The recently announced Japanese Earth Simulator, a supercomputer based on NEC SX-6 vector processors, provides 5 times the performance with half the number of nodes of ASCI White, the most powerful supercomputer based on superscalar technology. Vector processors remain the most effective way to exploit data-parallel applications [20].

This paper studies the efficiency of vector architectures for the emerging computing domain of multimedia programs running on embedded systems. Multimedia programs, such as video, speech recognition, and 3D graphics, constitute the fastest growing class of applications [5]. They require real-time performance guarantees for data-parallel tasks that operate on narrow numbers with limited temporal locality [6]. Embedded systems include entertainment devices, such as set-top boxes and game consoles, and portable electronics, such as PDAs and cellular phones. They call for low power consumption, small code size, and reduced design and programming complexity in order to meet the cost and time-to-market requirements of consumer electronics. The complexity, power consumption, and lack of explicit support for data-level parallelism suggest that superscalar processors are not necessarily a suitable approach for embedded multimedia processing.

To prove that vector architectures meet the requirements of embedded media processing, we evaluate the VIRAM vector architecture with the EEMBC benchmarks, an industrial suite for embedded systems. Our evaluation covers all three components of VIRAM: the instruction set, the vectorizing compiler, and the processor microarchitecture. We show that the compiler can extract a high degree of data-level parallelism from media tasks described in C and can express it with vector instructions. The VIRAM code is significantly smaller than code for RISC and VLIW architectures and is comparable to that for x86 CISC processors. We describe a simple, low-power prototype chip that integrates the VIRAM architecture with embedded DRAM. The cache-less vector processor is 2 times faster than a 4-way superscalar processor running at a 5 times higher clock frequency. Despite issuing a single instruction per cycle, it is also 10 times faster than 5-way to 8-way VLIW designs. We demonstrate that the vector processor provides performance advantages for both highly vectorizable benchmarks and partially vectorizable tasks with short vectors.

The rest of this paper is structured as follows. Section 2 summarizes the basic features of the VIRAM architecture. Section 3 describes the EEMBC embedded benchmarks. Section 4 evaluates the vectorizing compiler and the use of the vector instruction set. It also presents a code size comparison between RISC, CISC, VLIW, and vector architectures. Section 5 proceeds with a microarchitecture evaluation in terms of performance, power consumption, design complexity, and scalability. Section 6 presents related work, and Section 7 concludes the paper.

2 Vector Architecture for Multimedia

[Table: only the row labels survive extraction: Scalar Core, Vector Coprocessor, Technology, Memory System, Transistors, Clock Frequency, Power Dissipation, Peak Performance.]

In this section, we provide an overview of the three components of the VIRAM architecture: the instruction set, the prototype processor chip, and the vectorizing compiler.

2.1 Instruction Set Overview

VIRAM is a complete load-store vector instruction set defined as a coprocessor extension to the MIPS architecture. The vector architecture state includes a vector register file with 32 entries that can store integer or floating-point elements, a 16-entry flag register file that contains vectors with single-bit elements, and a few scalar registers for control values and memory addresses. The instruction set contains integer and floating-point arithmetic instructions that operate on vectors stored in the register file, as well as logical functions and operations, such as population count, that use the flag registers. Vector load and store instructions support the three common access patterns: unit stride, strided, and indexed. Overall, VIRAM introduces 90 unique instructions which, due to variations, consume 660 opcodes in the coprocessor 2 space of the MIPS
architecture.

To enable the vectorization of multimedia applications, VIRAM includes a number of media-specific enhancements. The elements in the vector registers can be 64, 32, or 16 bits wide. Multiple narrow elements are placed in the storage location for one wide element. Similarly, each 64-bit datapath is partitioned in order to execute multiple narrower element operations in parallel. Instead of specifying the element and operation width in the instruction opcode, we use a control register, which is typically set
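The three access patterns and the packing of narrow elements described in this subsection can be illustrated with a small model. This is a minimal sketch in Python, not VIRAM code: the helper names (`elements_per_slot`, `load_unit_stride`, `load_strided`, `load_indexed`) and the toy `memory` list are assumptions introduced for illustration and do not correspond to VIRAM instructions or compiler output.

```python
# Illustrative model only: these helpers are hypothetical and are not VIRAM
# instructions, intrinsics, or compiler output.

DATAPATH_BITS = 64  # width of one wide element slot / datapath, per the text


def elements_per_slot(element_bits):
    """Number of elements of the given width packed into one 64-bit slot."""
    if element_bits not in (64, 32, 16):
        raise ValueError("the text lists element widths of 64, 32, or 16 bits")
    return DATAPATH_BITS // element_bits


def load_unit_stride(memory, base, vl):
    """Read vl consecutive elements starting at index base."""
    return [memory[base + i] for i in range(vl)]


def load_strided(memory, base, stride, vl):
    """Read vl elements separated by a constant stride."""
    return [memory[base + i * stride] for i in range(vl)]


def load_indexed(memory, base, indices):
    """Gather: read one element per offset held in an index vector."""
    return [memory[base + offset] for offset in indices]


memory = list(range(100, 132))  # toy memory: memory[i] == 100 + i

print(elements_per_slot(16))               # 4
print(load_unit_stride(memory, 4, 4))      # [104, 105, 106, 107]
print(load_strided(memory, 0, 8, 4))       # [100, 108, 116, 124]
print(load_indexed(memory, 2, [5, 0, 9]))  # [107, 102, 111]
```

In this model, halving the element width doubles the number of elements that share one 64-bit slot, which is the sense in which a partitioned datapath can execute several narrow operations in parallel.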