Vector Vs. Superscalar and VLIW Architectures for Embedded Multimedia Benchmarks

Christoforos Kozyrakis
Electrical Engineering Department
Stanford University
christos@ee.stanford.edu

David Patterson
Computer Science Division
University of California at Berkeley
pattrsn@cs.berkeley.edu

Abstract

Multimedia processing on embedded devices requires an architecture that leads to high performance, low power consumption, reduced design complexity, and small code size. In this paper, we use EEMBC, an industrial benchmark suite, to compare the VIRAM vector architecture to superscalar and VLIW processors for embedded multimedia applications. The comparison covers the VIRAM instruction set, vectorizing compiler, and the prototype chip that integrates a vector processor with DRAM main memory. We demonstrate that executable code for VIRAM is up to 10 times smaller than VLIW code and comparable to x86 CISC code. The simple, cache-less VIRAM chip is 2 times faster than a 4-way superscalar RISC processor that uses a 5 times faster clock frequency and consumes 10 times more power. VIRAM is also 10 times faster than cache-based VLIW processors. Even after manual optimization of the VLIW code and insertion of SIMD and DSP instructions, the single-issue VIRAM processor is 60% faster than 5-way to 8-way VLIW designs.

1 Introduction

The exponentially increasing performance and generality of superscalar processors have led many to believe that vector architectures are doomed to extinction. Even in the supercomputing domain, the traditional application of vector processors, it is widely considered that interconnecting superscalar processors into large-scale MPP systems is the most promising approach [4]. Nevertheless, vector architectures provide us with frequent reminders of their capabilities.
The recently announced Japanese Earth Simulator, a supercomputer based on NEC SX-6 vector processors, provides 5 times the performance with half the number of nodes of ASCI White, the most powerful supercomputer based on superscalar technology. Vector processors remain the most effective way to exploit data-parallel applications [20].

This paper studies the efficiency of vector architectures for the emerging computing domain of multimedia programs running on embedded systems. Multimedia programs such as video, speech recognition, and 3D graphics, constitute the fastest growing class of applications [5]. They require real-time performance guarantees for data-parallel tasks that operate on narrow numbers with limited temporal locality [6]. Embedded systems include entertainment devices, such as set-top-boxes and game consoles, and portable electronics, such as PDAs and cellular phones. They call for low power consumption, small code size, and reduced design and programming complexity in order to meet the cost and time-to-market requirements of consumer electronics. The complexity, power consumption, and lack of explicit support for data-level parallelism suggest that superscalar processors are not necessarily a suitable approach for embedded multimedia processing.

To prove that vector architectures meet the requirements of embedded media-processing, we evaluate the VIRAM vector architecture with the EEMBC benchmarks, an industrial suite for embedded systems. Our evaluation covers all three components of VIRAM: the instruction set, the vectorizing compiler, and the processor microarchitecture. We show that the compiler can extract a high degree of data-level parallelism from media tasks described in C and can express it with vector instructions. The VIRAM code is significantly smaller than code for RISC and VLIW architectures and is comparable to that for x86 CISC processors.
We describe a simple, low-power, prototype chip that integrates the VIRAM architecture with embedded DRAM. The cache-less vector processor is 2 times faster than a 4-way superscalar processor running at a 5 times higher clock frequency. Despite issuing a single instruction per cycle, it is also 10 times faster than 5-way to 8-way VLIW designs. We demonstrate that the vector processor provides performance advantages for both highly vectorizable benchmarks and partially vectorizable tasks with short vectors.

The rest of this paper is structured as follows. Section 2 summarizes the basic features of the VIRAM architecture. Section 3 describes the EEMBC embedded benchmarks. Section 4 evaluates the vectorizing compiler and the use of the vector instruction set. It also presents a code size comparison between RISC, CISC, VLIW, and vector architectures. Section 5 proceeds with a microarchitecture evaluation in terms of performance, power consumption, design complexity, and scalability. Section 6 presents related work and Section 7 concludes the paper.

0-7695-1859-1/02 $17.00 © 2002 IEEE

[Table fragment. Scalar Core: single-issue 64-bit MIPS pipeline; 8K/8K direct-mapped L1 I/D caches]

2 Vector Architecture for Multimedia

In this section, we provide an overview of the three components of the VIRAM architecture: the instruction set, the prototype processor chip, and the vectorizing compiler.

2.1 Instruction Set Overview

VIRAM is a complete, load-store, vector instruction set defined as a coprocessor extension to the MIPS architecture. The vector architecture state includes a vector register file with 32 entries that can store integer or floating-point elements, a 16-entry flag register file that contains vectors with single-bit elements, and a few scalar registers for control values and memory addresses.
The instruction set contains integer and floating-point arithmetic instructions that operate on vectors stored in the register file, as well as logical functions and operations such as population count that use the flag registers. Vector load and store instructions support the three common access patterns: unit stride, strided, and indexed. Overall, VIRAM introduces 90 unique instructions, which, due to variations, consume 660 opcodes in the coprocessor 2 space of the MIPS architecture.

To enable the vectorization of multimedia applications, VIRAM includes a number of media-specific enhancements. The elements in the vector registers can be 64, 32, or 16 bits wide. Multiple narrow elements are placed
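
[Illustrative sketch, not part of the original paper.] The three vector memory access patterns named in Section 2.1 (unit stride, strided, and indexed) can be illustrated with a small scalar model in Python. This is only a sketch of which memory elements each pattern touches. The function names, the flat list standing in for memory, and the choice to express strides and indices in elements rather than bytes are all assumptions made for illustration; they do not reflect actual VIRAM instruction mnemonics, encodings, or semantics.

```python
from typing import List, Sequence


def load_unit_stride(mem: Sequence[int], base: int, vl: int) -> List[int]:
    """Hypothetical model: read vl consecutive elements starting at base."""
    return [mem[base + i] for i in range(vl)]


def load_strided(mem: Sequence[int], base: int, stride: int, vl: int) -> List[int]:
    """Hypothetical model: read vl elements spaced stride elements apart."""
    return [mem[base + i * stride] for i in range(vl)]


def load_indexed(mem: Sequence[int], base: int, indices: Sequence[int]) -> List[int]:
    """Hypothetical model: read one element per offset in an index vector (a gather)."""
    return [mem[base + idx] for idx in indices]


# Toy memory of 32 elements holding the values 100..131 (assumed for the example).
memory = list(range(100, 132))

unit = load_unit_stride(memory, base=4, vl=4)          # elements 4, 5, 6, 7
strided = load_strided(memory, base=0, stride=8, vl=4)  # elements 0, 8, 16, 24
gathered = load_indexed(memory, base=2, indices=[5, 0, 9])  # elements 7, 2, 11
```

In this model a unit-stride access is the special case of a strided access with a stride of one element, and an indexed access replaces the fixed stride with an arbitrary per-element offset.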
