
EE392C: Advanced Topics in Computer Architecture
Chip Multiprocessors and Polymorphic Processors
Lecture #2: Data Parallel Architectures
Thursday, April 3rd, 2003
Discussion Leaders: Nuwan Jayasena, Suzy Rivoire
Notetakers: Paul Wang Lee, Wajahat Qadeer

1. Introduction

Some applications, especially many multimedia applications, have a high degree of inherent data-level parallelism (DLP). The two papers studied in this lecture describe two processor architectures that exploit this DLP to run these classes of applications efficiently. Both the VIRAM and Imagine papers point out that superscalar architectures are poorly suited to the demands of multimedia applications. These applications tend to be computationally intensive and highly data parallel with little global temporal locality, while demanding good performance, low cost, and low power. Superscalar architectures, which expend a great deal of area and power on inferring parallelism and on reactive caches, are suboptimal for these applications; instead, both papers propose architectures with explicit support for data parallelism and a more proactively managed memory hierarchy.

2. VIRAM [1]

This paper addresses a mismatch between traditional superscalar and VLIW processors and the characteristics and demands of embedded multimedia applications. These highly data-parallel applications demand low complexity, low cost, and low power consumption from hardware. However, superscalar and VLIW architectures achieve their performance through high clock frequencies and complicated control structures that infer parallelism, both of which contribute to high power usage and high cost. The authors propose a multimedia vector architecture, VIRAM, which exploits data parallelism explicitly rather than having to infer it, resulting in a high number of operations per instruction and low control complexity.
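The difference between inferred and explicit data parallelism can be sketched in plain Python (a hypothetical illustration, not actual VIRAM code): a scalar loop hides the parallelism inside sequential iterations that superscalar hardware must rediscover, while a vector-style formulation names a whole register of elements in one operation, so a single instruction encodes many operations.

```python
# Hypothetical sketch of explicit data parallelism (not real VIRAM code).
# A "vector register" holds many narrow (e.g. 16-bit) elements; one vector
# instruction applies the same operation to every element at once.

def scalar_brighten(pixels, delta):
    # Superscalar view: the parallelism is implicit in a sequential loop,
    # and the hardware must infer it iteration by iteration.
    out = []
    for p in pixels:
        out.append(min(p + delta, 65535))  # saturating 16-bit add
    return out

def vector_brighten(vreg, delta):
    # Vector view: one "instruction" operates on the whole register, so
    # control overhead is paid once for many element operations.
    return [min(p + delta, 65535) for p in vreg]

pixels = [100, 65530, 4000, 65000]
assert scalar_brighten(pixels, 10) == vector_brighten(pixels, 10)
```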
The paper's contributions are the VIRAM architecture and compiler themselves, plus an excellent evaluation of the performance, cost, and power of VIRAM and other architectures for embedded multimedia applications. VIRAM's innovations for adapting vector architectures to this domain include a large number of vector registers, support for several narrow data widths, permutation instructions to communicate across lanes without going to memory, and paged virtual addressing. VIRAM also introduces a more effective memory hierarchy than traditional reactive caches, which work poorly on multimedia applications with little global data reuse. VIRAM's combination of a vector register file and on-chip memories is better suited to the needs of multimedia applications.

The strengths of the paper are its initial idea and the rigorous comparison of VIRAM to other architectures on all the relevant parameters. Analyses of vector length and the degree of vectorization in compiled programs help validate the effectiveness of their compiler. The description of the VIRAM architecture and compiler is not detailed enough for any weaknesses in their design to be apparent from this paper. One minor quibble with the evaluation is that, because the EEMBC benchmarks are inherently biased toward architectures with caches, they never provide a truly fair comparison between the cache-less VIRAM and other architectures. An additional graph showing EEMBC run 10 times with 10 different data sets would provide a less biased point of comparison that could only be more favorable to VIRAM.

Natural extensions and improvements to this work might involve improving scalability by partitioning the functional units into clusters and explicitly scheduling communication between these clusters; this could be coupled with FU-local registers to create a bandwidth hierarchy.
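VIRAM's cross-lane permutation instructions, mentioned above, avoid round-tripping through memory; a classic use is a tree reduction across lanes. A minimal Python sketch of the idea (hypothetical, with lanes modeled as list slots and a shift-style permutation):

```python
# Hypothetical sketch: summing across vector lanes using permutations,
# with no loads or stores to memory, unlike a cache-based reduction.

def permute_shift(vreg, amount):
    # Model of a cross-lane permutation instruction: shift elements
    # down by `amount` lanes, filling vacated lanes with zero.
    return vreg[amount:] + [0] * amount

def vector_sum(vreg):
    # Tree reduction: log2(lanes) permute+add steps instead of a
    # sequential loop over all elements.
    step = len(vreg) // 2  # assume a power-of-two lane count
    while step >= 1:
        shifted = permute_shift(vreg, step)
        vreg = [a + b for a, b in zip(vreg, shifted)]  # elementwise add
        step //= 2
    return vreg[0]  # lane 0 accumulates the total

assert vector_sum([1, 2, 3, 4, 5, 6, 7, 8]) == 36
```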
It would also be interesting to evaluate VIRAM's performance on other application domains with high data parallelism, such as scientific computing. A final open question is how to configure a polymorphic architecture to look like VIRAM.

3. Imagine [2]

This paper introduces the stream programming model and a specific implementation of a stream processor (Imagine). Applications developed for this system are shown to sustain levels of performance comparable to special-purpose embedded processors. Media applications such as signal processing, graphics, and audio and video compression demand high levels of performance and exhibit large amounts of parallelism that lend themselves well to efficient custom implementations. However, these applications run inefficiently on conventional microprocessors because their memory access patterns, which typically exhibit little reuse, are a poor match for cache memory hierarchies.

The stream programming model expresses a computation as a set of kernels that operate on streams. A stream is a sequence of homogeneous records. A kernel specifies a set of operations to be applied to every element of its input stream(s). A stream program specifies a set of kernels and orchestrates the flow of data streams among them. This model exposes data communication at stream granularity as well as the data parallelism among operations applied to elements of a stream.

Imagine is a coprocessor that executes applications expressed using the stream programming model. It consists of 48 floating-point ALUs organized into 8 identical clusters. A microcontroller issues instructions that control all 8 clusters in SIMD fashion, with VLIW control of multiple units within each cluster. Each ALU is fed by a pair of dedicated local register files (LRFs), with a high-bandwidth interconnect that allows results from any ALU to be written to any LRF within the cluster. The LRFs provide a total peak bandwidth of 544 GB/s across the 8 clusters at 500 MHz.
These LRFs form the highest-bandwidth level of a hierarchy optimized to capture data locality. The next level of the hierarchy is a 128 KB stream register file (SRF) that provides 32 GB/s of peak bandwidth and captures inter-kernel stream locality. Finally, the off-chip memory provides a peak of 2.67 GB/s. On 4 media application benchmarks, Imagine sustains 5.1 to 18.3 GOPS and achieves 1 to 2 orders of magnitude of bandwidth filtration at each level of the hierarchy. Power consumption is estimated to be 2.2 W to 3.6 W for the same benchmarks. A key advantage of the stream programming model is that it explicitly expresses stream
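The kernel-and-stream structure described above can be sketched in Python (a hypothetical illustration, not Imagine's actual StreamC/KernelC toolchain): each kernel maps one computation over every record of its input stream, and the stream program is just the composition of kernels, which makes inter-kernel producer/consumer locality explicit. The quoted bandwidth numbers also let us check the filtration claim.

```python
# Hypothetical sketch of the stream programming model (not Imagine's
# real StreamC/KernelC languages). A stream is a sequence of
# homogeneous records; a kernel applies one computation per record.

def kernel(fn):
    # Wrap a per-record function so it operates on a whole stream.
    def run(stream):
        return [fn(rec) for rec in stream]
    return run

# Two toy kernels for a pipeline: scale samples, then clamp them.
scale = kernel(lambda x: 2 * x)
clamp = kernel(lambda x: min(x, 100))

# The stream program orchestrates data flow between kernels; the
# intermediate stream would live in the SRF, not in off-chip memory.
samples = [10, 60, 30]
result = clamp(scale(samples))
assert result == [20, 100, 60]

# Bandwidth filtration across the hierarchy, from the paper's numbers:
# LRFs 544 GB/s, SRF 32 GB/s, off-chip DRAM 2.67 GB/s.
assert round(544 / 32) == 17    # LRF:SRF, about 1 order of magnitude
assert round(32 / 2.67) == 12   # SRF:DRAM, about 1 order of magnitude
```

Because each kernel consumes and produces whole streams, the compiler can keep intermediate streams in the SRF and reserve off-chip bandwidth for data that genuinely lacks reuse.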

