Missing the Memory Wall: The Case for Processor/Memory Integration

Ashley Saulsbury, Fong Pong, Andreas Nowatzyk

Sun Microsystems Computer Corporation
Swedish Institute of Computer Science
e-mail: ans@sics.se, agn@acm.org

Abstract

Current high performance computer systems use complex, large superscalar CPUs that interface to the main memory through a hierarchy of caches and interconnect systems. These CPU-centric designs invest a lot of power and chip area to bridge the widening gap between CPU and main memory speeds. Yet, many large applications do not operate well on these systems and are limited by the memory subsystem performance.

This paper argues for an integrated system approach that uses less-powerful CPUs that are tightly integrated with advanced memory technologies to build competitive systems with greatly reduced cost and complexity. Based on a design study using the next generation 0.25 µm, 256 Mbit dynamic random-access memory (DRAM) process and on the analysis of existing machines, we show that processor/memory integration can be used to build competitive, scalable, and cost-effective MP systems.

We present results from execution-driven uni- and multi-processor simulations showing that the benefits of lower latency and higher bandwidth can compensate for the restrictions on the size and complexity of the integrated processor. In this system, small direct-mapped instruction caches with long lines are very effective, as are column buffer data caches augmented with a victim cache.

1 Introduction

Traditionally, the development of processor and memory devices has proceeded independently. Advances in process technology, circuit design, and processor architecture have led to a near-exponential increase in processor speed and memory capacity. However, memory latencies have not improved as dramatically, and access times are increasingly limiting system performance, a phenomenon known as the Memory Wall [1][2]. This problem is commonly addressed by adding several levels of cache to the memory
system, so that small, high-speed static random-access memory (SRAM) devices feed a superscalar microprocessor at low latencies. Combined with latency-hiding techniques such as prefetching and proper code scheduling, it is possible to run a high performance processor at reasonable efficiencies for applications with enough locality for the caches.

The approach outlined above is used in high-end systems of all the mainstream microprocessor architectures. While achieving impressive performance on applications that fit nicely into their caches, such as the Spec'92 [3] benchmarks, these platforms have become increasingly application sensitive. Large applications such as CAD programs, databases, or scientific applications often fail to meet CPU-speed-based expectations by a wide margin.

Copyright 1996 Association for Computing Machinery. To appear in the proceedings of the 23rd annual International Symposium on Computer Architecture, June 1996. Permission to copy without fee all or part of this material is granted, provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the ACM. To copy otherwise, or to republish, requires a fee and/or special permission.

The CPU-centric design philosophy has led to very complex superscalar processors with deep pipelines. Much of this complexity, for example out-of-order execution and register scoreboarding, is devoted to hiding memory system latency. Moreover, high-end microprocessors demand a large amount of support logic in terms of caches, controllers, and data paths. Not including I/O, a state-of-the-art 10M-transistor CPU chip may need a dozen large, hot, and expensive support chips for cache memory, cache controller, data path, and memory controller to talk to main memory. This adds considerable cost, power dissipation, and design complexity. To fully utilize this heavy-weight processor, a large memory system is
required.

[FIGURE 1: Compute System Components (DRAM, SRAM, CPU, DP, CNTL)]

The effect of this design is to create a bottleneck, increasing the distance between the CPU and memory, as depicted in Figure 1. It adds interfaces and chip boundaries, which reduce the available memory bandwidth: due to packaging and connection constraints, only a small fraction of the internal bandwidth of a DRAM device is accessible externally.

We shall show that integrating the processor with the memory device avoids most of the problems of the CPU-centric design approach and can offer a number of advantages that effectively compensate for the technological limitations of a single-chip design.

2 Background

The relatively good performance of Sun's SparcStation 5 workstation (SS-5) with respect to contemporary high-end models provides evidence for the benefits of tighter memory-processor integration. Targeted at the low end of the architecture spectrum, the SS-5 contains a single-scalar MicroSparc CPU with single-level, small on-chip caches (16 KByte instruction, 8 KByte data). For machine simplicity, the memory controller was integrated into the CPU, so the DRAM devices are driven directly by logic on the processor chip. A separate I/O bus connects the CPU with peripheral devices, which can access memory only through the CPU chip.

A comparable high-end machine of the same era is the SparcStation 10/61 (SS-10/61), containing a superscalar SuperSparc CPU with two cache levels: separate 20 KB instruction and 16 KB data caches at level 1, and a shared 1 MByte of cache at level 2.

Compared to the SS-10/61, the SS-5 has an inferior Spec'92 rating, yet, as shown in Table 1, it out-performs the SS-10/61 on a logic synthesis workload (Synopsys [4]) that has a working set of over 50 Mbytes.

Machine     Spec'92 Int   Spec'92 Fp   Synopsys Run Time
SS-5        64            54.6         32 minutes
SS-10/61    89            103          44 minutes

TABLE 1: SS-5 vs. SS-10 Synopsys Performance

The reason for this discrepancy is the lower main memory latency of the SS-5, which can compensate for the slower CPU.
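To see why main memory latency can dominate once a working set outgrows the cache, a strided array walk can be modeled with a toy direct-mapped cache. The following is a minimal sketch only: the function name `average_access_ns`, the cache geometry, and the hit and miss latencies are hypothetical values chosen for illustration, not measurements or parameters of the SS-5 or the SS-10/61.

```python
# Toy model of a strided array walk through a direct-mapped cache.
# All sizes and latencies here are hypothetical, chosen for illustration;
# they are not measurements of any machine discussed in the text.

def average_access_ns(array_bytes, stride_bytes, cache_bytes, line_bytes,
                      hit_ns, miss_ns, passes=2):
    """Return the modeled average access time over the final pass.

    Earlier passes only warm the cache, so the final pass reflects
    steady-state behavior rather than cold-start misses.
    """
    num_lines = cache_bytes // line_bytes
    tags = [None] * num_lines            # one resident block per cache line
    total_ns = 0.0
    accesses = 0
    for p in range(passes):
        measured = (p == passes - 1)     # only time the last pass
        for addr in range(0, array_bytes, stride_bytes):
            block = addr // line_bytes   # which memory block is touched
            index = block % num_lines    # direct-mapped placement
            hit = tags[index] == block
            tags[index] = block          # fill (or keep) the line
            if measured:
                total_ns += hit_ns if hit else miss_ns
                accesses += 1
    return total_ns / accesses


if __name__ == "__main__":
    # Assumed example: 8 KB cache, 32-byte lines, 10 ns hit, 100 ns miss.
    for kb in (4, 8, 16, 64):
        ns = average_access_ns(kb * 1024, 32, 8 * 1024, 32,
                               hit_ns=10, miss_ns=100)
        print(f"{kb:3d} KB array: {ns:.1f} ns per access")
```

In this model, arrays that fit in the cache settle at the hit latency after the warm-up pass, while arrays at least twice the cache size pay the miss latency on every new line. A machine with a smaller cache but a cheaper miss can therefore come out ahead on large working sets.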
Figure 2 exposes the memory access times for the levels of the cache hierarchy by walking various-sized memory arrays with different stride lengths. Codes that frequently miss the SS-10's large level-2 cache will see lower access time on the SS-5.

[FIGURE 2: SS-5 vs. SS-10 Latencies. Access latency (ns) versus array size (KBytes) at strides of 4, 16, and 256, for the 65 MHz SuperSparc (SS-10/61) and the 85 MHz MicroSparc2 (SS-5).]

The Memory Wall is perhaps the

