CMU CS 15740 - memory_wall_isca98

Abstract

Current high-performance computer systems use complex, large superscalar CPUs that interface to the main memory through a hierarchy of caches and interconnect systems. These CPU-centric designs invest a lot of power and chip area to bridge the widening gap between CPU and main-memory speeds. Yet many large applications do not operate well on these systems and are limited by the performance of the memory subsystem.

This paper argues for an integrated system approach that uses less-powerful CPUs tightly integrated with advanced memory technologies to build competitive systems with greatly reduced cost and complexity. Based on a design study using the next-generation 0.25µm, 256Mbit dynamic random-access memory (DRAM) process and on the analysis of existing machines, we show that processor-memory integration can be used to build competitive, scalable and cost-effective MP systems.

We present results from execution-driven uni- and multiprocessor simulations showing that the benefits of lower latency and higher bandwidth can compensate for the restrictions on the size and complexity of the integrated processor. In this system, small direct-mapped instruction caches with long lines are very effective, as are column-buffer data caches augmented with a victim cache.

1 Introduction

Traditionally, the development of processor and memory devices has proceeded independently. Advances in process technology, circuit design, and processor architecture have led to a near-exponential increase in processor speed and memory capacity. However, memory latencies have not improved as dramatically, and access times increasingly limit system performance, a phenomenon known as the Memory Wall [1][2]. This problem is commonly addressed by adding several levels of cache to the memory system, so that small, high-speed static random-access memory (SRAM) devices feed a superscalar microprocessor at low latencies.
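The cost of this widening gap can be made concrete with the standard average-memory-access-time (AMAT) recurrence, AMAT = hit_time + miss_rate × miss_penalty, applied level by level down the hierarchy. The sketch below is illustrative only; the cycle counts and miss rates are assumed round numbers, not measurements from the paper.

```python
def amat(levels, memory_latency):
    """Average memory access time for a cache hierarchy.

    `levels` is a list of (hit_time_cycles, local_miss_rate) tuples from
    L1 downward; `memory_latency` is the DRAM access time in cycles.
    Standard recurrence: AMAT = hit + miss_rate * penalty_of_next_level.
    """
    penalty = memory_latency
    for hit, miss in reversed(levels):
        penalty = hit + miss * penalty
    return penalty

# Assumed numbers: 1-cycle L1 hitting 95% of the time, 10-cycle L2
# servicing 75% of L1 misses, 100-cycle DRAM.
fast_dram = amat([(1, 0.05), (10, 0.25)], 100)   # ~2.75 cycles
# If the effective DRAM latency triples while the caches stay the same,
# the average creeps up -- the gap leaks through every cache level.
slow_dram = amat([(1, 0.05), (10, 0.25)], 300)   # ~5.25 cycles
print(fast_dram, slow_dram)
```

The recurrence also shows why applications with poor locality fall off a cliff: a higher L1/L2 miss rate multiplies directly against the full DRAM penalty.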
Combined with latency-hiding techniques such as prefetching and careful code scheduling, it is possible to run a high-performance processor at reasonable efficiency for applications with enough locality to suit the caches.

The approach outlined above is used in the high-end systems of all the mainstream microprocessor architectures. While achieving impressive performance on applications that fit nicely into their caches, such as the Spec’92 [3] benchmarks, these platforms have become increasingly application-sensitive. Large applications such as CAD programs, databases, or scientific applications often fail to meet CPU-speed-based expectations by a wide margin.

The CPU-centric design philosophy has led to very complex superscalar processors with deep pipelines. Much of this complexity, for example out-of-order execution and register scoreboarding, is devoted to hiding memory-system latency. Moreover, high-end microprocessors demand a large amount of support logic in the form of caches, controllers, and data paths. Not including I/O, a state-of-the-art 10M-transistor CPU chip may need a dozen large, hot, and expensive support chips for cache memory, cache controller, data path, and memory controller in order to talk to main memory. This adds considerable cost, power dissipation, and design complexity. To fully utilize such a heavy-weight processor, a large memory system is required.

FIGURE 1: Compute System Components

The effect of this design is to create a bottleneck, increasing the distance between the CPU and memory, as depicted in Figure 1.
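Prefetching, one of the latency-hiding techniques mentioned above, only helps if each prefetch is issued far enough ahead of its use. A standard rule of thumb (an assumption here, not a figure from the paper) is that the prefetch distance in loop iterations must cover the memory latency:

```python
import math

def prefetch_distance(memory_latency_cycles, cycles_per_iteration):
    """How many loop iterations ahead a software prefetch must be issued
    so the data arrives before it is used: ceil(latency / work-per-iter).
    Standard rule of thumb; the cycle counts below are assumptions."""
    return math.ceil(memory_latency_cycles / cycles_per_iteration)

# A 100-cycle DRAM access hidden behind a loop doing 6 cycles of work
# per iteration needs prefetches issued 17 iterations ahead.
print(prefetch_distance(100, 6))   # -> 17
# Lower memory latency shrinks the distance -- and the number of
# outstanding requests the memory system must sustain.
print(prefetch_distance(30, 6))    # -> 5
```

Deep prefetch distances demand many concurrent outstanding misses, which is exactly the kind of support logic the CPU-centric design must pay for.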
It adds interfaces and chip boundaries, which reduce the available memory bandwidth due to packaging and connection constraints; only a small fraction of the internal bandwidth of a DRAM device is accessible externally.

We shall show that integrating the processor with the memory device avoids most of the problems of the CPU-centric design approach and can offer a number of advantages that effectively compensate for the technological limitations of a single-chip design.

2 Background

The relatively good performance of Sun’s SparcStation 5 workstation (SS-5), with respect to contemporary high-end models, provides evidence for the benefits of tighter memory-processor integration.

Targeted at the “low-end” of the architecture spectrum, the SS-5 contains a single-scalar MicroSparc CPU with single-level, small, on-chip caches (16KByte instruction, 8KByte data). For machine simplicity the memory controller was integrated into the CPU, so the DRAM devices are driven directly by logic on the processor chip. A separate I/O bus connects the CPU with peripheral devices, which can access memory only through the CPU chip.

A comparable “high-end” machine of the same era is the SparcStation 10/61 (SS-10/61), containing a super-scalar SuperSparc CPU with two cache levels: separate 20KB instruction and 16KB data caches at level 1, and a shared 1MByte cache at level 2.

Missing the Memory Wall: The Case for Processor/Memory Integration
Ashley Saulsbury†, Fong Pong, Andreas Nowatzyk
Sun Microsystems Computer Corporation; †Swedish Institute of Computer Science
e-mail: [email protected], [email protected]
Copyright © 1996 Association for Computing Machinery. To appear in the proceedings of the 23rd annual International Symposium on Computer Architecture, June 1996. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the ACM. To copy otherwise, or to republish, requires a fee and/or special permission.

Compared to the SS-10/61, the SS-5 has an inferior Spec’92 rating, yet, as shown in Table 1, it out-performs the SS-10/61 on a logic-synthesis workload (Synopsys [4]) that has a working set of over 50 Mbytes.

The reason for this discrepancy is the lower main-memory latency of the SS-5, which can compensate for the “slower” CPU. Figure 2 exposes the memory access times for the levels of the cache hierarchy by walking various-sized memory arrays with different stride lengths. Codes that frequently miss the SS-10’s large level-2 cache will see lower access times on the SS-5.

FIGURE 2: SS-5 vs. SS-10 Latencies

The “Memory Wall” is perhaps the first of a number of impending hurdles that, in the not-too-distant future, will impinge upon the rapid growth in uniprocessor performance. The pressure to seek further performance through multiprocessor and other forms of parallelism will increase, but these
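The stride-walking methodology behind Figure 2 can be sketched in software: walk arrays of various sizes at various strides and watch when accesses spill out of a cache level. Below, a simple direct-mapped cache model stands in for real hardware timing; the cache parameters are illustrative assumptions, not the SS-5’s or SS-10’s actual configurations.

```python
def miss_rate(array_bytes, stride_bytes, cache_bytes=16 * 1024, line_bytes=32):
    """Fraction of accesses that miss a direct-mapped cache while
    repeatedly walking an array of `array_bytes` at `stride_bytes`."""
    n_lines = cache_bytes // line_bytes
    tags = [None] * n_lines          # one resident line per set (direct-mapped)
    misses = accesses = 0
    for _ in range(4):               # several passes, as a timing loop would do
        for addr in range(0, array_bytes, stride_bytes):
            line = addr // line_bytes
            index = line % n_lines
            accesses += 1
            if tags[index] != line:  # cold or conflict miss: fetch the line
                tags[index] = line
                misses += 1
    return misses / accesses

# An array that fits in the cache sees only cold misses on the first pass;
# one much larger than the cache, walked at a stride beyond the line size,
# misses on every access.
print(miss_rate(8 * 1024, 128))    # -> 0.25 (cold misses on pass 1 of 4)
print(miss_rate(64 * 1024, 128))   # -> 1.0
```

Plotting such miss rates (or, on hardware, access times) against array size and stride is exactly how the knees at each cache-capacity boundary in Figure 2 become visible.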

