Reducing DRAM Latencies with an Integrated Memory Hierarchy Design

Wei-fen Lin and Steven K. Reinhardt
Electrical Engineering and Computer Science Dept.
University of Michigan
{wflin,stever}@eecs.umich.edu

Doug Burger
Department of Computer Sciences
University of Texas at Austin
[email protected]

Appears in the 7th International Symposium on High-Performance Computer Architecture, January 2001. This work is supported in part by the National Science Foundation under Grant No. CCR-9734026, a gift from Intel, IBM University Partnership Program Awards, and an equipment grant from Compaq.

Abstract

In this paper, we address the severe performance gap caused by high processor clock rates and slow DRAM accesses. We show that even with an aggressive, next-generation memory system using four Direct Rambus channels and an integrated one-megabyte level-two cache, a processor still spends over half of its time stalling for L2 misses. Large cache blocks can improve performance, but only when coupled with wide memory channels. DRAM address mappings also affect performance significantly. We evaluate an aggressive prefetch unit integrated with the L2 cache and memory controllers. By issuing prefetches only when the Rambus channels are idle, prioritizing them to maximize DRAM row buffer hits, and giving them low replacement priority, we achieve a 43% speedup across 10 of the 26 SPEC2000 benchmarks, without degrading performance on the others. With eight Rambus channels, these ten benchmarks improve to within 10% of the performance of a perfect L2 cache.

1. Introduction

Continued improvements in processor performance, and in particular sharp increases in clock frequencies, are placing increasing pressure on the memory hierarchy. Modern system designers employ a wide range of techniques to reduce or tolerate memory-system delays, including dynamic scheduling, speculation, and multithreading in the processing core; multiple levels of caches, non-blocking accesses, and prefetching in the cache hierarchy; and banking, interleaving, access scheduling, and high-speed interconnects in main memory.

In spite of these optimizations, the time spent in the memory system remains substantial. In Figure 1, we depict the performance of the SPEC CPU2000 benchmarks for a simulated 1.6GHz, 4-way issue, out-of-order core with 64KB split level-one caches; a four-way, 1MB on-chip level-two cache; and a straightforward Direct Rambus memory system with four 1.6GB/s channels. (We describe our target system in more detail in Section 3.) Let $I_{Real}$, $I_{PerfectL2}$, and $I_{PerfectMem}$ be the instructions per cycle of each benchmark assuming the described memory system, the described L1 caches with a perfect L2 cache, and a perfect memory system (perfect L1 cache), respectively. The three sections of each bar, from bottom to top, represent $I_{Real}$, $I_{PerfectL2}$, and $I_{PerfectMem}$. By taking the harmonic mean of these values across our benchmarks, and computing $(I_{PerfectMem} - I_{Real}) / I_{PerfectMem}$, we obtain the fraction of performance lost due to an imperfect memory system (equivalently, $(CPI_{Real} - CPI_{PerfectMem}) / CPI_{Real}$, where $CPI_X$ is the cycles per instruction for system $X$). Similarly, the fraction of performance lost due to an imperfect L2 cache (the fraction of time spent waiting for L2 cache misses) is given by $(I_{PerfectL2} - I_{Real}) / I_{PerfectL2}$. (In Figure 1, the benchmarks are ordered according to this metric.) The difference between these values is the fraction of time spent waiting for data to be fetched into the L1 caches from the L2.
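To make the arithmetic concrete, the short script below computes all three fractions from per-benchmark IPC values. It is our illustration rather than part of the paper, and the IPC numbers are hypothetical placeholders, not measurements.

    # Sketch of the stall-fraction computation described above (our
    # illustration; the IPC values are hypothetical placeholders).
    from statistics import harmonic_mean

    # Per-benchmark IPC under the three simulated configurations.
    ipc_real = [0.4, 1.1, 2.0]         # described memory system
    ipc_perfect_l2 = [1.5, 1.8, 2.3]   # real L1 caches, perfect L2
    ipc_perfect_mem = [2.6, 2.4, 2.9]  # perfect L1 (perfect memory)

    # Harmonic means across benchmarks, as in the methodology above.
    i_real = harmonic_mean(ipc_real)
    i_pl2 = harmonic_mean(ipc_perfect_l2)
    i_pmem = harmonic_mean(ipc_perfect_mem)

    # Fraction of performance lost to an imperfect memory system;
    # equivalent to (CPI_real - CPI_perfect_mem) / CPI_real.
    mem_stall_frac = (i_pmem - i_real) / i_pmem

    # Fraction lost to an imperfect L2 (time waiting on L2 misses).
    l2_stall_frac = (i_pl2 - i_real) / i_pl2

    # Remaining difference: time spent filling the L1s from the L2.
    l1_fill_frac = mem_stall_frac - l2_stall_frac

    print(f"memory stalls: {mem_stall_frac:.0%}, "
          f"L2-miss stalls: {l2_stall_frac:.0%}, "
          f"L1 fill: {l1_fill_frac:.0%}")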
For the SPEC CPU2000 benchmarks, our system spends 57% of its time servicing L2 misses, 12% of its time servicing L1 misses, and only 31% of its time performing useful computation.

Since over half of our system's execution time is spent servicing L2 cache misses, the interface between the L2 cache and DRAM is a prime candidate for optimization. Unfortunately, diverse applications have highly variable memory system behaviors. For example, mcf has the highest L2 stall fraction (80%) because it suffers 23 million L2 misses during the 200-million-instruction sample we ran, saturating the memory controller request bandwidth. At the other extreme, a 200M-instruction sample of facerec spends 60% of its time waiting for only 1.2 million DRAM accesses.

These varying behaviors imply that memory-system optimizations that improve performance for some applications may penalize others. For example, prefetching may improve the performance of a latency-bound application, but will decrease the performance of a bandwidth-bound application by consuming scarce bandwidth and increasing queueing delays [4]. Conversely, reordering memory references to increase DRAM bandwidth [5,11,15,16,19] may not help latency-bound applications, which rarely issue concurrent memory accesses, and may even hurt performance by increasing latency.

In this paper, we describe techniques to reduce level-two miss latencies for memory-intensive applications that are not bandwidth bound. These techniques complement the current trend in newer DRAM architectures, which provide increased bandwidth without corresponding reductions in latency [7]. The techniques that we evaluate, in addition to improving the performance of latency-bound applications, avoid significant performance degradation for bandwidth-intensive applications.

Our primary contribution is a proposed prefetching engine specifically designed for level-two cache prefetching on a Direct Rambus memory system. The prefetch engine utilizes scheduled region prefetching, in which blocks spatially near the addresses of recent demand misses are prefetched into the L2 cache only when the memory channel would otherwise be idle. We show that the prefetch engine improves memory system performance substantially (10% to 119%) for 10 of the 26 benchmarks we study. We see smaller improvements for the remaining benchmarks, limited by lower prefetch accuracies, a lack of available memory bandwidth, or few L2 misses. Our prefetch engine is unintrusive, however, reducing performance for only one benchmark. Three mechanisms minimize the potential negative aspects of aggressive prefetching: prefetching data only on idle Rambus channel cycles; scheduling prefetches to maximize hit rates in both the L2 cache and the DRAM row buffers; and placing the prefetches in a low-priority position in the cache sets, reducing the impact of cache pollution (the sketch at the end of this section illustrates all three).

The remainder of the paper begins with a brief description of near-future memory systems in Section 2.
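As a simplified illustration of the three mechanisms above, consider the toy model below. It is our own Python rendering of the idea, not the paper's hardware design; names such as StubL2, StubDRAM, RegionPrefetcher, channel_idle, and row_open are assumptions made for exposition.

    # Toy model of scheduled region prefetching (our illustration; the
    # paper describes hardware, and all names here are assumed).
    import random
    from collections import deque

    BLOCK = 64          # L2 block size in bytes (assumed)
    REGION_BLOCKS = 16  # blocks per prefetch region (assumed)
    REGION = BLOCK * REGION_BLOCKS

    class StubL2:
        """Toy stand-in for the L2 cache."""
        def __init__(self):
            self.blocks = set()
        def contains(self, addr):
            return addr in self.blocks
        def insert(self, addr, low_priority=False):
            # A real cache would place low-priority fills in the LRU
            # position of the set (mechanism 3); the stub only
            # records residency.
            self.blocks.add(addr)

    class StubDRAM:
        """Toy stand-in for one Direct Rambus channel."""
        def channel_idle(self):
            return random.random() < 0.5   # no pending demand requests
        def row_open(self, addr):
            return random.random() < 0.3   # would hit an open row buffer
        def fetch(self, addr):
            pass                           # issue the prefetch access

    class RegionPrefetcher:
        def __init__(self, l2, dram, max_regions=8):
            self.l2, self.dram = l2, dram
            self.regions = deque(maxlen=max_regions)

        def on_demand_miss(self, addr):
            # Remember the aligned region around each recent demand miss.
            self.regions.appendleft(addr - addr % REGION)

        def tick(self):
            # Mechanism 1: prefetch only when the channel is idle, so
            # demand misses are never delayed behind prefetches.
            if not self.dram.channel_idle():
                return
            candidates = [base + i * BLOCK
                          for base in self.regions
                          for i in range(REGION_BLOCKS)
                          if not self.l2.contains(base + i * BLOCK)]
            if not candidates:
                return
            # Mechanism 2: prefer a candidate that hits an open DRAM row.
            addr = next((a for a in candidates if self.dram.row_open(a)),
                        candidates[0])
            self.dram.fetch(addr)
            self.l2.insert(addr, low_priority=True)  # mechanism 3

    # Usage: record a demand miss, then prefetch on idle cycles.
    l2, dram = StubL2(), StubDRAM()
    pf = RegionPrefetcher(l2, dram)
    pf.on_demand_miss(0x12345)
    for _ in range(10):
        pf.tick()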