Copyright © 1999 IEEE. Published in the Proceedings of the 26th International Symposium on Computer Architecture, May 2-4, 1999, in Atlanta GA, USA. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE. Contact: Manager, Copyrights and Permissions / IEEE Service Center / 445 Hoes Lane / P.O. Box 1331 / Piscataway, NJ 08855-1331, USA. Telephone: + Intl. 908-562-3966.

ABSTRACT

In response to the growing gap between memory access time and processor speed, DRAM manufacturers have created several new DRAM architectures. This paper presents a simulation-based performance study of a representative group, each evaluated in a small system organization. These small-system organizations correspond to workstation-class computers and use on the order of 10 DRAM chips. The study covers Fast Page Mode, Extended Data Out, Synchronous, Enhanced Synchronous, Synchronous Link, Rambus, and Direct Rambus designs. Our simulations reveal several things: (a) current advanced DRAM technologies are attacking the memory bandwidth problem but not the latency problem; (b) bus transmission speed will soon become a primary factor limiting memory-system performance; (c) the post-L2 address stream still contains significant locality, though it varies from application to application; and (d) as we move to wider buses, row access time becomes more prominent, making it important to investigate techniques to exploit the available locality to decrease access time.

1 INTRODUCTION

In response to the growing gap between memory access time and processor speed, DRAM manufacturers have created several new DRAM architectures.
This paper presents a simulation-based performance study of a representative group, evaluating each in terms of its effect on total execution time. We simulate the performance of seven DRAM architectures: Fast Page Mode [35], Extended Data Out [16], Synchronous [17], Enhanced Synchronous [10], Synchronous Link [38], Rambus [31], and Direct Rambus [32]. While there are a number of academic proposals for new DRAM designs, space limits us to covering only existent commercial parts. To obtain accurate memory-request timing for an aggressive out-of-order processor, we integrate our code into the SimpleScalar tool set [4].

This paper presents a baseline study of a small-system DRAM organization: these are systems with only a handful of DRAM chips (0.1–1GB). We do not consider large-system DRAM organizations with many gigabytes of storage that are highly interleaved. The study asks and answers the following questions:

• What is the effect of improvements in DRAM technology on the memory latency and bandwidth problems?
Contemporary techniques for improving processor performance and tolerating memory latency are exacerbating the memory bandwidth problem [5]. Our results show that current DRAM architectures are attacking exactly this problem: the most recent technologies (SDRAM, ESDRAM, and Rambus) have reduced the stall time due to limited bandwidth by a factor of three compared to earlier DRAM architectures. However, the memory-latency component of overhead has not improved.

• Where is time spent in the primary memory system (the memory system beyond the cache hierarchy, but not including secondary [disk] or tertiary [backup] storage)?
What is the performance benefit of exploiting the page mode of contemporary DRAMs?
For the newer DRAM designs, the time to extract the required data from the sense amps/row caches for transmission on the memory bus is the largest component in the average access time, though page mode allows this to be overlapped with column access and the time to transmit the data over the memory bus.

• How much locality is there in the address stream that reaches the primary memory system?
The stream of addresses that miss the L2 cache contains a significant amount of locality, as measured by the hit rates in the DRAM row buffers. The hit rates for the applications studied range from 8–95%, with a mean hit rate of 40% for a 1MB L2 cache. (This does not include hits to the row buffers when making multiple DRAM requests to read one cache line.)

We also make several observations. First, there is a one-time trade-off between cost, bandwidth, and latency: to a point, latency can be decreased by ganging together multiple DRAMs into a wide structure. This trades dollars for bandwidth that reduces latency, because a request size is typically much larger than the DRAM transfer width. Page mode and interleaving are similar optimizations that work because a request size is typically larger than the bus width. However, the latency benefits are limited by bus and DRAM speeds: to get further improvements, one must run the DRAM core and bus at faster speeds. Current memory buses are adequate for small systems but are likely inadequate for large ones. Embedded DRAM [5, 19, 37] is not a near-term solution, as its performance is poor on high-end workloads [3]. Faster buses are more likely solutions; witness the elimination of the slow intermediate memory bus in future systems [12]. Another solution is to internally bank the memory array into many small arrays so that each can be accessed very quickly, as in the MoSys Multibank DRAM architecture [39].

Second, widening buses will present new optimization opportunities. Each application exhibits a different degree of locality and therefore benefits from page mode to a different degree. As buses widen, this effect becomes more pronounced, to the extent that different applications can have average access times that differ by 50%. This is a minor issue considering current bus technology. However, future bus technologies will expose the row access as the primary performance bottleneck, justifying the exploration of mechanisms to exploit locality to guarantee hits in the DRAM row buffers: e.g., row-buffer victim caches, prediction mechanisms, etc.

Third, while buses as wide as the L2 cache yield the best memory latency, they cannot halve the latency of a bus half as wide. Page mode overlaps the components of DRAM access when making multiple requests to the same row. If the bus is as wide as a request, a cache-line fill requires only a single DRAM access, and this overlap is lost.

A Performance Comparison of Contemporary DRAM Architectures
Vinodh Cuppu, Bruce Jacob, Brian Davis, Trevor Mudge
Dept. of Electrical & Computer Engineering
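The row-buffer hit-rate metric used above can be sketched in a few lines. The following is an illustrative model only, not the paper's SimpleScalar-based simulator: it assumes an open-page policy, and the ROW_SIZE and NUM_BANKS parameters and the bank-mapping function are hypothetical choices for the example.

```python
# Illustrative sketch: estimating DRAM row-buffer hit rate for an
# open-page policy over a stream of post-L2 miss addresses.
# ROW_SIZE, NUM_BANKS, and the row->bank mapping are assumptions.

ROW_SIZE = 2 * 1024      # bytes held open in one row buffer (sense amps)
NUM_BANKS = 4            # independent banks, each with its own row buffer

def row_buffer_hit_rate(addresses):
    """Fraction of accesses that find their row already open."""
    open_rows = {}       # bank -> currently open row number
    hits = 0
    for addr in addresses:
        row = addr // ROW_SIZE
        bank = row % NUM_BANKS
        if open_rows.get(bank) == row:
            hits += 1                  # page-mode hit: column access only
        else:
            open_rows[bank] = row      # row miss: precharge + row activate
    return hits / len(addresses)

# A stream with spatial locality hits the open row often:
trace = [0x0000, 0x0040, 0x0080, 0x4000, 0x4040, 0x0100]
print(row_buffer_hit_rate(trace))      # → 0.5
```

In this toy trace, consecutive accesses within the same 2KB row count as hits, while the return to row 0 after touching row 8 misses because the bank's row buffer now holds a different row; the per-application variation in such hit rates (8–95% above) reflects how strongly each address stream clusters within rows.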