A Performance Comparison of Contemporary DRAM Architectures

Vinodh Cuppu, Bruce Jacob
Dept. of Electrical & Computer Engineering, University of Maryland, College Park
{ramvinod,blj}@eng.umd.edu

Brian Davis, Trevor Mudge
Dept. of Electrical Engineering & Computer Science, University of Michigan, Ann Arbor
{btdavis,tnm}@eecs.umich.edu

ABSTRACT

Where is time spent in the primary memory system (the memory system beyond the cache hierarchy, but not including secondary [disk] or tertiary [backup] storage)? What is the performance benefit of exploiting the page mode of contemporary DRAMs?

In response to the growing gap between memory access time and processor speed, DRAM manufacturers have created several new DRAM architectures. This paper presents a simulation-based performance study of a representative group, each evaluated in a small system organization. These small-system organizations correspond to workstation-class computers and use on the order of 10 DRAM chips. The study covers Fast Page Mode, Extended Data Out, Synchronous, Enhanced Synchronous, Synchronous Link, Rambus, and Direct Rambus designs. Our simulations reveal several things: (a) current advanced DRAM technologies are attacking the memory bandwidth problem but not the latency problem; (b) bus transmission speed will soon become a primary factor limiting memory-system performance; (c) the post-L2 address stream still contains significant locality, though it varies from application to application; and (d) as we move to wider buses, row access time becomes more prominent, making it important to investigate techniques to exploit the available locality to decrease access time.

1 INTRODUCTION

In response to the growing gap between memory access time and processor speed, DRAM manufacturers have created several new DRAM architectures. This paper presents a simulation-based performance study of a representative group, evaluating each in terms of its effect on total execution time. We simulate the performance of seven DRAM architectures: Fast Page Mode [35], Extended Data
Out [16], Synchronous [17], Enhanced Synchronous [10], Synchronous Link [38], Rambus [31], and Direct Rambus [32]. While there are a number of academic proposals for new DRAM designs, space limits us to covering only existent commercial parts. To obtain accurate memory-request timing for an aggressive out-of-order processor, we integrate our code into the SimpleScalar tool set [4].

This paper presents a baseline study of a small-system DRAM organization: these are systems with only a handful of DRAM chips (0.1-1GB). We do not consider large-system DRAM organizations with many gigabytes of storage that are highly interleaved. The study asks and answers the following questions:

What is the effect of improvements in DRAM technology on the memory latency and bandwidth problems? Contemporary techniques for improving processor performance and tolerating memory latency are exacerbating the memory bandwidth problem [5]. Our results show that current DRAM architectures are attacking exactly this problem: the most recent technologies (SDRAM, ESDRAM, and Rambus) have reduced the stall time due to limited bandwidth by a factor of three compared to earlier DRAM architectures. However, the memory-latency component of overhead has not improved. For the newer DRAM designs, the time to extract the required data from the sense amps/row caches for transmission on the memory bus is the largest component in the average access time, though page mode allows this to be overlapped with column access and the time to transmit the data over the memory bus.

How much locality is there in the address stream that reaches the primary memory system? The stream of addresses that miss the L2 cache contains a significant amount of locality, as measured by the hit rates in the DRAM row buffers. The hit rates for the applications studied range 8-95%, with a mean hit rate of 40% for a 1MB L2 cache. (This does not include hits to the row buffers when making multiple DRAM requests to read one cache line.)

We also make several observations. First, there is a one-time tradeoff between cost, bandwidth, and latency: to a point, latency can be decreased by ganging together multiple DRAMs into a wide structure. This trades dollars for bandwidth that reduces latency because a request size is typically much larger than the DRAM transfer width. Page mode and interleaving are similar optimizations that work because a request size is typically larger than the bus width. However, the latency benefits are limited by bus and DRAM speeds: to get further improvements, one must run the DRAM core and bus at faster speeds. Current memory buses are adequate for small systems but are likely inadequate for large ones. Embedded DRAM [5, 19, 37] is not a near-term solution, as its performance is poor on high-end workloads [3]. Faster buses are more likely solutions: witness the elimination of the slow intermediate memory bus in future systems [12]. Another solution is to internally bank the memory array into many small arrays so that each can be accessed very quickly, as in the MoSys Multibank DRAM architecture [39].

Second, widening buses will present new optimization opportunities. Each application exhibits a different degree of locality and therefore benefits from page mode to a different degree. As buses widen, this effect becomes more pronounced, to the extent that different applications can have average access times that differ by 50%. This is a minor issue considering current bus technology. However, future bus technologies will expose the row access as the primary performance bottleneck, justifying the exploration of mechanisms to exploit locality to guarantee hits in the DRAM row buffers, e.g., row-buffer victim caches, prediction mechanisms, etc.

Third, while buses as wide as the L2 cache yield the best memory latency, they cannot halve the latency of a bus half as wide. Page mode overlaps the components of DRAM access when making multiple requests to the same row. If the bus is as wide as a request, one

Copyright 1999 IEEE. Published in the Proceedings of the 26th International
Symposium on Computer Architecture, May 2-4, 1999, in Atlanta, GA, USA. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes, or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE. Contact: Manager, Copyrights and Permissions, IEEE Service Center, 445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855-1331, USA.
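
The introduction measures locality in the post-L2 address stream by the hit rate in the DRAM row buffers. The sketch below illustrates that metric for a simplified model and is not the simulator used in this study: it assumes an open-page policy with one row buffer per bank, and the function name `row_buffer_hit_rate`, the 2 KB row size, the four banks, and the row-interleaved address mapping are all hypothetical choices made only for illustration.

```python
from typing import Dict, Iterable


def row_buffer_hit_rate(addresses: Iterable[int],
                        row_bytes: int = 2048,   # assumed row (page) size
                        num_banks: int = 4) -> float:  # assumed bank count
    """Fraction of accesses that find their row already open.

    Simplified model: each bank keeps its most recently accessed row
    open (open-page policy), and consecutive rows are interleaved
    across banks. Both are illustrative assumptions.
    """
    open_rows: Dict[int, int] = {}  # bank -> row currently held in its buffer
    hits = 0
    total = 0
    for addr in addresses:
        row_index = addr // row_bytes     # global row number
        bank = row_index % num_banks      # rows interleaved across banks
        row = row_index // num_banks      # row number within the bank
        if open_rows.get(bank) == row:
            hits += 1                     # row-buffer hit: no new row access
        else:
            open_rows[bank] = row         # miss: open the new row
        total += 1
    return hits / total if total else 0.0
```

In this model, a stream of sequential 64-byte accesses misses only on the first touch of each row, while a stream that alternates between two rows of the same bank never hits. That spread is the kind of application-to-application variation in locality that the hit-rate figures describe.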