A Case for Intelligent RAM: IRAM
(To appear in IEEE Micro, April 1997)

David Patterson, Thomas Anderson, Neal Cardwell, Richard Fromm, Kimberly Keeton, Christoforos Kozyrakis, Randi Thomas, and Katherine Yelick
Computer Science Division/EECS Department
University of California, Berkeley CA 94720-1776
2/8/97
email: patterson@cs.berkeley.edu

Abstract: Two trends call into question the current practice of microprocessors and DRAMs being fabricated as different chips on different fab lines: 1) the gap between processor and DRAM speed is growing at 50% per year; and 2) the size and organization of memory on a single DRAM chip is becoming awkward to use in a system, yet size is growing at 60% per year. Intelligent RAM, or IRAM, merges processing and memory into a single chip to lower memory latency, increase memory bandwidth, and improve energy efficiency, as well as to allow more flexible selection of memory size and organization. In addition, IRAM promises savings in power and board area. This paper reviews the state of microprocessors and DRAMs today, explores some of the opportunities and challenges for IRAMs, and finally estimates performance and energy efficiency of three IRAM designs.

1. Introduction and Why there is a Problem

The division of the semiconductor industry into microprocessor and memory camps provides many advantages. First and foremost, a fabrication line can be tailored to the needs of the device. Microprocessor fab lines offer fast transistors to make fast logic and many metal layers to accelerate communication and simplify power distribution, while DRAM fabs offer many polysilicon layers to achieve both small DRAM cells and low leakage current to reduce the DRAM refresh rate. Separate chips also mean separate packages, allowing microprocessors to use expensive packages that dissipate high power (5 to 50 watts) and provide hundreds of pins to make wide connections to external memory, while allowing DRAMs to use inexpensive packages which dissipate low power
(1 watt) and use only a few dozen pins. Separate packages in turn mean computer designers can scale the number of memory chips independent of the number of processors: most desktop systems have 1 processor and 4 to 32 DRAM chips, but most server systems have 2 to 16 processors and 32 to 256 DRAMs. Memory systems have standardized on the Single In-line Memory Module (SIMM) or Dual In-line Memory Module (DIMM), which allows the end user to scale the amount of memory in a system.

Quantitative evidence of the success of the industry is its size: in 1995 DRAMs were a $37B industry and microprocessors were a $20B industry. In addition to financial success, the technologies of these industries have improved at unparalleled rates. DRAM capacity has quadrupled on average every 3 years since 1976, while microprocessor speed has done the same since 1986.

FIGURE 1. Processor-Memory Performance Gap [Hen96]. (Plot of CPU and DRAM performance, on an axis from 1 to 1000, against year, 1980 to 2000; the widening space between the two curves is labeled "Processor-Memory Performance Gap".)

The split into two camps has its disadvantages as well. Figure 1 shows that while microprocessor performance has been improving at a rate of 60% per year, the access time to DRAM has been improving at less than 10% per year. Hence computer designers are faced with an increasing "Processor-Memory Performance Gap," which is now the primary obstacle to improved computer system performance.

System architects have attempted to bridge the processor-memory performance gap by introducing deeper and deeper cache memory hierarchies; unfortunately, this makes the memory latency even longer in the worst case. For example, Table 1 shows CPU and memory performance in a recent high performance computer system. Note that the main memory latency in this system is a factor of four larger than the raw DRAM access time; this difference is due to the time to drive the address off the microprocessor, the time to multiplex the
addresses to the DRAM, the time to turn around the bidirectional data bus, the overhead of the memory controller, the latency of the SIMM connectors, and the time to drive the DRAM pins first with the address and then with the return data.

Table 1. The latency and bandwidth of the memory system of a high performance computer.

Processor:   Alpha 21164
Machine:     AlphaServer 8200
Clock Rate:  300 MHz

Memory Performance         Latency               Bandwidth
I Cache (8KB on chip)      6.7 ns (2 clocks)     4800 MB/sec
D Cache (8KB on chip)      6.7 ns (2 clocks)     4800 MB/sec
L2 Cache (96KB on chip)    20 ns (6 clocks)      4800 MB/sec
L3 Cache (4MB off chip)    26 ns (8 clocks)      960 MB/sec
Main Memory Subsystem      253 ns (76 clocks)    1200 MB/sec
Single DRAM component      60 ns (18 clocks)     30-100 MB/sec

Table 2. CPI, cache misses, and time spent in Alpha 21164 for four programs.

Category                                 SPECint92   SPECfp92   Database   Sparse
Clocks Per Instruction (CPI)             1.2         1.2        3.6        3.0
I cache misses per 1000 instructions     7           2          97         0
D cache misses per 1000 instructions     25          47         82         38
L2 cache misses per 1000 instructions    11          12         119        36
L3 cache misses per 1000 instructions    0           0          13         23
Fraction of time in processor            0.78        0.68       0.23       0.27
Fraction of time in I cache misses       0.03        0.01       0.16       0.00
Fraction of time in D cache misses       0.13        0.23       0.14       0.08
Fraction of time in L2 cache misses      0.05        0.06       0.20       0.07
Fraction of time in L3 cache misses      0.00        0.02       0.27       0.58

Despite huge on- and off-chip caches and very sophisticated processors with out-of-order, dynamically scheduled superscalar pipelines capable of executing multiple instructions per clock cycle, the long latency and limited bandwidth to main memory dominate performance for many applications. For example, Table 2 shows clock cycles per instruction (CPI), cache misses, and fraction of time spent in each component of the Alpha 21164 for the SPEC92 integer CPU benchmarks, SPEC92 floating point CPU benchmarks, a data base program running a debit
credit benchmark, and a sparse matrix calculation called Sparse Linpack [Cve96]. The database and matrix computations spend about 75% of their time in the memory hierarchy. Although the 21164 is capable of executing 4 instructions per clock cycle for a peak CPI of 0.25, the average CPI for these applications was 3.0 to 3.6. Digital has since started shipping a 437 MHz version of the same processor with the same external memory system; with almost a 50% faster clock, an even larger fraction of application time will be
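
The effect of a faster clock on the Table 2 time breakdown can be illustrated with a rough sketch. It is not a calculation from the paper. It assumes that processor time and the time spent servicing I cache and D cache misses (which are satisfied on chip) shrink in proportion to the clock period, while the time spent on L2 and L3 misses (serviced by the off-chip L3 cache and main memory) stays fixed in nanoseconds. The names `TIME_FRACTIONS`, `latency_in_clocks`, and `off_chip_fraction` are hypothetical helpers introduced only for this illustration.

```python
# Fractions of execution time at 300 MHz, copied from Table 2.
TIME_FRACTIONS = {
    "SPECint92": {"processor": 0.78, "icache_miss": 0.03, "dcache_miss": 0.13,
                  "l2_miss": 0.05, "l3_miss": 0.00},
    "SPECfp92": {"processor": 0.68, "icache_miss": 0.01, "dcache_miss": 0.23,
                 "l2_miss": 0.06, "l3_miss": 0.02},
    "Database": {"processor": 0.23, "icache_miss": 0.16, "dcache_miss": 0.14,
                 "l2_miss": 0.20, "l3_miss": 0.27},
    "Sparse": {"processor": 0.27, "icache_miss": 0.00, "dcache_miss": 0.08,
               "l2_miss": 0.07, "l3_miss": 0.58},
}

# Assumption: these components are on chip and scale with the CPU clock.
ON_CHIP = ("processor", "icache_miss", "dcache_miss")
# Assumption: these components are off chip and fixed in nanoseconds.
OFF_CHIP = ("l2_miss", "l3_miss")


def latency_in_clocks(latency_ns, clock_mhz):
    """Convert a latency in nanoseconds to CPU clock cycles."""
    return latency_ns * clock_mhz / 1000.0


def off_chip_fraction(fractions, old_mhz, new_mhz):
    """Estimate the fraction of time spent off chip after a clock change."""
    scale = old_mhz / new_mhz  # on-chip time shrinks by this factor
    on_chip = sum(fractions[name] for name in ON_CHIP) * scale
    off_chip = sum(fractions[name] for name in OFF_CHIP)
    return off_chip / (on_chip + off_chip)


if __name__ == "__main__":
    for mhz in (300, 437):
        clocks = latency_in_clocks(253, mhz)
        print(f"Main memory latency at {mhz} MHz: {clocks:.0f} clocks")
    for program, fractions in TIME_FRACTIONS.items():
        before = off_chip_fraction(fractions, 300, 300)
        after = off_chip_fraction(fractions, 300, 437)
        print(f"{program}: off-chip share {before:.2f} -> {after:.2f}")
```

Under these assumptions the off-chip share of execution time rises for every program when only the clock rate changes, and the same 253 ns main memory access costs more clock cycles at 437 MHz than at 300 MHz.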