CMU CS 15740 - A Case for Intelligent RAM: IRAM - D592811


A Case for Intelligent RAM: IRAM
(To appear in IEEE Micro, April 1997)

David Patterson, Thomas Anderson, Neal Cardwell, Richard Fromm, Kimberly Keeton, Christoforos Kozyrakis, Randi Thomas, and Katherine Yelick
Computer Science Division/EECS Department
University of California, Berkeley CA 94720-1776
2/8/97
email: [email protected]

Abstract: Two trends call into question the current practice of microprocessors and DRAMs being fabricated as different chips on different fab lines: 1) the gap between processor and DRAM speed is growing at 50% per year; and 2) the size and organization of memory on a single DRAM chip is becoming awkward to use in a system, yet size is growing at 60% per year. Intelligent RAM, or IRAM, merges processing and memory into a single chip to lower memory latency, increase memory bandwidth, and improve energy efficiency as well as to allow more flexible selection of memory size and organization. In addition, IRAM promises savings in power and board area. This paper reviews the state of microprocessors and DRAMs today, explores some of the opportunities and challenges for IRAMs, and finally estimates performance and energy efficiency of three IRAM designs.

1. Introduction and Why there is a Problem

The division of the semiconductor industry into microprocessor and memory camps provides many advantages. First and foremost, a fabrication line can be tailored to the needs of the device. Microprocessor fab lines offer fast transistors to make fast logic and many metal layers to accelerate communication and simplify power distribution, while DRAM fabs offer many polysilicon layers to achieve both small DRAM cells and low leakage current to reduce the DRAM refresh rate.

Separate chips also mean separate packages, allowing microprocessors to use expensive packages that dissipate high power (5 to 50 watts) and provide hundreds of pins to make wide connections to external memory, while allowing DRAMs to use inexpensive packages which dissipate low power (1 watt) and use only a few dozen pins. Separate packages in turn mean computer designers can scale the number of memory chips independent of the number of processors: most desktop systems have 1 processor and 4 to 32 DRAM chips, but most server systems have 2 to 16 processors and 32 to 256 DRAMs. Memory systems have standardized on the Single In-line Memory Module (SIMM) or Dual In-line Memory Module (DIMM), which allows the end user to scale the amount of memory in a system.

Quantitative evidence of the success of the industry is its size: in 1995 DRAMs were a $37B industry and microprocessors were a $20B industry. In addition to financial success, the technologies of these industries have improved at unparalleled rates. DRAM capacity has quadrupled on average every 3 years since 1976, while microprocessor speed has done the same since 1986.

The split into two camps has its disadvantages as well. Figure 1 shows that while microprocessor performance has been improving at a rate of 60% per year, the access time to DRAM has been improving at less than 10% per year. Hence computer designers are faced with an increasing “Processor-Memory Performance Gap,” which is now the primary obstacle to improved computer system performance.

System architects have attempted to bridge the processor-memory performance gap by introducing deeper and deeper cache memory hierarchies; unfortunately, this makes the memory latency even longer in the worst case. For example, Table 1 shows CPU and memory performance in a recent high performance computer system.
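
The growth rates quoted above can be cross-checked with a few lines of arithmetic. The Python sketch below is an illustration and not part of the paper: it assumes steady compound growth, takes the DRAM access-time improvement to be exactly 10% per year (the text only says "less than 10%"), and uses hypothetical function names.

```python
def annual_rate_from_multiple(multiple: float, years: float) -> float:
    """Compound annual growth rate equivalent to growing `multiple`-fold in `years` years."""
    return multiple ** (1.0 / years) - 1.0


def gap_growth_rate(cpu_rate: float, dram_rate: float) -> float:
    """Annual growth of the CPU-to-DRAM performance ratio, given each side's annual rate."""
    return (1.0 + cpu_rate) / (1.0 + dram_rate) - 1.0


# DRAM capacity "has quadrupled on average every 3 years".
capacity_rate = annual_rate_from_multiple(4.0, 3.0)

# Processors improve 60% per year; DRAM is ASSUMED to improve at the 10% upper bound.
gap_rate = gap_growth_rate(0.60, 0.10)

print(f"DRAM capacity growth: {capacity_rate:.1%} per year")
print(f"Processor-memory gap growth: {gap_rate:.1%} per year")
```

Under these assumptions, quadrupling every 3 years works out to roughly 59% per year, consistent with the 60% per year quoted in the abstract. With DRAM at the full 10% the gap widens by about 45% per year; because the actual DRAM improvement is below 10%, the true figure is higher, in line with the roughly 50% per year stated in the abstract.
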
Note that the main memory latency in this system is a factor of four larger than the raw DRAM access time; this difference is due to the time to drive the address off the microprocessor, the time to multiplex the addresses to the DRAM, the time to turn around the bidirectional data bus, the overhead of the memory controller, the latency of the SIMM connectors, and the time to drive the DRAM pins first with the address and then with the return data.

FIGURE 1. Processor-Memory Performance Gap. [Hen96]
(Plot not reproduced: CPU and DRAM curves on a logarithmic vertical axis with ticks at 1, 10, 100, and 1000, against the years 1980 to 2000; the widening space between the curves is labeled "Processor-Memory Performance Gap".)

Table 1: The latency and bandwidth of the memory system of a high performance computer.

Processor:   Alpha 21164
Machine:     AlphaServer 8200
Clock Rate:  300 MHz

Memory Performance         Latency                Bandwidth
I Cache (8KB on chip)      6.7 ns (2 clocks)      4800 MB/sec
D Cache (8KB on chip)      6.7 ns (2 clocks)      4800 MB/sec
L2 Cache (96KB on chip)    20 ns (6 clocks)       4800 MB/sec
L3 Cache (4MB off chip)    26 ns (8 clocks)       960 MB/sec
Main Memory Subsystem      253 ns (76 clocks)     1200 MB/sec
Single DRAM component      ≈ 60 ns (18 clocks)    ≈ 30–100 MB/sec

Table 2: CPI, cache misses, and time spent in Alpha 21164 for four programs

Category                                 SPECint92  SPECfp92  Database  Sparse
Clocks Per Instruction (CPI)                   1.2       1.2       3.6     3.0
I cache misses per 1000 instructions             7         2        97       0
D cache misses per 1000 instructions            25        47        82      38
L2 cache misses per 1000 instructions           11        12       119      36
L3 cache misses per 1000 instructions            0         0        13      23
Fraction of time in processor                 0.78      0.68      0.23    0.27
Fraction of time in I cache misses            0.03      0.01      0.16    0.00
Fraction of time in D cache misses            0.13      0.23      0.14    0.08
Fraction of time in L2 cache misses           0.05      0.06      0.20    0.07
Fraction of time in L3 cache misses           0.00      0.02      0.27    0.58

Despite huge on- and off-chip caches and very sophisticated processors with out-of-order, dynamically scheduled superscalar pipelines capable of executing multiple instructions per clock cycle, the long latency and limited bandwidth to main memory dominate performance for many applications. For example, Table 2 shows clock cycles per instruction (CPI), cache misses, and fraction of time spent in each component of the Alpha 21164 for the SPEC92 integer CPU benchmarks, SPEC92 floating point CPU benchmarks, a database program running a debit-credit benchmark, and a sparse matrix calculation called Sparse Linpack [Cve96]. The database and matrix computations spend about 75% of their time in the memory hierarchy. Although the 21164 is capable of executing 4 instructions per clock cycle for a peak CPI of 0.25, the average CPI for these applications was 3.0 to 3.6. Digital has since started shipping a 437 MHz version of the same processor with
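
The figures in Table 1 and Table 2 can be checked against the claims made in the text. The Python sketch below is illustrative only: the dictionary layout and helper names are hypothetical, the numbers are copied from the two tables, and clock counts are derived from the 300 MHz clock rate.

```python
CLOCK_MHZ = 300.0  # Alpha 21164 clock rate from Table 1

# Latencies from Table 1, in nanoseconds.
LATENCY_NS = {
    "I/D cache": 6.7,
    "L2 cache": 20.0,
    "L3 cache": 26.0,
    "main memory": 253.0,
    "single DRAM": 60.0,
}

# "Fraction of time in processor" row from Table 2.
TIME_IN_PROCESSOR = {
    "SPECint92": 0.78,
    "SPECfp92": 0.68,
    "Database": 0.23,
    "Sparse": 0.27,
}


def ns_to_clocks(latency_ns: float) -> int:
    """Convert a latency in ns to the nearest whole number of clock cycles."""
    return round(latency_ns * CLOCK_MHZ / 1000.0)


def memory_fraction(program: str) -> float:
    """Fraction of run time spent outside the processor, i.e. in the memory hierarchy."""
    return 1.0 - TIME_IN_PROCESSOR[program]


clocks = {level: ns_to_clocks(ns) for level, ns in LATENCY_NS.items()}
latency_ratio = LATENCY_NS["main memory"] / LATENCY_NS["single DRAM"]
peak_cpi = 1.0 / 4.0  # 4 instructions per clock cycle

print("Clocks per level:", clocks)
print(f"Main memory latency / raw DRAM access time: {latency_ratio:.1f}")
for program in ("Database", "Sparse"):
    print(f"{program}: {memory_fraction(program):.0%} of time in the memory hierarchy")
print(f"Peak CPI: {peak_cpi}")
```

The derived clock counts match the ones listed in Table 1. The main memory latency comes out at about 4.2 times the raw DRAM access time (the text's "factor of four"), the Database and Sparse programs spend 77% and 73% of their time outside the processor (the text's "about 75%"), and the measured CPIs of 3.0 to 3.6 are 12 to 14.4 times the peak CPI of 0.25.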
