Concurrency, Latency, or System Overhead: Which Has the Largest Impact on Uniprocessor DRAM-System Performance?

Vinodh Cuppu and Bruce Jacob
Dept. of Electrical & Computer Engineering
University of Maryland, College Park
{ramvinod,blj}@eng.umd.edu

Copyright © 2001 IEEE. Published in the Proceedings of the 28th International Symposium on Computer Architecture, June 30–July 4, 2001, in Göteborg, Sweden. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE.

ABSTRACT

Given a fixed CPU architecture and a fixed DRAM timing specification, there is still a large design space for a DRAM system organization. Parameters include the number of memory channels, the bandwidth of each channel, burst sizes, queue sizes and organizations, turnaround overhead, memory-controller page protocol, algorithms for assigning request priorities and scheduling requests dynamically, etc. In this design space, we see a wide variation in application execution times; for example, execution times for the SPEC CPU 2000 integer suite on a 2-way ganged Direct Rambus organization (32 data bits) with 64-byte bursts are 10–20% lower than execution times on an otherwise identical configuration that uses 32-byte bursts. These are two system configurations that are relatively close to each other in the design space; performance differences become even more pronounced for designs further apart. This paper characterizes the sources of overhead in high-performance DRAM systems and investigates the most effective ways to reduce a system's exposure to performance loss. In particular, we look at mechanisms to increase a system's support for concurrent transactions, mechanisms to reduce request latency, and mechanisms to reduce the "system overhead": the portion of the primary memory system's overhead that is not due to DRAM latency but rather to things like turnaround time, request queueing, inefficiencies due to read/write request interleaving, etc. Our simulator models a 2GHz, highly aggressive out-of-order uniprocessor. The interface to the memory system is fully non-blocking, supporting up to 32 outstanding misses at both the level-1 and level-2 caches, and split-transaction busses to all DRAM banks.

1 INTRODUCTION

As many recent studies have shown, the memory system is one of the primary bottlenecks in current systems. Further, a number of studies show that, within the memory system, the memory bus accounts for a substantial portion of the primary memory's overhead. For example, Schumann reports that, in Alpha workstations, 30–60% of primary memory latency is attributable to system overhead rather than to the latency of DRAM components [22]. Brown and Seltzer cite memory-bus turnaround as responsible for a factor-of-two difference between predicted execution time and actual measured execution time on a Pentium Pro system [1]. Cuppu et al. demonstrate the inability of a 128-bit, 100MHz (1.6 GB/s) memory bus to keep up with high-performance DRAMs [5]. Bryg et al. estimate that 20–30% of the Hewlett-Packard memory bus bandwidth is lost to dead cycles in back-to-back read/write transactions [2].

There are a number of paths developers and researchers have taken to reduce the overhead of the primary memory system. These have largely been divided into approaches focused on the DRAM component and those focused on the system or bus component. For example, a simple DRAM-oriented approach has been to increase DRAM bandwidth. This is the tack taken by the PC industry recently, with the widespread shift from 800 MB/s PC100 SDRAM systems to 1.1 GB/s PC133, 1.6 GB/s Direct Rambus, and 2.1 GB/s DDR266 SDRAM systems. This brings the memory bandwidth of the PC up to that of traditional RISC workstations, such as several UltraSPARC and Alpha models, and to within an order of magnitude of many server-class machines.
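As a quick sanity check on the figures just quoted, peak bus bandwidth is simply bus width times transfer rate. The minimal C sketch below recomputes the numbers above; the struct layout and the per-configuration width/rate values are illustrative assumptions (e.g., treating DDR266 as a 64-bit bus at 266 MT/s and Direct Rambus as a 16-bit channel at 800 MT/s), not parameters taken from the paper.

```c
/* Peak-bandwidth check for the bus configurations quoted in the text.
 * bandwidth = (bus width in bytes) * (data transfers per second);
 * MB and GB are decimal (10^6, 10^9), matching the figures above.
 * The configuration list itself is illustrative, not from the paper. */
#include <stdio.h>

struct bus_config {
    const char *name;
    int width_bits;          /* data-bus width */
    double transfers_per_s;  /* clock rate x transfers per clock */
};

int main(void) {
    const struct bus_config cfgs[] = {
        { "PC100 SDRAM (64-bit @ 100 MHz)",       64, 100e6 },
        { "PC133 SDRAM (64-bit @ 133 MHz)",       64, 133e6 },
        { "DDR266 SDRAM (64-bit @ 266 MT/s)",     64, 266e6 },
        { "Direct Rambus (16-bit @ 800 MT/s)",    16, 800e6 },
        { "128-bit 100 MHz memory bus of [5]",   128, 100e6 },
    };
    for (size_t i = 0; i < sizeof cfgs / sizeof cfgs[0]; i++) {
        double bytes_per_s = (cfgs[i].width_bits / 8.0) * cfgs[i].transfers_per_s;
        printf("%-36s %5.2f GB/s\n", cfgs[i].name, bytes_per_s / 1e9);
    }
    return 0;
}
```

With these assumed widths and rates, the program reproduces the 0.8, 1.1, 2.1, and 1.6 GB/s figures cited in the text.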
Another approach is to reduce DRAM latency. DRAM vendors have recently announced numerous core variations that improve access time. For example, Enhanced Memory Systems' ESDRAM improves performance over regular SDRAM by adding an SRAM cache for the full row buffer, thereby allowing precharge to begin immediately after an access and allowing DRAM writes to go directly to the core without destroying read locality [5, 7, 6]. Fujitsu's FCRAM subdivides each internal bank by activating only a portion of each word line, thereby reducing capacitance on the word access and improving access time over that of standard SDRAM to roughly 30ns [9, 10]. MoSys takes this a step further and subdivides the on-chip storage into a large number of very small banks (on the order of 32KB each), reducing the access time of the DRAM core to nearly that of SRAM [39, 19, 10]. Several vendors have placed large amounts of SRAM onto the DRAM die, in addition to the row buffers, in an attempt to reduce latency. For example, NEC's VCDRAM places a set-associative SRAM buffer on the die that holds an implementation-defined number of sub-pages (typically 10–100), where a sub-page is a subset of the bits activated by a column access and is on the order of 16–32 bytes [6, 10].

Recent studies show that these DRAM-oriented approaches do reduce application execution time [5, 7]. However, focusing on the DRAM alone is not enough; we note that, even with zero-latency DRAM access, the overhead of the primary memory system would not reduce to zero, because bus transactions still require time. To begin with, there is the obvious time to transfer addresses and data over the bus to and from the DRAM subsystem. In addition, factors such as turnaround time, queueing delays, and inefficiencies due to asymmetric read/write request shapes on an in-order bus all add together to produce a sizable overhead. In multiprocessor systems, the overhead is even larger, due to arbitration and cache-coherency protocols; moreover, many uniprocessor systems share the memory bus with graphics chips in an organization that effectively makes the uniprocessor system behave like a multiprocessor.

In addition to efforts aimed at improving DRAM devices, we must also improve the connection between the CPU and the DRAM devices; that is, we must reduce the overhead of the memory bus. There are a number of approaches one can take, including changing the …
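To make concrete the claim above that bus transactions cost time even with a zero-latency DRAM core, here is a toy C model of an in-order shared bus: every transaction pays address- and data-transfer cycles, and each read/write direction switch inserts dead turnaround cycles. All cycle counts and the request mix are invented for illustration; with the assumed values, an alternating read/write stream loses roughly a quarter of all bus cycles to turnaround, qualitatively in line with the 20–30% figure Bryg et al. report [2].

```c
/* Minimal sketch of bus "system overhead" with zero-latency DRAM:
 * every transaction still pays address + data transfer cycles, and a
 * read<->write direction switch pays dead turnaround cycles.  All cycle
 * counts and the request stream are made-up illustrations, not
 * parameters from the paper. */
#include <stdio.h>

enum op { READ, WRITE };

int main(void) {
    const int addr_cycles = 1;  /* address transfer (assumed value) */
    const int data_cycles = 4;  /* e.g., 64-byte burst on a 16-byte bus (assumed) */
    const int turnaround  = 2;  /* dead cycles per direction switch (assumed) */

    /* An alternating read/write stream is the worst case for turnaround. */
    const enum op stream[] = { READ, WRITE, READ, WRITE,
                               READ, WRITE, READ, WRITE };
    const int n = (int)(sizeof stream / sizeof stream[0]);

    long busy = 0, dead = 0;
    for (int i = 0; i < n; i++) {
        if (i > 0 && stream[i] != stream[i - 1])
            dead += turnaround;             /* bus idles while drivers turn around */
        busy += addr_cycles + data_cycles;  /* transfer time is paid regardless */
    }
    printf("busy = %ld cycles, dead = %ld cycles -> %.0f%% of bus time lost\n",
           busy, dead, 100.0 * dead / (busy + dead));
    return 0;
}
```

For this made-up stream, 14 of 54 bus cycles (about 26%) are dead time; note the loss would persist even if the DRAM core itself responded instantly.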

