G22.2243-001 High Performance Computer Architecture
Lecture 8: Memory Hierarchy Design (cont'd)
October 24, 2007

Outline
• Announcements
  – Lab Assignment 2 due back next week, Oct 31st
  – Lab Assignment 3 will be available that day
• Memory hierarchy design
  – Cache optimizations
  – Main memory
[Hennessy/Patterson CA:AQA (4th Edition): Chapter 5]

Recap: Improving Cache Performance
CPU time = (CPU execution clock cycles + Memory stall clock cycles) x Clock cycle time
Memory stall clock cycles
  = Reads x Read miss rate x Read miss penalty + Writes x Write miss rate x Write miss penalty
  = Memory accesses x Miss rate x Miss penalty
• The above assumes a 1-cycle cache hit
  – Hard to achieve in current-day processors (faster clocks, larger caches)
  – More reasonable to also include hit time in the performance equation:
Average memory access time = Hit time + Miss rate x Miss penalty
[Figure: taxonomy of cache optimizations in four groups: (A) small/simple caches, avoiding address translation, pipelined cache access, trace caches; (B) larger block size, larger cache size, higher associativity, way prediction, compiler optimizations; (C) multilevel caches, critical word first, read miss before write miss, merging write buffers, victim caches; (D) nonblocking caches, hardware prefetching, compiler prefetching]

Recap: Cache Optimization Summary
Technique                             MP   MR   HT   Complexity
Multilevel caches                     +              2
Early restart & critical word 1st     +              2
Priority to read misses               +              1
Merging write buffer                  +              1
Victim caches                         +    +         2
Larger block size                     –    +         0
Higher associativity                       +    –    1
Pseudo-associative caches                  +         2
Compiler techniques to reduce misses       +         0
Non-blocking caches                   +              3
HW prefetching of instr/data          +    +         2/3
Compiler-controlled prefetching       +    +         3
Avoiding address translation                    +    2
Trace cache                                     +    3
(MP = miss penalty, MR = miss rate, HT = hit time; "+" improves, "–" hurts)

Static Random Access Memory (SRAM)
• A type of semiconductor memory, used for caches
• Retains its contents as long as power remains applied → static, unlike dynamic RAM (DRAM), which must be periodically refreshed
• Nevertheless, it is volatile memory
• Six transistors per memory cell (bit)
• Gate-level access time
[Figure: SRAM cell with In/Out lines]

Dynamic RAM (DRAM)
• Uses fewer transistors (one transistor per cell)
• Uses a capacitor
• Changes in the charge are detected and amplified
• Cheaper than SRAM
• Slower than SRAM

DRAM (cont'd)
• Reading is destructive
  – When a bit is read, the stored value is destroyed (if it was a zero)
  – After reading a bit, it must be re-written (if it was a zero; in practice it is easier to re-write the value in all cases)
• Stored values must be periodically refreshed
  – Bits are stored as charge on a capacitor, and that charge leaks away over time, so the stored values must periodically be re-written (refreshed) → dynamic
  – Refresh is typically managed by the memory subsystem
• Reading is a relatively slow process
  – Because of the need to re-write the contents of a bit that has been read, a stabilization period is required after reading
  – The process of detecting changes via sense amplifiers is quite slow

Cache vs. Main Memory
• Cache uses SRAM (Static Random Access Memory)
  – No refresh (6 transistors/bit vs. 1 transistor/bit; more wires; area is ~10X)
  – Address not divided
• Main memory is DRAM (Dynamic Random Access Memory)
  – Dynamic, since it needs to be refreshed periodically (~8 ms, <5% of time)
  – Addresses are divided into two parts (memory as a 2D matrix), sent one at a time to reduce the number of address pins:
    • RAS (Row Access Strobe) and CAS (Column Access Strobe)
• Performance of main memory:
  – Latency
    • Access time: time between a request and when the word arrives
    • Cycle time: time between requests
  – Bandwidth
• DRAM/SRAM size ratio of 4-8 for comparable technologies; SRAM/DRAM cost and cycle-time ratio of 8-16

Internal Organization of a 256-Mbit DRAM
• Internally, might use banks of memory arrays
  – E.g., 256 1024x1024 arrays, or 64 2048x2048 arrays
• Normally packaged as dual inline memory modules (DIMMs)
  – Typically 4-16 DRAM chips, 8 bytes wide
[Figure: row access affects latency (~40 ns); column access affects bandwidth (~5 ns)]

Improving Memory Performance in a DRAM
• Increasingly important because there are fewer chips per system
Evolutionary approaches:
• Fast page mode
  – Allows multiple CAS accesses without an intervening RAS
    • Optimizes sequential access by exploiting the row buffer (1024-2048 bits)
  – Extended Data Out (EDO): 30% faster in page mode
    • Overlaps data output with CAS toggling
• Synchronous DRAM (SDRAM)
  – Avoids the need for handshaking between the chip and the memory controller
  – The chip also has a register with the number of requested bytes; these are transmitted without explicit requests from the controller
• Double Data Rate (DDR) DRAM
  – Transmits data from the chip on both the falling and rising edges of the clock signal
  – DDR2 is the next-generation DDR memory technology, featuring faster speeds, higher data bandwidths, lower power consumption, and enhanced thermal performance

DRAM History
• DRAMs: capacity +60%/yr, cost –30%/yr
  – 2.5X cells/area, 1.5X die size in ~3 years
• Rely on increasing numbers of computers and more memory per computer
  – The SIMM or DIMM is the replaceable unit
  – Computers can use any generation of DRAM
  – Growth is slowing because demand is coming down
• Commodity industry
  – High volume, low profit, conservative
  – Little organizational innovation in 20 years
• Order of importance: (primary) cost/bit, (secondary) capacity
  – First RAMBUS: 10X bandwidth, +30% cost, but little impact

Higher Bandwidths
[Figure only]

Error Correction
• Motivation:
  – Failures per unit time are proportional to the number of bits
  – As DRAM cells shrink, they become more vulnerable
• There was a period in which the failure rate was low enough that people didn't do correction
  – DRAM banks are too large now
  – Servers have always used corrected memory systems
• Basic idea: add redundancy through parity bits
  – Simple but wasteful version:
    • Keep three copies of everything; vote to find the right value
    • 200% overhead
  – ECC (error-correcting code) SDRAM is memory that can detect and correct some SDRAM errors
    • It replaced parity memory, which could only detect, but not correct, errors
    • Most ECC SDRAMs can correct single-bit errors and detect, but not correct, larger errors
  – One example: 64 data bits + 8 parity bits (11% overhead)

Improving Main Memory Performance
• Making memory faster has been difficult
• At least try to get it to transfer a lot of data → higher memory bandwidth
1) Wider main memory:
  – Timing model: 4 to send address, 56 access …
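The performance equations from the recap slide are easy to check numerically. The sketch below plugs in illustrative numbers; the hit time, miss rates, penalties, and access counts are made up for the example, not taken from the lecture:

```python
# All numbers below are illustrative, not measurements from the lecture.
hit_time = 1          # cycles to hit in the cache
miss_rate = 0.05      # fraction of accesses that miss
miss_penalty = 40     # cycles to service a miss from main memory

# Average memory access time = Hit time + Miss rate x Miss penalty
amat = hit_time + miss_rate * miss_penalty
print(amat)  # 3.0 cycles per access on average

# Memory stall cycles, split into read and write components as on the slide
reads, read_miss_rate, read_penalty = 1_000_000, 0.05, 40
writes, write_miss_rate, write_penalty = 400_000, 0.02, 40
stall_cycles = (reads * read_miss_rate * read_penalty
                + writes * write_miss_rate * write_penalty)

# CPU time = (execution cycles + memory stall cycles) x clock cycle time
exec_cycles = 5_000_000
cycle_time_ns = 0.5   # a 2 GHz clock, again hypothetical
cpu_time_ns = (exec_cycles + stall_cycles) * cycle_time_ns
print(stall_cycles, cpu_time_ns)
```

With these made-up numbers, miss traffic triples the effective access time (1 cycle to 3.0 cycles), which is why the optimization summary attacks miss penalty, miss rate, and hit time separately.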
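The refresh figures on the slides (~8 ms period, <5% of time) can be sanity-checked against the array numbers. This is a rough model under stated assumptions: each of the 2048 rows of one 2048x2048 array costs about one ~40 ns row access to refresh, and real refresh scheduling is more involved:

```python
# Rough refresh-overhead estimate from the slides' own numbers.
rows = 2048              # one 2048x2048 array
row_access_ns = 40       # ~40 ns per row access
refresh_period_ns = 8e6  # everything refreshed roughly every 8 ms

# Refreshing a row amounts to reading it (destructively) and writing it
# back, so approximate one row refresh by one row access.
busy_ns = rows * row_access_ns
overhead = busy_ns / refresh_period_ns
print(overhead)  # about 0.01, i.e. ~1% of time, comfortably under 5%
```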
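The RAS/CAS address split can be sketched in a few lines: the controller sends the row half first (with RAS) and the column half second (with CAS), so the chip needs only half the address pins. This assumes the 2048x2048 array organization mentioned on the slides; the function name and bit widths are illustrative:

```python
ROW_BITS = 11  # 2048 rows    in a 2048x2048 array
COL_BITS = 11  # 2048 columns in a 2048x2048 array

def split_address(addr):
    """Split a flat cell address into the (row, column) halves that are
    strobed onto the address pins one at a time (RAS first, then CAS)."""
    row = (addr >> COL_BITS) & ((1 << ROW_BITS) - 1)
    col = addr & ((1 << COL_BITS) - 1)
    return row, col

# Example: cell at row 3, column 57 of the array
print(split_address((3 << COL_BITS) | 57))  # (3, 57)
```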
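The two redundancy schemes contrasted on the error-correction slide, triple-copy voting and plain parity, can be sketched as below. Function names are illustrative, and a real ECC SDRAM uses a Hamming-style code over 64 data + 8 check bits rather than either of these:

```python
def vote(a, b, c):
    # Majority vote over three stored copies of one bit: masks any
    # single-copy error, at the slide's 200% storage overhead.
    return 1 if a + b + c >= 2 else 0

def even_parity(word, width=64):
    # One even-parity bit over `width` data bits: detects (but cannot
    # locate or correct) any single-bit error -- the scheme ECC replaced.
    return bin(word & ((1 << width) - 1)).count("1") & 1
```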



NYU CSCI-GA 2243 - Memory Hierarchy Design
