Berkeley COMPSCI 252 - Advanced Memory Hierarchy

EECS 252 Graduate Computer Architecture
Lec 16 – Advanced Memory Hierarchy
David Patterson
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~pattrsn
http://vlsi.cs.berkeley.edu/cs252-s06
4/3/2006

Outline
• 11 Advanced Cache Optimizations
• Administrivia
• Memory Technology and DRAM Optimizations
• Virtual Machines
• Xen VM: Design and Performance
• Conclusion

Why More on Memory Hierarchy?
[Figure: processor vs. memory performance, 1980–2010, log scale; the processor–memory performance gap keeps growing.]

Review: 6 Basic Cache Optimizations
• Reducing hit time
  1. Giving reads priority over writes
     – E.g., a read completes before earlier writes still in the write buffer
  2. Avoiding address translation during cache indexing
• Reducing miss penalty
  3. Multilevel caches
• Reducing miss rate
  4. Larger block size (compulsory misses)
  5. Larger cache size (capacity misses)
  6. Higher associativity (conflict misses)

11 Advanced Cache Optimizations
• Reducing hit time
  1. Small and simple caches
  2. Way prediction
  3. Trace caches
• Increasing cache bandwidth
  4. Pipelined caches
  5. Multibanked caches
  6. Nonblocking caches
• Reducing miss penalty
  7. Critical word first
  8. Merging write buffers
• Reducing miss rate
  9. Compiler optimizations
• Reducing miss penalty or miss rate via parallelism
  10. Hardware prefetching
  11. Compiler prefetching
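The categories above group each optimization by the term of the average memory access time it attacks: AMAT = hit time + miss rate × miss penalty. A minimal sketch of that relationship, including why a multilevel cache (optimization 3) reduces the miss penalty seen by L1; all cache parameters here are illustrative assumptions, not measured values:

```python
# AMAT = hit time + miss rate * miss penalty.
# All parameters below are illustrative assumptions, not measurements.

def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in cycles."""
    return hit_time + miss_rate * miss_penalty

# Single-level cache: 1-cycle hit, 5% miss rate, 100-cycle memory access.
single = amat(1.0, 0.05, 100.0)

# A multilevel cache reduces the miss *penalty* seen by L1: an L1 miss now
# costs the L2's own AMAT rather than a full trip to memory.
l2_amat = amat(10.0, 0.20, 100.0)     # L2: 10-cycle hit, 20% local miss rate
two_level = amat(1.0, 0.05, l2_amat)

print(single, two_level)              # 6.0 2.5
```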
1. Fast Hit Times via Small and Simple Caches
• Indexing the tag memory and then comparing takes time
• ⇒ A small cache can help hit time, since a smaller memory takes less time to index
  – E.g., L1 caches are the same size for 3 generations of AMD microprocessors: K6, Athlon, and Opteron
  – Also, an L2 cache small enough to fit on chip with the processor avoids the time penalty of going off chip
• Simple ⇒ direct mapped
  – Can overlap the tag check with data transmission, since there is no choice of block
• Access time estimates for 90 nm using the CACTI model 4.0
  – Median ratios of access time relative to the direct-mapped caches are 1.32, 1.39, and 1.43 for 2-way, 4-way, and 8-way caches
[Figure: access time (ns) vs. cache size (16 KB–1 MB) for 1-way, 2-way, 4-way, and 8-way caches.]

2. Fast Hit Times via Way Prediction
• How to combine the fast hit time of a direct-mapped cache with the lower conflict misses of a 2-way set-associative cache?
• Way prediction: keep extra bits in the cache to predict the “way,” or block within the set, of the next cache access
  – The multiplexor is set early to select the desired block; only 1 tag comparison is performed that clock cycle, in parallel with reading the cache data
  – Miss ⇒ first check the other blocks for matches in the next clock cycle
• Accuracy ≈ 85%
• Drawback: CPU pipelining is hard if a hit can take 1 or 2 cycles
  – Used for instruction caches rather than data caches
[Diagram: hit time, way-miss hit time, miss penalty.]

3. Fast Hit Times via Trace Cache (Pentium 4 only; and last time?)
• Find more instruction-level parallelism? How to avoid translation from x86 to micro-ops?
• Trace cache in Pentium 4:
  1. Caches dynamic traces of the executed instructions vs. static sequences of instructions as determined by layout in memory
     – Built-in branch predictor
  2. Caches the micro-ops vs. x86 instructions
     – Decode/translate from x86 to micro-ops on a trace cache miss
+ ⇒ better utilizes long blocks (don’t exit in the middle of a block, don’t enter at a label in the middle of a block)
− ⇒ complicated address mapping, since addresses are no longer aligned to power-of-2 multiples of the word size
− ⇒ instructions may appear multiple times in multiple dynamic traces due to different branch outcomes

4. Increasing Cache Bandwidth by Pipelining
• Pipeline cache access to maintain bandwidth, at the cost of higher latency
• Instruction cache access pipeline stages:
  – 1: Pentium
  – 2: Pentium Pro through Pentium III
  – 4: Pentium 4
− ⇒ greater penalty on mispredicted branches
− ⇒ more clock cycles between the issue of a load and the use of its data

5. Increasing Cache Bandwidth: Non-Blocking Caches
• A non-blocking cache (or lockup-free cache) allows the data cache to continue to supply cache hits during a miss
  – Requires full/empty (F/E) bits on registers or out-of-order execution
  – Requires multi-bank memories
• “Hit under miss” reduces the effective miss penalty by working during the miss instead of ignoring CPU requests
• “Hit under multiple miss” or “miss under miss” may further lower the effective miss penalty by overlapping multiple misses
  – Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
  – Requires multiple memory banks (otherwise cannot be supported)
  – The Pentium Pro allows 4 outstanding memory misses

Value of Hit Under Miss for SPEC (old data)
• FP programs on average: AMAT = 0.68 → 0.52 → 0.34 → 0.26
• Int programs on average: AMAT = 0.24 → 0.20 → 0.19 → 0.19
• 8 KB data cache, direct mapped, 32 B blocks, 16-cycle miss, SPEC92
[Figure: “Hit under n Misses” AMAT per SPEC92 benchmark for n = 0→1, 1→2, and 2→64, vs. the blocking base case.]
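The AMAT improvements above come from overlapping miss latency with useful work. A toy model of why “hit under miss” helps, assuming (my simplification, not the slide's) that the processor can always find independent work during a miss, so it only stalls when a second miss finds one already in flight; the trace format and the 16-cycle latency match the slide's setup but are otherwise illustrative:

```python
# Toy model of CPU stall cycles: blocking vs. hit-under-miss cache.
# trace: one memory access per cycle; True marks a cache miss.
# Assumes the CPU can always execute independent work during a miss,
# stalling only when a second miss finds one already outstanding.
MISS_LATENCY = 16   # cycles, matching the slide's 16-cycle miss

def stall_cycles(trace, hit_under_miss=False):
    stalls, cycle, returns_at = 0, 0, 0
    for is_miss in trace:
        if is_miss:
            if not hit_under_miss:
                stalls += MISS_LATENCY             # blocking: full latency
            else:
                wait = max(returns_at - cycle, 0)  # drain the in-flight miss
                stalls += wait
                cycle += wait
                returns_at = cycle + MISS_LATENCY  # new miss overlaps hits
        cycle += 1
    return stalls

# Two misses 8 cycles apart among hits:
trace = [True] + [False] * 7 + [True] + [False] * 7
print(stall_cycles(trace), stall_cycles(trace, hit_under_miss=True))  # 32 8
```

The blocking cache pays the full 16 cycles twice; the non-blocking cache hides the first miss entirely and pays only the 8-cycle overlap on the second.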
6. Increasing Cache Bandwidth via Multiple Banks
• Rather than treating the cache as a single monolithic block, divide it into independent banks that can support simultaneous accesses
  – E.g., the T1 (“Niagara”) L2 has 4 banks
• Banking works best when accesses naturally spread themselves across the banks ⇒ the mapping of addresses to banks affects the behavior of the memory system
• A simple mapping that works well is “sequential interleaving”
  – Spread block addresses sequentially across the banks
  – E.g., with 4 banks, bank 0 has all blocks whose address modulo 4 is 0; bank 1 has all blocks whose address modulo 4 is 1; …

7. Reduce Miss Penalty: Early Restart and Critical Word First
• Don’t wait for the full block before restarting the CPU
• Early restart — as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
  – Spatial locality ⇒ we tend to want the next sequential word, so it is not clear …
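The “sequential interleaving” mapping from optimization 6 above is simply block address modulo the number of banks. A minimal sketch with the slide's 4-bank example (the function names are my own, for illustration):

```python
# Sketch of sequential interleaving across cache banks: the block address
# modulo the number of banks selects the bank (4 banks, per the slide).
# Function names here are illustrative, not from the lecture.
NUM_BANKS = 4

def bank_of(block_addr):
    return block_addr % NUM_BANKS

def banks_hit(block_addrs):
    """Bank touched by each access; accesses sharing a bank must serialize."""
    return [bank_of(a) for a in block_addrs]

# Sequential block addresses spread perfectly across the 4 banks:
print(banks_hit(range(8)))           # [0, 1, 2, 3, 0, 1, 2, 3]
# A stride equal to the bank count defeats the mapping (all hit bank 0):
print(banks_hit(range(0, 32, 4)))    # [0, 0, 0, 0, 0, 0, 0, 0]
```

The second call shows why the slide says the address-to-bank mapping shapes memory-system behavior: a power-of-2 stride can concentrate all accesses on one bank.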

