UMD CMSC 411 - Lecture 15 Memory Hierarchy 2

CMSC 411 Computer Systems Architecture
Lecture 15: Memory Hierarchy 2 (Cache Optimizations)
CMSC 411 - 13 (some from Patterson, Sussman, others)

So Far…
• Fully associative cache
  – A memory block can be stored in any cache block
• Write-through cache
  – A write (store) changes both the cache and main memory right away
  – Reads only require getting the block on a cache miss
• Write-back cache
  – A write changes only the cache
  – A read causes a write of a dirty block to memory on a replacement
• Reads are easy to make fast, writes are harder
  – Read data from the cache in parallel with checking the address against the tag of the cache block
  – A write must verify the address against the tag before updating

Example: Alpha 21064 [figure]

Write Buffers for Write-Through Caches
[Figure: Processor → Cache → Write Buffer → Lower-Level Memory; the write buffer holds data awaiting write-through to lower-level memory]
Q. Why a write buffer? A. So the CPU doesn't stall.
Q. Why a buffer, why not just one register? A. Bursts of writes are common.
Q. Are Read After Write (RAW) hazards an issue for the write buffer? A. Yes!
Drain the buffer before the next read, or send the read first after checking the write buffers.

How much do stalls slow a machine?
• Suppose that on pipelined MIPS, each instruction takes, on average, 2 clock cycles, not counting cache misses
• Suppose, on average, there are 1.33 memory references per instruction, memory access time is 50 cycles, and the miss rate is 2%
• Then each instruction takes, on average: 2 + (0 × .98) + (1.33 × .02 × 50) = 3.33 clock cycles

Memory stalls (cont.)
• To reduce the impact of cache misses, can reduce any of three parameters:
  – Main memory access time (miss penalty)
  – Cache access (hit) time
  – Miss rate

Example: Apple iMac G5 (1.6 GHz)

  Level                  Reg      L1 Inst   L1 Data   L2        DRAM     Disk
  Size                   1K       64K       32K       512K      256M     80G
  Latency (cycles)       1        3         3         11        88       10^7
  Latency (time)         0.6 ns   1.9 ns    1.9 ns    6.9 ns    55 ns    12 ms

• Registers managed by the compiler; caches managed by hardware; DRAM and disk managed by the OS, hardware, and the application
• Goal: the illusion of large, fast, cheap memory; let programs address a memory space that scales to the disk size, at a speed that is usually as fast as register access

iMac's PowerPC 970: all caches on-chip
[Figure: die photo showing the 512K L2, the 64K L1 instruction cache, and the 32K L1 data cache]

Cache miss terminology
• Sometimes cache misses are inevitable:
  – Compulsory miss
    » The first time a block is used, it must be brought into the cache
  – Capacity miss
    » If more blocks are needed at once than fit in the cache, some will bounce in and out
  – Conflict miss
    » In direct-mapped or set-associative caches, certain combinations of addresses cannot be in the cache at the same time

Miss rate
Fig. C.9: SPEC2000 miss rates, LRU replacement

5 Basic Cache Optimizations
• Reducing miss rate
  1. Larger block size (reduces compulsory misses)
  2. Larger cache size (reduces capacity misses)
  3. Higher associativity (reduces conflict misses)
• Reducing miss penalty
  4. Multilevel caches
  5. Giving reads priority over writes
     » E.g., a read completes before earlier writes still sitting in the write buffer

More Terminology
• 'Write-allocate'
  – Ensure the block is in the cache before performing a write operation
• 'Write-no-allocate'
  – Don't allocate a block in the cache if it is not already there

Another write buffer optimization
• Write buffer mechanics, with merging:
  – An entry may contain multiple words (maybe even a whole cache block)
  – If there is an empty entry, the data and address are written to the buffer, and the CPU is done with the write
  – If the buffer contains other modified blocks, check whether the new address matches one already in the buffer; if so, combine the new data with that entry
  – If the buffer is full and there is no address match, the cache and CPU wait for an empty entry to appear (meaning some entry has been written to main memory)
  – Merging improves memory efficiency, since multi-word writes are usually faster than writing one word at a time

Don't wait for the whole block on a cache miss
• Two ways to do this – suppose the 10th word in a block is needed:
  – Early restart
    » Access the required word as soon as it is fetched, instead of waiting for the whole block
  – Critical word first
    » Start the fetch with word 10, and fill in the first few words later

Use a nonblocking cache
• With this optimization, the cache doesn't stop for a miss, but continues to process later requests if possible, even though an earlier one is not yet fulfilled
  – Introduces significant complexity into the cache
architecture: it has to allow multiple outstanding cache requests (maybe even multiple misses)
  – But this is what's done in modern processors

So Far (cont.)
• Reducing memory stalls
  – Reduce miss penalty, miss rate, and cache hit time
• Reducing miss penalty
  – Give priority to read misses over writes
  – Don't wait for the whole block
  – Use a non-blocking cache

Multi-level cache
• For example, if the cache takes 1 clock cycle and memory takes 50, it might be a good idea to add a larger (but necessarily slower) secondary cache in between, perhaps capable of 10-clock-cycle access
• Complicates performance analysis (see H&P), but the 2nd-level cache captures many of the 1st-level cache misses, lowering the effective miss penalty
  – And a 3rd-level cache has the same benefits for the 2nd-level cache
• Most modern machines have separate 1st-level instruction and data caches, and a shared 2nd-level cache
  – And an off-processor-chip shared 3rd-level cache

Victim caches
• Remember cache blocks that have recently been replaced (evicted)
  – Use a small, fully associative cache between a cache and where it gets its data from
  – Check the victim cache on a cache miss, before going to the next lower-level memory
    » If found, swap the victim block and the cache block
  – Reduces conflict misses
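The victim-cache lookup and swap described above can be sketched as a toy simulation. This is a minimal illustration, not any real hardware design: the class name, the 4-set direct-mapped cache, and the 2-entry victim buffer are all assumptions chosen for the example.

```python
from collections import OrderedDict

class DirectMappedWithVictim:
    """Toy direct-mapped cache backed by a small fully associative victim cache (LRU)."""
    def __init__(self, n_sets, victim_entries):
        self.n_sets = n_sets
        self.sets = {}                 # set index -> block address currently cached
        self.victim = OrderedDict()    # recently evicted block addresses, LRU order
        self.victim_entries = victim_entries

    def access(self, block):
        idx = block % self.n_sets
        if self.sets.get(idx) == block:
            return "hit"
        if block in self.victim:       # victim hit: swap the victim and cache blocks
            self.victim.pop(block)
            evicted = self.sets.get(idx)
            if evicted is not None:
                self._insert_victim(evicted)
            self.sets[idx] = block
            return "victim hit"
        evicted = self.sets.get(idx)   # true miss: fetch from lower-level memory
        if evicted is not None:
            self._insert_victim(evicted)
        self.sets[idx] = block
        return "miss"

    def _insert_victim(self, block):
        if len(self.victim) >= self.victim_entries:
            self.victim.popitem(last=False)   # drop the least recently evicted entry
        self.victim[block] = None

c = DirectMappedWithVictim(n_sets=4, victim_entries=2)
# Blocks 0 and 4 conflict (both map to set 0); the victim cache absorbs the ping-pong.
print([c.access(b) for b in [0, 4, 0, 4]])   # ['miss', 'miss', 'victim hit', 'victim hit']
```

Without the victim cache, the last two accesses would both be conflict misses that go all the way to the next memory level.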

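The stall arithmetic from the earlier example (2 + 1.33 × .02 × 50 = 3.33 cycles per instruction) generalizes to a one-line formula; the function name below is an illustrative choice.

```python
def cpi_with_misses(base_cpi, mem_refs_per_instr, miss_rate, miss_penalty):
    """Average clock cycles per instruction, including cache-miss stall cycles."""
    return base_cpi + mem_refs_per_instr * miss_rate * miss_penalty

# The example from the slides: 2-cycle base CPI, 1.33 memory references
# per instruction, 2% miss rate, 50-cycle memory access time.
print(cpi_with_misses(2, 1.33, 0.02, 50))   # about 3.33
```

Note that a modest-sounding 2% miss rate adds 1.33 cycles per instruction here, a 66% slowdown, which is why the following slides focus on miss penalty, hit time, and miss rate.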

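The benefit of the multi-level cache can be seen by nesting the average-memory-access-time (AMAT) formula: the L2's own AMAT becomes the L1's miss penalty. The slide's numbers (1-cycle L1, 10-cycle L2, 50-cycle memory, 2% L1 miss rate) are used below; the 50% L2 local miss rate is an assumed figure for illustration only.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time in cycles."""
    return hit_time + miss_rate * miss_penalty

# Two-level hierarchy: the L2's AMAT stands in for the L1 miss penalty.
l2_penalty = amat(10, 0.5, 50)              # assumed 50% L2 local miss rate -> 35 cycles
print(amat(1, 0.02, l2_penalty))            # with L2: about 1.7 cycles
print(amat(1, 0.02, 50))                    # without L2: about 2.0 cycles
```

Even with a pessimistic L2 local miss rate, the L2 catches enough L1 misses to cut the effective miss penalty noticeably; a 3rd level nests the same way.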

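The merging write-buffer mechanics described earlier can also be sketched as a small simulation. This is a simplified model under stated assumptions (word-addressed writes, 4-word blocks, no drain logic), and the class and method names are illustrative.

```python
class MergingWriteBuffer:
    """Toy write buffer: one entry per block; word writes merge into a pending entry."""
    def __init__(self, n_entries, block_words=4):
        self.n_entries = n_entries
        self.block_words = block_words
        self.entries = {}   # block address -> {word offset: data}

    def write(self, addr, data):
        block, offset = divmod(addr, self.block_words)
        if block in self.entries:            # address match: merge into that entry
            self.entries[block][offset] = data
            return "merged"
        if len(self.entries) >= self.n_entries:
            return "stall"                   # full, no match: CPU waits for a drain
        self.entries[block] = {offset: data}
        return "new entry"

buf = MergingWriteBuffer(n_entries=2)
# Four word writes to the same block occupy one buffer entry instead of four.
print([buf.write(a, a * 10) for a in [0, 1, 2, 3]])
# ['new entry', 'merged', 'merged', 'merged']
```

The merged entry can later be drained to memory as one multi-word write, which is the efficiency gain the slide describes.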