CS6810, School of Computing, University of Utah

1. Caches
• Today's topics:
  » basics: the memory hierarchy and locality
  » cache models and associativity options
  » calculating miss penalties
  » some fundamental optimization issues

2. The Problem
• Widening memory gap
  » DRAM latency CAGR = 7%
  » CPU performance CAGR ≈ 25% prior to 1986, 52% from 1986–2005, 20% thereafter

3. The Solution
• Deepen the memory hierarchy
  » make dependence on DRAM latency the rare case, if you can
  (figure: memory hierarchy, multi-ported vs. single-ported structures)

4. Balancing Act
• As always: performance vs. cost, capacity vs. latency, …
  » on-die SRAM is expensive per bit
    • latency depends on size and organization
    • area: 6T/bit
    • power: not so bad, since only a small portion is active per cycle
  » DRAM
    • latency is awful in cycles at GHz clock speeds
    • area = 1T + 1C per bit (lots of details later)
• Dealing with cache size vs. latency: deepen the cache hierarchy
  » small separate IL1$ and DL1$: minimize memory structural stalls
  » a unified L2 reduces fragmentation problems
  » multicore introduces a global L3
    • may not continue to be a good idea
    • everything runs out of steam at some point
      – the key is what % of L1 misses hit in L2 (and, by induction, what % of Ln misses hit in Ln+1)
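The fall-through behavior above (try L1, miss, try L2, miss, then DRAM) can be sketched as an average-memory-access-time calculation. This is a minimal sketch: the latencies and hit rates below are hypothetical illustration values, not figures from the slides.

```python
# Sketch: average memory access time (AMAT) for a serial-lookup hierarchy.
# All latencies (in cycles) and hit rates are invented for illustration.

def amat(levels, dram_latency):
    """levels: list of (hit_rate, latency_cycles) from L1 downward.
    Every access that reaches a level pays that level's latency;
    only its misses continue deeper. The final misses go to DRAM."""
    total = 0.0
    reach = 1.0  # probability an access gets this far down the hierarchy
    for hit_rate, latency in levels:
        total += reach * latency
        reach *= (1.0 - hit_rate)
    return total + reach * dram_latency

# Hypothetical: L1 hits 95% at 2 cycles; 80% of L1 misses hit in L2 at 12 cycles
print(amat([(0.95, 2), (0.80, 12)], dram_latency=200))  # about 4.6 cycles
```

Note how a 200-cycle DRAM access barely moves the average only because so few accesses reach it, which is exactly the "% of L1 misses that hit in L2" question from the slide.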
6. Locality
• 2 basic types
  » temporal: if you used this block, you will use it again soon
  » spatial: if you used this block, you will use a nearby one soon
• Exploiting locality
  » match the application's memory usage to the cache design – if you match, you win, simple as that
  » some codes are touch-once, so every access falls through as a miss
    • e.g. encrypt/decrypt, encode/decode, … (media influence)
    • here caches are a liability
      – you have to swing to miss: try L1 – miss? – try L2 – miss? …
      – a lot of extra time if you end up going to DRAM anyway
• Historical tidbit
  » Seymour Cray didn't believe in caches or DRAM
    • or even 2's complement, initially

7. Performance Issues
• Basics
    CPU_time = IC × CPI × cycle_time
    cycle_time = 1 / frequency
    IPC = 1 / CPI
    CPU_time = IC / (IPC × frequency)
• Enter stalls
  » IPC gets degraded by stalls
    • we've seen stalls due to hazards and pipeline issues
    • now the focus is on memory stalls
      – their influence depends on the load & store %
      – with good branch prediction, most stalls are memory induced

8. Computing Memory Effect
• Misses induce stalls
  » ILP & TLP can overlap some of the penalty
    • but a miss is a surprise, so some of the penalty won't be hidden
    XEQ_time = (CPU_cycles + MEM_stall_cycles) × cycle_time
    MEM_stall_cycles = num_misses × miss_penalty
    MEM_stall_cycles = IC × (misses / instruction) × miss_penalty
    MEM_stall_cycles = IC × (memory_accesses / instruction) × miss_rate × miss_penalty
  » separating reads and writes:
    MEM_stall_cycles = READ_stalls + WRITE_stalls
    READ_stalls = IC × (reads / instruction) × read_miss_rate × read_miss_penalty
    WRITE_stalls = IC × (writes / instruction) × write_miss_rate × write_miss_penalty

9. 4 Questions = $ Organization
• Q1: Where can a block be placed? (examples shortly)
  » fully associative: answer = anywhere
  » direct mapped: answer = only one place
  » set-associative: in a small number of "ways"
• Q2: How is a block found?
  » 2 types of tags
    • status: valid and dirty (for now)
    • address: the chance of an alias must be resolved – next slide
• Q3: Which block is replaced on a miss?
  » LRU – best but expensive – matches the temporal-locality model
  » FIFO – good but expensive – like LRU, but ordered by first touch rather than last use
  » random – defeats temporal locality but simple to implement
  » approximation by epochs – add a "use" status tag
• Q4: What happens on a write?
  » hit: write-through or write-back (requires the dirty flag)
  » miss: write-replace or write-around (a.k.a. write_allocate or write_no_allocate; modern CPUs typically allow both)

10. Cache Components
• Cache block or line
  » size varies – 64B is a common choice
  » no reason why the block size can't be larger the deeper you go in the memory hierarchy
    • cache lines are typically the same size at every level – reduces some complexity
    • memory-to-cache transfers are in line-sized chunks
    • disk-to-memory transfers are in page-sized chunks
• 2 main structures & a bunch of logic
  » data RAM – holds the cached data
  » tag RAM – holds tag information
    • same number of "entries" as the data RAM
      – entry = line for direct mapped or fully associative
      – entry = set for set-associative
    • for set-associative, the width is the number of ways: one address tag per way in each set
    • status tags are present as well

11. Block Identification
• Address tag
  » held in the tag RAM
  » size is what's left over after the index and block-offset bits
• Index
  » log2(number of data RAM entries)
• Block offset
  » says which byte, half-word, or word is to be moved to the target register
    • silly in a way – a word or double is transferred to the register
    • the appropriate byte or half-word is then used for the op
  » size = log2(line size)
• Increasing the offset or index bits reduces the tag size

12. Cache Access
(figure: direct-mapped lookup of byte address 101000 – each address maps to a unique location; 8-byte words give a 3-bit offset, and 8 sets give 3 index bits selecting one entry in the data array. Alias problem?)

13. De-Alias by Matching Address Tag
(figure: the same 8-byte-word direct-mapped cache and address 101000, now with the stored address tag compared on lookup)
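The lookup and de-aliasing on the last two slides can be sketched in a few lines. This is a sketch under the slide's stated sizes (8-byte words, 8 sets, so 3 offset bits and 3 index bits); the second address is an invented example chosen to collide with 101000.

```python
# Direct-mapped address decomposition: offset | index | tag, from low
# bits to high. Sizes match the slide: 8-byte words, 8 sets.
OFFSET_BITS, INDEX_BITS = 3, 3

def decompose(addr):
    offset = addr & 0b111                      # which byte in the word
    index = (addr >> OFFSET_BITS) & 0b111      # which set in the data array
    tag = addr >> (OFFSET_BITS + INDEX_BITS)   # leftover high bits
    return tag, index, offset

a = 0b101000    # the slide's address
b = 0b1101000   # invented alias: same index, different high bits
ta, ia, _ = decompose(a)
tb, ib, _ = decompose(b)
print(ia == ib, ta == tb)  # same set, but the address tags differ
```

Both addresses land in set 5, which is the alias problem; comparing the stored tag against the tag bits of the incoming address is what tells the two apart.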
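Pulling the pieces together, the stall accounting from the "Computing Memory Effect" slide can be evaluated numerically. A minimal sketch: every workload number here (accesses per instruction, miss rate, penalty) is an invented illustration value, not a measurement.

```python
# Execution time with memory stalls, following
#   XEQ_time = (CPU_cycles + MEM_stall_cycles) * cycle_time
#   MEM_stall_cycles = IC * (accesses/instr) * miss_rate * miss_penalty
# All inputs below are hypothetical.

def exec_time_secs(ic, base_cpi, freq_hz, accesses_per_instr,
                   miss_rate, miss_penalty_cycles):
    cpu_cycles = ic * base_cpi
    mem_stall_cycles = ic * accesses_per_instr * miss_rate * miss_penalty_cycles
    return (cpu_cycles + mem_stall_cycles) / freq_hz  # cycle_time = 1/frequency

# 10^9 instructions, base CPI 1.0, 1 GHz clock,
# 1.3 memory accesses per instruction, 2% miss rate, 100-cycle penalty
print(exec_time_secs(1e9, 1.0, 1e9, 1.3, 0.02, 100))
```

With these made-up numbers the stall term dominates: 1 s of compute becomes 3.6 s of execution, which is why the slides treat memory stalls as the main residual once branch prediction is good.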