U of U CS 6810 - Cache Optimization


Slide 1: Cache Optimization
Today's topics: a look at memory access times
• improve hit time
• reduce miss rate
• reduce miss penalty
Full disclosure: I'm winging this one; more detail on the whiteboard than on the slides.

Slide 2: Optimize What?
• Basic quantitative metric
  - but in a parallel world it's about exposed latency
  - note that miss penalty = f(transfer_rate/bandwidth, latency_next_lower_cache)
    » so improving bandwidth helps
• Optimize
  - reduce hit time (Amdahl's law: it's the common case)
  - increase cache bandwidth
  - reduce miss penalty
  - reduce miss rate
  - increase overlap

Slide 3: Knee of Curve Problem
• Bigger problem for small caches, e.g. L1
  - latency vs. transfer time

Slide 4: Reduce Hit Time
• Small and simple caches
  - keep the cache hierarchy on chip
    » off-chip access is a 10-100x penalty
  - small, direct-mapped
    » note that L1$ size doesn't change much with technology
    » L2 is where the biggest change occurs
  - associativity is a double-edged sword
• Next slide
  - models based on CACTI
    » common research tool
    » book model is CACTI-IV
      · note this doesn't accurately deal with wire delay
      · current version is 6.5: much better wire models, SPICE back-annotated validation
    » undecided: might be used in HW4

Slide 5: Hit Time Effects (figure)

Slide 6: Hit-time Improvement #2
• Way prediction
  - basics covered last time
    » saves comparator power
    » retains the increased associativity of set-associative caches
      · associativity reduces conflict misses
      · while way prediction gives the hit time of a direct-mapped cache
  - tactics similar to branch prediction
    » indexed by the PC of the load or store
      · keep track of which way hit: very similar to a local/Gselect predictor
      · keep track of the progression: similar to stride prediction
• Practice
  - way prediction is used in both MIPS and Pentium 4 processors
• Prediction accuracy: ~85%

Slide 7: Increase Cache Bandwidth
• Pipelined caches
  - just like the processor
    » pipelining can increase latency
    » fill and spill penalty when things go wrong
  - throughput improves on average
    » note that L1 latency is increasing
      · ~3 cycles today, but a new access can launch every cycle
• Non-blocking caches
  - for when L1+ misses occur
  - MSHRs and dynamic issue logic

Slide 8: Non-blocking Cache Data (figure)

Slide 9: Multi-Banked Caches
• Interleave for improved bandwidth
  - practice
    » Opteron: 2 banks
    » Sun Niagara: 4 banks
  - idea
    » non-conflicting accesses can issue faster than a single bank's latency would allow
    » alternative to hit-under-miss & MSHR solutions
    » banks are smaller, so latency is reduced
• Bigger advantage with multi-core and a shared L2
  - downside
    » needs a faster bus
    » OR multiple buses
    » note: long wires on buses inherently cause problems
      · slower performance (the C effect)
      · high power
      · signal integrity: pre- and post-emphasis (e.g. QPI, HT, dynamic balance)

Slide 10: Reduce Miss Penalty
• Critical word first
  - overlap the transfer with the ability to use the data
  - complicates next-level access
    » not all that bad
    » at the DRAM level
      · the memory controller (MEM_CTL) is in the way anyway
      · the overhead to support it is minor
• Coalescing / combining / merging write buffer
  - writes happen from a register value
  - cache lines are bigger
    » so buffer writes by cache line
    » same unit of transfer
      · cache to cache
      · cache to DRAM

Slide 11: Combining Write Buffer Example (figure)
What are the cost effects?

Slide 12: Reduce Miss Rate
• Compiler optimizations
  - the compiler knows the cache organization
    » arrange code and data to minimize misses
  - loop interchange: improves spatial locality
    » walk the cache line in the inner loop
      · matrix multiply is the canonical example (a C sketch follows the matmul slides below)
  - blocking: improves temporal locality
    » put code into phases
      · do as much as you can on this data frame before moving to the next frame
      · avoids register spill and refill as well as cache misses
    » matrix multiply again (a blocked C sketch follows below)

Slide 13: Matmul Example, ijk (figure)

Slide 14: Loop Interchange to jik (figure)

Slide 15: Loop Interchange to kij (figure)

Slide 16: All Possibilities (figure)

Slide 17: Blocked MatMul Example (figure)

Slide 18: Blocked MatMul Performance (figure)
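Slides 13-16 are figures in the original deck. As a rough illustration of the loop-interchange point (not the lecture's exact code; the function names and signatures are mine), here is a minimal C sketch contrasting the ijk and kij orderings for C = C + A*B on row-major n x n matrices: in kij the inner loop walks B and C with unit stride, so each fetched cache line is fully used, which is the spatial-locality argument from the Reduce Miss Rate slide.

/* Illustrative sketch only: ijk vs. kij orderings for C = C + A*B,
 * with n x n row-major matrices. */
void matmul_ijk(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)      /* inner loop strides down a column of B: poor spatial locality */
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] += sum;
        }
}

void matmul_kij(int n, const double *A, const double *B, double *C)
{
    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++) {
            double a = A[i * n + k];         /* reused across the whole inner loop */
            for (int j = 0; j < n; j++)      /* inner loop walks rows of B and C: unit stride */
                C[i * n + j] += a * B[k * n + j];
        }
}

The jik and other orderings from the All Possibilities slide differ only in which index loop is innermost, and therefore in which matrices are walked with unit stride.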
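And a hedged sketch of the blocking idea behind the Blocked MatMul slides: BS is a hypothetical tile size (an assumption, to be tuned so the working tiles fit in the target cache), and the inner loops finish all work on the current tiles before moving on, which is the "do as much as you can on this data frame" phase behavior.

/* Illustrative blocked (tiled) matmul sketch; BS is a hypothetical block size,
 * chosen so three BS x BS tiles fit in the cache being targeted.
 * C is assumed zero-initialized or pre-loaded; n need not be a multiple of BS. */
#define BS 32

void matmul_blocked(int n, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                /* finish all work on the current A, B, C tiles before moving on */
                for (int i = ii; i < ii + BS && i < n; i++)
                    for (int k = kk; k < kk + BS && k < n; k++) {
                        double a = A[i * n + k];
                        for (int j = jj; j < jj + BS && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}

Timing both versions over a range of n (and of BS) gives the kind of curve the Blocked MatMul Performance slide shows: the unblocked version falls off once the working set spills out of a cache level, while the blocked version degrades much later.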
Slide 19: Others
• Prefetch
  - reduces miss penalty and miss rate
    » if done right
    » adds complexity, power, and screw-up potential
      · discussed last lecture
  - can be done either by HW or SW
• Next-level cache
  - reduces miss penalty
    » in the best case
  - increases miss penalty
    » in the worst case
    » the "swing to miss" principle

Slide 20: Ancillary Caches
• Victim cache (Jouppi)
  - small cache to hold victimized (evicted) lines
  - idea: allows arbitrary associativity for a small number of lines (sketched in C below)
    » total extra associativity = size of the victim cache
  - downside
    » parallel check of the regular and victim caches
    » fully associative
• Trace cache (Weiser, Peleg)
  - Intel Pentium 4
    » expensive: many instruction copies
• Assist cache (HP and somebody you know)
  - 1st touch goes to the assist cache
  - 2nd touch goes to the regular cache
    » makes prefetch less likely to contaminate the cache
  - downside
    » similar to the victim cache

Slide 21: Summary I (figure)

Slide 22: Summary II (figure)

Slide 23: Conclusion
• Cost
  - focus on HW cost
    » the compiler is viewed as free if you're a HW geek
  - low
    » small caches, way-prediction, ...
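As a supplement to the victim-cache bullet on the Ancillary Caches slide, here is a minimal behavioral sketch (hypothetical sizes, names, and data structures; not Jouppi's or the lecture's design): the fully associative victim cache is probed on an L1 miss, a victim hit swaps the line back into L1, and L1 evictions fall into the victim cache, so recently evicted lines get a small amount of extra associativity.

/* Hedged behavioral sketch of a victim cache beside a direct-mapped L1.
 * Models tags only, no data; sizes are illustrative. */
#include <stdbool.h>
#include <stdint.h>

#define L1_SETS     64   /* direct-mapped L1: one line per set (hypothetical size) */
#define VICTIM_WAYS 4    /* small fully associative victim cache (hypothetical size) */

typedef struct { bool valid; uint64_t tag;  } L1Line;      /* tag = line_addr / L1_SETS */
typedef struct { bool valid; uint64_t addr; } VictimLine;  /* full line address */

static L1Line     l1[L1_SETS];
static VictimLine victim[VICTIM_WAYS];
static int        victim_next;   /* FIFO replacement pointer for the victim cache */

/* Returns true on a hit in either structure. */
bool access_line(uint64_t line_addr)
{
    uint64_t set = line_addr % L1_SETS;
    uint64_t tag = line_addr / L1_SETS;

    if (l1[set].valid && l1[set].tag == tag)
        return true;                             /* L1 hit */

    /* L1 miss: probe the victim cache (checked in parallel in hardware) */
    for (int i = 0; i < VICTIM_WAYS; i++) {
        if (victim[i].valid && victim[i].addr == line_addr) {
            /* swap: victim line moves into L1, displaced L1 line becomes the new victim */
            VictimLine displaced = { l1[set].valid, l1[set].tag * L1_SETS + set };
            l1[set] = (L1Line){ true, tag };
            victim[i] = displaced;
            return true;                         /* victim-cache hit */
        }
    }

    /* miss everywhere: fill L1 from the next level; the evicted L1 line falls into the victim cache */
    if (l1[set].valid) {
        victim[victim_next] = (VictimLine){ true, l1[set].tag * L1_SETS + set };
        victim_next = (victim_next + 1) % VICTIM_WAYS;
    }
    l1[set] = (L1Line){ true, tag };
    return false;
}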

