Lecture 12: Caches I

Review: Genetic Programming for Design
Review: Who Cares About the Memory Hierarchy?
Processor-Memory Performance Gap “Tax”
What is a cache?
Generations of Microprocessors
What happens on a Cache miss?
Review: Four Questions for Memory Hierarchy Designers
Review: Cache Performance
Slide 10
Review: Improving Cache Performance
Reducing Misses
3Cs Absolute Miss Rate (SPEC92)
2:1 Cache Rule
3Cs Relative Miss Rate
How Can We Reduce Misses?
1. Reduce Misses via Larger Block Size
2. Reduce Misses via Higher Associativity
Example: Avg. Memory Access Time vs. Miss Rate
3. Reducing Misses via a “Victim Cache”
4. Reducing Misses via “Pseudo-Associativity”
CS 252 Administrivia
5. Reducing Misses by Hardware Prefetching of Instructions & Data
6. Reducing Misses by Software Prefetching Data
7. Reducing Misses by Compiler Optimizations
Merging Arrays Example
Loop Interchange Example
Loop Fusion Example
Blocking Example
Slide 30
Reducing Conflict Misses by Blocking
Summary of Compiler Optimizations to Reduce Cache Misses (by hand)
Summary
Slide 34
1. Reducing Miss Penalty: Read Priority over Write on Miss
2. Reduce Miss Penalty: Subblock Placement
3. Reduce Miss Penalty: Early Restart and Critical Word First
4. Reduce Miss Penalty: Non-blocking Caches to reduce stalls on misses
Value of Hit Under Miss for SPEC
5th Miss Penalty
Comparing Local and Global Miss Rates
Reducing Misses: Which apply to L2 Cache?
L2 cache block size & A.M.A.T.
Reducing Miss Penalty Summary
What is the Impact of What You’ve Learned About Caches?
Cache Optimization Summary

JDK.F98 Slide 1
Lecture 12: Caches I
Prof. John Kubiatowicz
Computer Science 252
Fall 1998

Slide 2
Review: Genetic Programming for Design
• Genetic programming has two key aspects:
  – An encoding of the design space.
    » This is a symbolic representation of the result space (genome).
    » Much of the domain-specific knowledge and “art” is involved here.
  – A reproduction strategy
    » Includes a method for generating offspring from parents
      Mutation: changing random portions of an individual
      Crossover: merging aspects of two individuals
    » Includes a method for evaluating the effectiveness (“fitness”) of individual solutions.
• Generation of new branch predictors via genetic programming:
  – Everything derived from a “basic” predictor (table) + simple operators.
  – Expressions arranged in a tree
  – Mutation: random modification of node/replacement of subtree
  – Crossover: swapping the subtrees of two parents.

Slide 3
Review: Who Cares About the Memory Hierarchy?
• Processor Only Thus Far in Course:
  – CPU cost/performance, ISA, Pipelined Execution
[Figure: CPU-DRAM Gap. Performance (log scale, 1 to 1000) vs. year (1980 to 2000). CPU (“Moore’s Law”): µProc 60%/yr. DRAM: 7%/yr. Processor-Memory Performance Gap grows 50% / year.]
• 1980: no cache in µproc; 1995: 2-level cache on chip (1989: first Intel µproc with a cache on chip)

Slide 4
Processor-Memory Performance Gap “Tax”
  Processor            % Area (cost)    % Transistors (power)
• Alpha 21164              37%               77%
• StrongArm SA110          61%               94%
• Pentium Pro              64%               88%
  – 2 dies per package: Proc/I$/D$ + L2$
• Caches have no inherent value; they only try to close the performance gap

Slide 5
What is a cache?
• Small, fast storage used to improve average access time to slow memory.
• Exploits spatial and temporal locality
• In computer architecture, almost everything is a cache!
  – Registers: a cache on variables
  – First-level cache: a cache on second-level cache
  – Second-level cache: a cache on memory
  – Memory: a cache on disk (virtual memory)
  – TLB: a cache on page table
  – Branch-prediction: a cache on prediction information?
[Figure: memory hierarchy, top to bottom: Proc/Regs, L1-Cache, L2-Cache, Memory, Disk, Tape, etc. Arrows labeled “Bigger” (toward disk) and “Faster” (toward the processor).]

Slide 6
Generations of Microprocessors
• Time of a full cache miss in instructions executed:
  1st Alpha (7000):   340 ns/5.0 ns =  68 clks x 2 or 136
  2nd Alpha (8400):   266 ns/3.3 ns =  80 clks x 4 or 320
  3rd Alpha (t.b.d.): 180 ns/1.7 ns = 108 clks x 6 or 648
• 1/2X latency x 3X clock rate x 3X Instr/clock ⇒ ≈5X

Slide 7
What happens on a Cache miss?
• For in-order pipeline, 2 options:
  – Freeze pipeline in Mem stage (popular early on: Sparc, R4000)
      IF ID EX Mem stall stall stall … stall Mem Wr
         IF ID EX  stall stall stall … stall stall Ex Wr
  – Use Full/Empty bits in registers + MSHR queue
    » MSHR = “Miss Status/Handler Registers” (Kroft). Each entry in this queue keeps track of the status of outstanding memory requests to one complete memory line.
      • Per cache-line: keep info about memory address.
      • For each word: register (if any) that is waiting for result.
      • Used to “merge” multiple requests to one memory line
    » New load creates MSHR entry and sets destination register to “Empty”. Load is “released” from pipeline.
    » Attempt to use register before result returns causes instruction to block in decode stage.
    » Limited “out-of-order” execution with respect to loads. Popular with in-order superscalar architectures.
• Out-of-order pipelines already have this functionality built in… (load queues, etc).

Slide 8
Review: Four Questions for Memory Hierarchy Designers
• Q1: Where can a block be placed in the upper level? (Block placement)
  – Fully Associative, Set Associative, Direct Mapped
• Q2: How is a block found if it is in the upper level? (Block identification)
  – Tag/Block
• Q3: Which block should be replaced on a miss? (Block replacement)
  – Random, LRU
• Q4: What happens on a write? (Write strategy)
  – Write Back or Write Through (with Write Buffer)

Slide 9
Review: Cache Performance
CPU time = (CPU execution clock cycles + Memory stall clock cycles) x Clock cycle time
Memory stall clock cycles = (Reads x Read miss rate x Read miss penalty + Writes x Write miss rate x Write miss penalty)
Memory stall clock cycles = Memory accesses x Miss rate x Miss penalty
Note: memory hit time is included in execution cycles.

Slide 10
Review: Cache Performance
CPU time = Instruction Count x (CPI_execution + Mem accesses/inst x Miss rate x Miss penalty) x Clock cycle time
Misses per instruction = Memory accesses/inst x Miss rate
CPU time = IC x (CPI_execution + Misses per instruction x Miss penalty) x Clock cycle time

Slide 11
Review: Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

Slide 12
Reducing Misses
• Classifying Misses: 3 Cs
  – Compulsory: The first access to a block is not in the cache, so the block must be brought
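As a rough illustration of the “Review: Cache Performance” equations (Slides 9 and 10) and the miss-cost arithmetic from “Generations of Microprocessors” (Slide 6), here is a minimal Python sketch. The function and parameter names are hypothetical, chosen only for this example, and the workload numbers in the usage section are illustrative assumptions rather than figures from the lecture. Only the 1st Alpha values (340 ns, 5.0 ns, 2 instr/clock) come from the slides.

```python
def memory_stall_cycles(mem_accesses, miss_rate, miss_penalty):
    """Memory stall clock cycles = Memory accesses x Miss rate x Miss penalty."""
    return mem_accesses * miss_rate * miss_penalty


def cpu_time(instr_count, cpi_execution, accesses_per_instr,
             miss_rate, miss_penalty, clock_cycle_time):
    """CPU time = IC x (CPI_execution + Misses per instruction x Miss penalty)
    x Clock cycle time. Hit time is assumed to be part of cpi_execution."""
    misses_per_instr = accesses_per_instr * miss_rate
    cpi_total = cpi_execution + misses_per_instr * miss_penalty
    return instr_count * cpi_total * clock_cycle_time


def miss_cost_in_instructions(miss_latency_ns, cycle_time_ns, instrs_per_clock):
    """Instruction issue slots lost to one full cache miss:
    (miss latency / cycle time) clocks x instructions per clock."""
    clocks = miss_latency_ns / cycle_time_ns
    return clocks * instrs_per_clock


if __name__ == "__main__":
    # Hypothetical workload (assumed values, not from the lecture):
    # 1M instructions, base CPI 1.0, 1.5 memory accesses per instruction,
    # 2% miss rate, 50-cycle miss penalty, 2 ns clock cycle.
    t = cpu_time(instr_count=1_000_000, cpi_execution=1.0,
                 accesses_per_instr=1.5, miss_rate=0.02,
                 miss_penalty=50, clock_cycle_time=2e-9)
    print(f"CPU time: {t * 1e3:.2f} ms")

    # 1st Alpha (7000) row from the slide: 340 ns miss, 5.0 ns clock, 2 instr/clock.
    print("Miss cost in instructions:", miss_cost_in_instructions(340, 5.0, 2))
```

With these assumed inputs, the miss term adds 1.5 cycles per instruction to a base CPI of 1.0, which is the sense in which the slides treat miss rate and miss penalty as the two levers to pull.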