Berkeley COMPSCI 252 - Caches and Memory Systems

This preview shows page 1-2-3 out of 10 pages.

CS252/Kubiatowicz Lec 4.1 1/26/01

CS252 Graduate Computer Architecture
Lecture 4: Caches and Memory Systems
January 26, 2001
Prof. John Kubiatowicz

Review: Who Cares About the Memory Hierarchy?

• Processor Only Thus Far in Course:
  – CPU cost/performance, ISA, Pipelined Execution
• 1980: no cache in µproc; 1995: 2-level cache on chip (1989: first Intel µproc with a cache on chip)

[Figure: "CPU-DRAM Gap". Performance (log scale, 1 to 1000) versus year, 1980 to 2000. µProc ("Moore's Law") improves 60%/yr.; DRAM ("Less' Law?") improves 7%/yr.; the processor-memory performance gap grows 50%/year.]

Review: Cache performance

• Miss-oriented Approach to Memory Access:
  – $CPI_{\text{Execution}}$ includes ALU and Memory instructions

$$\text{CPUtime} = IC \times \left( CPI_{\text{Execution}} + \frac{\text{MemAccess}}{\text{Inst}} \times \text{MissRate} \times \text{MissPenalty} \right) \times \text{CycleTime}$$

$$\text{CPUtime} = IC \times \left( CPI_{\text{Execution}} + \frac{\text{MemMisses}}{\text{Inst}} \times \text{MissPenalty} \right) \times \text{CycleTime}$$

• Separating out Memory component entirely
  – AMAT = Average Memory Access Time
  – $CPI_{\text{AluOps}}$ does not include memory instructions

$$\text{CPUtime} = IC \times \left( \frac{\text{AluOps}}{\text{Inst}} \times CPI_{\text{AluOps}} + \frac{\text{MemAccess}}{\text{Inst}} \times \text{AMAT} \right) \times \text{CycleTime}$$

$$\text{AMAT} = \text{HitTime} + \text{MissRate} \times \text{MissPenalty}$$

$$\text{AMAT} = \left( \text{HitTime}_{\text{Inst}} + \text{MissRate}_{\text{Inst}} \times \text{MissPenalty}_{\text{Inst}} \right) + \left( \text{HitTime}_{\text{Data}} + \text{MissRate}_{\text{Data}} \times \text{MissPenalty}_{\text{Data}} \right)$$

Review: Reducing Misses

• Classifying Misses: 3 Cs
  – Compulsory: misses even in an infinite cache
  – Capacity: misses in a fully associative, size X cache
  – Conflict: misses in an N-way associative, size X cache
• More recent, 4th “C”:
  – Coherence: misses caused by cache coherence.

Review: Miss Rate Reduction

• 3 Cs: Compulsory, Capacity, Conflict
1. Reduce Misses via Larger Block Size
2. Reduce Misses via Higher Associativity
3. Reducing Misses via Victim Cache
4. Reducing Misses via Pseudo-Associativity
5. Reducing Misses by HW Prefetching Instr, Data
6. Reducing Misses by SW Prefetching Data
7. Reducing Misses by Compiler Optimizations
• Prefetching comes in two flavors:
  – Binding prefetch: Requests load directly into register.
    » Must be correct address and register!
  – Non-Binding prefetch: Load into cache.
    » Can be incorrect. Frees HW/SW to guess!

$$\text{AMAT} = \text{HitTime} + \text{MissRate} \times \text{MissPenalty}$$

Improving Cache Performance (Continued)

1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

$$\text{AMAT} = \text{HitTime} + \text{MissRate} \times \text{MissPenalty}$$

What happens on a Cache miss?

• For in-order pipeline, 2 options:
  – Freeze pipeline in Mem stage (popular early on: Sparc, R4000)
      IF ID EX Mem stall stall stall … stall Mem Wr
         IF ID EX  stall stall stall … stall stall Ex Wr
  – Use Full/Empty bits in registers + MSHR queue
    » MSHR = “Miss Status/Handler Registers” (Kroft). Each entry in this queue keeps track of the status of outstanding memory requests to one complete memory line.
      • Per cache-line: keep info about memory address.
      • For each word: register (if any) that is waiting for result.
      • Used to “merge” multiple requests to one memory line
    » New load creates MSHR entry and sets destination register to “Empty”. Load is “released” from pipeline.
    » Attempt to use register before result returns causes instruction to block in decode stage.
    » Limited “out-of-order” execution with respect to loads. Popular with in-order superscalar architectures.
• Out-of-order pipelines already have this functionality built in… (load queues, etc).

Write Policy: Write-Through vs Write-Back

• Write-through: all writes update cache and underlying memory/cache
  – Can always discard cached data: most up-to-date data is in memory
  – Cache control bit: only a valid bit
• Write-back: all writes simply update cache
  – Can’t just discard cached data: may have to write it back to memory
  – Cache control bits: both valid and dirty bits
• Other Advantages:
  – Write-through:
    » memory (or other processors) always have latest data
    » Simpler management of cache
  – Write-back:
    » much lower bandwidth, since data often overwritten multiple times
    » Better tolerance to long-latency memory?

Write Policy 2: Write Allocate vs Non-Allocate (What happens on write-miss)

• Write allocate: allocate new cache line in cache
  – Usually means that you have to do a “read miss” to fill in rest of the cache-line!
  – Alternative: per-word valid bits
• Write non-allocate (or “write-around”):
  – Simply send write data through to underlying memory/cache; don’t allocate new cache line!

1. Reducing Miss Penalty: Read Priority over Write on Miss

[Figure: a write buffer sits between the CPU (in/out) and DRAM (or lower mem).]

• Write-through with write buffers introduces RAW conflicts with main memory reads on cache misses
  – If simply wait for write buffer to empty, might increase read miss penalty (old MIPS 1000 by 50%)
  – Check write buffer contents before read; if no conflicts, let the memory access continue
• Write-back also wants a buffer to hold displaced blocks
  – Read miss replacing dirty block
  – Normal: Write dirty block to memory, and then do the read
  – Instead copy the dirty block to a write buffer, then do the read, and then do the write
  – CPU stalls less since it restarts as soon as the read is done

2. Reduce Miss Penalty: Early Restart and Critical Word First

• Don’t wait for full block to be loaded before restarting CPU
  – Early restart: As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
  – Critical Word First: Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first
• Generally useful only with large blocks
• Spatial locality a problem; tend to want next sequential word, so not clear if benefit by early restart

[Figure: block]

3. Reduce Miss Penalty: Non-blocking Caches to reduce stalls on misses

• Non-blocking cache or lockup-free cache allows data cache to continue to supply cache hits during a miss
  – requires F/E bits on registers or out-of-order execution
  – requires multi-bank memories
• “hit under miss” reduces the effective miss penalty by working during
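The AMAT and miss-oriented CPU time formulas from the cache performance review can be checked numerically. The following is a minimal sketch in Python; the function names and every numeric input are illustrative assumptions, not figures from the lecture.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """AMAT = HitTime + MissRate x MissPenalty (times in cycles)."""
    return hit_time + miss_rate * miss_penalty


def cpu_time_miss_oriented(ic, cpi_execution, mem_access_per_inst,
                           miss_rate, miss_penalty, cycle_time):
    """CPUtime = IC x (CPI_Execution + MemAccess/Inst x MissRate x MissPenalty) x CycleTime."""
    memory_stall_cpi = mem_access_per_inst * miss_rate * miss_penalty
    return ic * (cpi_execution + memory_stall_cpi) * cycle_time


if __name__ == "__main__":
    # Hypothetical machine: 1-cycle hit, 5% miss rate, 50-cycle miss penalty.
    print("AMAT (cycles):", amat(hit_time=1, miss_rate=0.05, miss_penalty=50))
    # Hypothetical workload: 1M instructions, 1.3 memory accesses each, 1 ns cycle.
    print("CPU time (s):", cpu_time_miss_oriented(
        ic=1_000_000, cpi_execution=1.1, mem_access_per_inst=1.3,
        miss_rate=0.05, miss_penalty=50, cycle_time=1e-9))
```

With these assumed inputs the memory stall term (1.3 x 0.05 x 50 = 3.25 cycles per instruction) is larger than the base CPI of 1.1, which is the motivation for the three options listed: reduce the miss rate, the miss penalty, or the hit time.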
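The MSHR bookkeeping described under "What happens on a Cache miss?" can be sketched as a toy model. This is a simplified illustration, not the hardware design from the lecture: the class name `MSHRFile`, word-granular addresses, and the list of waiting registers per word are all assumptions made for the example.

```python
class MSHRFile:
    """Toy model of MSHRs plus full/empty bits on registers (hypothetical)."""

    def __init__(self, words_per_line=4):
        self.words_per_line = words_per_line
        self.entries = {}   # line address -> {word offset: [waiting registers]}
        self.full = {}      # register name -> True (full) or False (empty)

    def load_miss(self, word_addr, dest_reg):
        """Record a missing load; return True if a new memory request is needed."""
        line, offset = divmod(word_addr, self.words_per_line)
        needs_request = line not in self.entries   # otherwise merge into existing entry
        waiting = self.entries.setdefault(line, {})
        waiting.setdefault(offset, []).append(dest_reg)
        self.full[dest_reg] = False                # destination register marked "Empty"
        return needs_request

    def can_issue(self, source_regs):
        """An instruction blocks in decode while any source register is empty."""
        return all(self.full.get(reg, True) for reg in source_regs)

    def fill(self, line, line_data, regfile):
        """Memory returned a line: write the waiting registers and mark them full."""
        for offset, regs in self.entries.pop(line).items():
            for reg in regs:
                regfile[reg] = line_data[offset]
                self.full[reg] = True
```

Two loads to words 8 and 9 fall in the same line under the assumed 4-word lines, so only the first `load_miss` call asks for a memory request; the second is merged, and both registers become usable after a single `fill`.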
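Giving reads priority over buffered writes means a read miss must check the write buffer for a matching address before going to memory. The sketch below assumes one way to resolve a conflict, forwarding the buffered value to the read; the slides only say to check the buffer, so the forwarding choice, the class `WriteBuffer`, and the helper `read_miss` are hypothetical.

```python
class WriteBuffer:
    """Toy write buffer between a write-through cache and memory (hypothetical)."""

    def __init__(self):
        # address -> newest buffered value; Python dicts preserve insertion order
        self.pending = {}

    def write(self, addr, value):
        """CPU write: park the value in the buffer instead of waiting for memory."""
        self.pending[addr] = value

    def drain_one(self, memory):
        """Retire the oldest buffered write to memory when the bus is free."""
        if self.pending:
            addr = next(iter(self.pending))
            memory[addr] = self.pending.pop(addr)


def read_miss(addr, memory, write_buffer):
    """Service a read miss without waiting for the write buffer to empty."""
    if addr in write_buffer.pending:
        # RAW conflict: memory is stale, so forward the buffered value (assumed policy).
        return write_buffer.pending[addr]
    # No conflict: the read goes ahead of the buffered writes.
    return memory[addr]
```

A read to an address with no pending write proceeds immediately, which avoids the increased read miss penalty of draining the whole buffer first.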
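Critical word first (wrapped fetch) changes only the order in which the words of a block are requested. A minimal sketch, assuming the remaining words are fetched sequentially and wrap around to the start of the block; the function name is hypothetical.

```python
def wrapped_fetch_order(words_per_block, critical_word):
    """Return word offsets in fetch order, starting at the missed word and wrapping."""
    return [(critical_word + i) % words_per_block for i in range(words_per_block)]


if __name__ == "__main__":
    # Miss on word 5 of an 8-word block.
    print(wrapped_fetch_order(8, 5))  # prints [5, 6, 7, 0, 1, 2, 3, 4]
```

The CPU can restart once the first offset in this list arrives, while the rest of the block fills in the background.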

