Berkeley COMPSCI 252 - Lecture 4 Cache Design

CS252 Graduate Computer Architecture
Lecture 4: Cache Design
January 31, 2002
Prof. David Culler

Who Cares About the Memory Hierarchy? (The CPU-DRAM Gap)
• 1980: no cache in the microprocessor; 1995: 2-level cache on chip
  (1989: first Intel microprocessor with an on-chip cache)
[Figure: performance vs. year, 1980-2000. CPU performance ("Moore's Law") improves ~60%/yr while DRAM ("Less' Law?") improves ~7%/yr, so the processor-memory performance gap grows ~50%/yr.]

Generations of Microprocessors
• Time of a full cache miss, in instructions executed:
  – 1st Alpha: 340 ns / 5.0 ns =  68 clks x 2 instr/clk, or 136
  – 2nd Alpha: 266 ns / 3.3 ns =  80 clks x 4 instr/clk, or 320
  – 3rd Alpha: 180 ns / 1.7 ns = 108 clks x 6 instr/clk, or 648
• 1/2X latency x 3X clock rate x 3X instr/clock ⇒ ~5X

Processor-Memory Performance Gap "Tax"
  Processor         % Area (≈cost)   % Transistors (≈power)
  Alpha 21164       37%              77%
  StrongARM SA-110  61%              94%
  Pentium Pro       64%              88%
    (2 dies per package: Proc/I$/D$ + L2$)
• Caches have no "inherent value"; they only try to close the performance gap

What is a cache?
• Small, fast storage used to improve the average access time to slow memory.
• Exploits spatial and temporal locality.
• In computer architecture, almost everything is a cache!
  – Registers: "a cache" on variables – software managed
  – First-level cache: a cache on the second-level cache
  – Second-level cache: a cache on memory
  – Memory: a cache on disk (virtual memory)
  – TLB: a cache on the page table
  – Branch prediction: a cache on prediction information?
[Figure: hierarchy Proc/Regs → L1-Cache → L2-Cache → Memory → Disk, Tape, etc.; levels get bigger going down and faster going up.]

Traditional Four Questions for Memory Hierarchy Designers
• Q1: Where can a block be placed in the upper level? (Block placement)
  – Fully Associative, Set Associative, Direct Mapped
• Q2: How is a block found if it is in the upper level? (Block identification)
  – Tag/Block
• Q3: Which block should be replaced on a miss?
  (Block replacement)
  – Random, LRU
• Q4: What happens on a write? (Write strategy)
  – Write Back or Write Through (with Write Buffer)

What are all the aspects of cache organization that impact performance?

Review: Cache Performance
• Miss-oriented approach to memory access
  – CPI_Execution includes ALU and memory instructions:

    CPUtime = IC x (CPI_Execution + MemAccess/Inst x MissRate x MissPenalty) x CycleTime
    CPUtime = IC x (CPI_Execution + Misses/Inst x MissPenalty) x CycleTime

• Separating out the memory component entirely
  – AMAT = Average Memory Access Time
  – CPI_AluOps does not include memory instructions:

    CPUtime = IC x (AluOps/Inst x CPI_AluOps + MemAccess/Inst x AMAT) x CycleTime
    AMAT = HitTime + MissRate x MissPenalty
         = (HitTime_Inst + MissRate_Inst x MissPenalty_Inst)
         + (HitTime_Data + MissRate_Data x MissPenalty_Data)

Impact on Performance
• Suppose a processor executes at
  – Clock Rate = 200 MHz (5 ns per cycle), ideal (no misses) CPI = 1.1
  – 50% arith/logic, 30% ld/st, 20% control
• Suppose that 10% of memory operations get a 50-cycle miss penalty
• Suppose that 1% of instructions get the same miss penalty
• CPI = ideal CPI + average stalls per instruction
      = 1.1 (cycles/ins)
        + [0.30 (DataMops/ins) x 0.10 (miss/DataMop) x 50 (cycles/miss)]
        + [1 (InstMop/ins) x 0.01 (miss/InstMop) x 50 (cycles/miss)]
      = (1.1 + 1.5 + 0.5) cycles/ins = 3.1
• Stalls are 2.0 of the 3.1 cycles/ins, so the processor is stalled waiting for memory about 65% of the time!
• AMAT = (1/1.3) x [1 + 0.01 x 50] + (0.3/1.3) x [1 + 0.1 x 50] = 2.54
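The CPI and AMAT arithmetic above is easy to check mechanically. A minimal Python sketch, using only the parameters stated on the slide:

```python
# Recompute the slide's CPI and AMAT figures from its stated parameters.

ideal_cpi = 1.1        # CPI with no misses
ld_st_frac = 0.30      # fraction of instructions that are loads/stores
data_miss_rate = 0.10  # misses per data memory operation
inst_miss_rate = 0.01  # misses per instruction fetch
miss_penalty = 50      # cycles

# CPI = ideal CPI + average memory stall cycles per instruction
cpi = (ideal_cpi
       + ld_st_frac * data_miss_rate * miss_penalty
       + 1.0 * inst_miss_rate * miss_penalty)
print(f"CPI  = {cpi:.1f}")   # 1.1 + 1.5 + 0.5 = 3.1

# AMAT, weighting the 1.3 memory accesses per instruction
# (1 instruction fetch + 0.3 data accesses)
accesses = 1.0 + ld_st_frac
amat = ((1.0 / accesses) * (1 + inst_miss_rate * miss_penalty)
        + (ld_st_frac / accesses) * (1 + data_miss_rate * miss_penalty))
print(f"AMAT = {amat:.2f}")  # ≈ 2.54 cycles
```

Note how the data-miss term (1.5 cycles/ins) dominates the stall component even though loads/stores are only 30% of instructions.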
Unified vs Split Caches
• Unified vs separate I&D
[Figure: split (Harvard) organization – Proc with I-Cache-1 and D-Cache-1, backed by Unified Cache-2; unified organization – Proc with Unified Cache-1, backed by Unified Cache-2.]
• Example:
  – 16KB I&D: instruction miss rate = 0.64%, data miss rate = 6.47%
  – 32KB unified: aggregate miss rate = 1.99%
• Which is better (ignoring the L2 cache)?
  – Assume 33% data ops ⇒ 75% of accesses are instruction fetches (1.0/1.33)
  – hit time = 1, miss penalty = 50
  – Note that a data hit incurs 1 extra stall in the unified cache (only one port)

  AMAT_Harvard = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50) = 2.05
  AMAT_Unified = 75% x (1 + 1.99% x 50) + 25% x (1 + 1 + 1.99% x 50) = 2.24

How to Improve Cache Performance?
  AMAT = HitTime + MissRate x MissPenalty
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

Where do misses come from?
• Classifying misses: the 3 Cs
  – Compulsory—the first access to a block cannot be in the cache, so the block must be brought into the cache. Also called cold-start misses or first-reference misses. (Misses in even an infinite cache.)
  – Capacity—if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur because blocks are discarded and later retrieved. (Misses in a fully associative cache of size X.)
  – Conflict—if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way associative cache of size X.)
• 4th "C":
  – Coherence—misses caused by cache coherence.

3Cs Absolute Miss Rate (SPEC92)
[Figure: miss rate vs. cache size (1KB-128KB) for 1-way, 2-way, 4-way, and 8-way associativity, with the miss rate broken into compulsory, capacity, and conflict components.]

Cache Size
• Old rule of thumb: 2x size ⇒ 25% cut in miss rate
• What does it reduce?
[Figure: the same miss rate vs. cache size plot, highlighting the capacity and compulsory components.]
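The 3Cs classification above is normally measured by simulation: a miss is compulsory if the block has never been referenced, capacity if a fully associative LRU cache of the same total size would also have missed, and conflict otherwise. A minimal sketch of that methodology for a direct-mapped cache (the trace and cache sizes here are made-up illustrations, not from the lecture):

```python
from collections import OrderedDict

def classify_misses(trace, num_blocks):
    """Classify misses in a direct-mapped cache of `num_blocks` blocks
    against a fully associative LRU cache of the same total size
    (the standard 3Cs methodology). `trace` is a list of block numbers."""
    seen = set()                 # blocks ever referenced (for compulsory)
    full_assoc = OrderedDict()   # fully associative LRU reference model
    direct = {}                  # set index -> resident block
    counts = {"compulsory": 0, "capacity": 0, "conflict": 0, "hit": 0}

    for block in trace:
        index = block % num_blocks

        # Would the fully associative LRU cache of the same size hit?
        fa_hit = block in full_assoc
        if fa_hit:
            full_assoc.move_to_end(block)       # refresh LRU position
        else:
            full_assoc[block] = True
            if len(full_assoc) > num_blocks:
                full_assoc.popitem(last=False)  # evict true LRU block

        # The cache under study: direct mapped.
        if direct.get(index) == block:
            counts["hit"] += 1
        elif block not in seen:
            counts["compulsory"] += 1           # never referenced before
        elif not fa_hit:
            counts["capacity"] += 1             # full associativity misses too
        else:
            counts["conflict"] += 1             # only the mapping is to blame
        direct[index] = block
        seen.add(block)

    return counts

# Blocks 0 and 4 collide in a 4-block direct-mapped cache, so
# ping-ponging between them yields pure conflict misses after warm-up.
print(classify_misses([0, 4, 0, 4, 0, 4], num_blocks=4))
# → {'compulsory': 2, 'capacity': 0, 'conflict': 4, 'hit': 0}
```

This makes the definitions operational: the same trace run with `num_blocks=8` would turn all four conflict misses into hits, since 0 and 4 no longer share a set.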
Huge Caches ⇒ Working Sets
[Figure: miss rate (%) vs. per-processor cache size (1KB-4096KB) for 4-, 8-, 16-, and 32-node configurations; the curve drops at the first working set and again at the second working set. Data traffic decomposes into cold-start (compulsory) traffic, capacity-generated traffic (including conflicts), replication capacity (cache size), inherent communication, and other capacity-independent communication. Example: LU decomposition from the NAS Parallel Benchmarks.]

Cache Organization?
• Assume the total cache size is not changed. What happens if we:
  1) Change the block size?
  2) Change the associativity?
  3) Change the compiler?
• Which of the 3 Cs is obviously affected?

Larger Block Size (fixed size & associativity)
[Figure: miss rate vs. block size (16-256 bytes) for 1K, 4K, 16K, 64K, and 256K caches; larger blocks reduce compulsory misses but eventually increase conflict misses.]
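The block-size trade-off on the last slide shows up even in a toy model: with total cache size fixed, larger blocks exploit spatial locality (fewer cold misses on a sequential scan) but leave fewer sets, so conflict misses can appear. A minimal direct-mapped simulator sketch (the addresses and sizes below are made-up illustrations):

```python
def misses_direct_mapped(trace, cache_bytes, block_bytes):
    """Count misses in a direct-mapped cache of fixed total size."""
    num_blocks = cache_bytes // block_bytes
    lines = {}   # set index -> resident block number
    misses = 0
    for addr in trace:
        block = addr // block_bytes
        index = block % num_blocks
        if lines.get(index) != block:
            misses += 1
            lines[index] = block
    return misses

CACHE = 512  # bytes, total size held fixed while block size varies

# Sequential scan: each miss fetches a whole block of useful neighbours,
# so cold misses fall in proportion to the block size.
seq = list(range(1024))
print([misses_direct_mapped(seq, CACHE, bs) for bs in (16, 32, 64, 128)])
# → [64, 32, 16, 8]

# Ping-pong between bytes 0 and 560 (offsets 0 and 48 in the cache image):
# once the block size exceeds 48, the two addresses land in the same set
# and conflict misses take over.
ping = [0, 560] * 100
print([misses_direct_mapped(ping, CACHE, bs) for bs in (16, 32, 64, 128)])
# → [2, 2, 200, 200]
```

The two traces isolate the two effects the figure superimposes: the first shows the compulsory-miss reduction, the second the conflict-miss growth as the number of sets shrinks.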

