CMSC 611: Advanced Computer Architecture
Cache and Memory

Classification of Cache Misses
• Compulsory
  – The first access to a block can never be in the cache. Also called cold-start misses or first-reference misses. (These are misses even in an infinite cache.)
• Capacity
  – If the cache cannot contain all the blocks needed during execution of a program, blocks must be discarded and later retrieved. (These are the misses in a fully associative cache of size X.)
• Conflict
  – If the block-placement strategy is set associative or direct mapped, blocks may be discarded and later retrieved if too many blocks map to the same set. Also called collision misses or interference misses. (These are the additional misses in an N-way associative cache of size X.)

Improving Cache Performance
• Capacity misses can be damaging to performance (excessive main memory accesses)
• Increasing associativity, cache size, and block width can reduce misses
• Changing the cache size affects both capacity and conflict misses, since it spreads references over more blocks
• Some optimization techniques that reduce the miss rate also increase the hit access time

[Figure: miss rate per type (compulsory, capacity, conflict) vs. cache size from 1 to 128 KB, for 1-way through 8-way associativity. Based on SPEC92.]

Miss Rate Distribution
• Compulsory misses are very small compared to the other categories
• Capacity misses diminish with increased cache size
• Increasing associativity limits the potential for placement conflicts

CPU time = IC × (CPI_Execution + (Memory accesses / Instruction) × Miss rate × Miss penalty) × Clock cycle time

Techniques for Reducing Misses
1. Reducing misses via larger block size
2. Reducing misses via higher associativity
3. Reducing misses via a victim cache
4. Reducing misses via pseudo-associativity
5. Reducing misses by H/W pre-fetching of instructions and data
6. Reducing misses by S/W pre-fetching of data
7. Reducing misses by compiler optimizations
Slide: Dave Patterson

Reduce Misses via Larger Block Size
• Larger block sizes reduce compulsory misses (principle of spatial locality)
• Conflict misses increase with larger block sizes, since the cache holds fewer blocks
• The increased miss penalty usually outweighs the decrease in miss rate, making very large block sizes less favored

[Figure: miss rate per type vs. cache size and associativity, as above.]

2:1 Cache Rule: the miss rate of a direct-mapped cache of size N equals the miss rate of a 2-way set-associative cache of size N/2.

Reduce Misses via Higher Associativity
• Greater associativity comes at the expense of a longer hit access time
• Hardware complexity grows with associativity, and the clock cycle lengthens

Example
Assume the hit time is 1 clock cycle and the average miss penalty is 50 clock cycles for a direct-mapped cache, and that the clock cycle increases by a factor of 1.10 for a 2-way, 1.12 for a 4-way, and 1.14 for an 8-way set-associative cache. Comparing average memory access times using the miss rates of the previous figure gives (a worked recomputation follows below):

Average memory access time (clock cycles)
  Cache Size (KB)   1-way   2-way   4-way   8-way
        1            7.65    6.60    6.22    5.44
        2            5.90    4.90    4.62    4.09
        4            4.60    3.95    3.57    3.19
        8            3.30    3.00    2.87    2.59
       16            2.45    2.20    2.12    2.04
       32            2.00    1.80    1.77    1.79
       64            1.70    1.60    1.57    1.59
      128            1.50    1.45    1.42    1.44

• For larger caches, high associativity becomes a negative aspect
• A good-sized direct-mapped cache can be very efficient given its simplicity
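The table gives the resulting access times, but the underlying miss rates sit in a figure that is not part of this excerpt. The minimal C sketch below therefore back-derives illustrative miss rates from the 1 KB and 32 KB rows (using AMAT = hit time × clock-cycle factor + miss rate × 50) and recomputes the comparison; treat those miss-rate values as assumptions, not data from the slides.

    #include <stdio.h>

    /* Worked sketch of the example above:
       AMAT = hit time (1 cycle) * clock-cycle factor + miss rate * miss penalty.
       The miss rates below are illustrative values back-derived from the
       1 KB and 32 KB rows of the table, not figures taken from the slides. */
    int main(void) {
        const double clock_factor[4] = { 1.00, 1.10, 1.12, 1.14 };  /* 1-, 2-, 4-, 8-way */
        const double miss_penalty = 50.0;                           /* clock cycles */
        const char  *size_label[2] = { "1 KB", "32 KB" };
        const double miss_rate[2][4] = {
            { 0.133, 0.110, 0.102, 0.086 },   /* assumed, 1 KB cache  */
            { 0.020, 0.014, 0.013, 0.013 },   /* assumed, 32 KB cache */
        };

        for (int s = 0; s < 2; s++) {
            printf("%5s:", size_label[s]);
            for (int w = 0; w < 4; w++) {
                double amat = 1.0 * clock_factor[w] + miss_rate[s][w] * miss_penalty;
                printf("  %d-way %.2f", 1 << w, amat);
            }
            printf("\n");
        }
        return 0;
    }

At 1 KB every step up in associativity still wins (7.65 down to 5.44 cycles), but at 32 KB the stretched clock makes 8-way (1.79) slightly worse than 4-way (1.77), which is the "high associativity becomes a negative aspect" conclusion above.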
Victim Cache Approach
• Combines the fast hit time of a direct-mapped cache while still avoiding conflict misses
  – Adds a small fully associative cache between the direct-mapped cache and memory to hold blocks discarded from the cache
  – Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of the conflict misses of a 4 KB direct-mapped data cache
  – The technique is used in Alpha and HP machines and does not impair the clock rate

[Figure: CPU with a direct-mapped cache (tag/data arrays and comparator), a small victim cache with its own comparator beside it, a write buffer, and the lower-level memory below.]

Pseudo-Associativity Mechanism
• Combines the fast hit time of a direct-mapped cache with the lower conflict misses of a 2-way set-associative cache
• Divide the cache: on a miss, check the other half of the cache to see if the block is there; if so, it is a pseudo-hit (slow hit)
• The simplest implementation inverts the most significant bit of the index field to find the other pseudo-set
• Pseudo-associative caches therefore have two hit times (hit and pseudo-hit)
• To limit the impact of this hit-time variability on performance, the contents of the two blocks can be swapped

[Figure: access timeline showing hit time < pseudo-hit time < miss penalty.]

• Drawback: the variable hit time (1 or 2 cycles) makes CPU pipelining hard
  – Better for caches not tied directly to the processor (L2)
  – Used in the MIPS R10000 L2 cache; similar in the UltraSPARC
Slide: Dave Patterson

H/W Pre-fetching of Instructions & Data
• The hardware pre-fetches instructions and data while handling other cache misses, on the assumption that the pre-fetched items will be referenced shortly
• Pre-fetching relies on having extra memory bandwidth that can be used without penalty
• Examples of instruction pre-fetching:
  – The Alpha 21064 fetches 2 blocks on a miss
  – The extra block is placed in a "stream buffer"
  – On a miss, the stream buffer is checked
• Works with data blocks too:
  – Jouppi [1990]: 1 data stream buffer caught 25% of the misses of a 4 KB cache; 4 streams caught 43%
  – Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of the misses of two 64 KB, 4-way set-associative caches

Average memory access time (pre-fetch) = Hit time + Miss rate × Pre-fetch hit rate × 1 + Miss rate × (1 − Pre-fetch hit rate) × Miss penalty
(the "1" is the single extra cycle to fetch a block found in the stream buffer)
Slide: Dave Patterson

Software Pre-fetching Data
• Uses special instructions to pre-fetch data:
  – Load data into a register (HP PA-RISC loads)
  – Cache pre-fetch: load into the cache (MIPS IV, PowerPC, SPARC v9)
• Special pre-fetching instructions cannot cause faults (undesired exceptions), since pre-fetching is a form of speculative execution
• Makes sense if the processor can proceed without blocking on a cache access (a lockup-free, i.e., non-blocking, cache)
• Loops are the typical target for pre-fetching, after unrolling (when the miss penalty is small) or after applying software pipelining (when the miss penalty is large)
• Issuing pre-fetch instructions takes time
  – Is the cost of issuing the pre-fetches < the savings in reduced misses?
  – Wider superscalar issue reduces the difficulty of finding issue bandwidth for them

Example: insert pre-fetches into the loop nest below, using a pre-fetch distance of 7 iterations so the data arrives before it is needed:

    for (i = 0; i < 3; i = i+1)
        for (j = 0; j < 100; j = j+1)
            a[i][j] = b[j][0] * b[j+1][0];

becomes

    for (j = 0; j < 100; j = j+1) {
        prefetch(b[j+7][0]);   /* b element needed 7 iterations later */
        prefetch(a[0][j+7]);   /* a element needed 7 iterations later */
        a[0][j] = b[j][0] * b[j+1][0];
    }
    for (i = 1; i < 3; i = i+1)
        for (j = 0; j < 100; j = j+1) {
            prefetch(a[i][j+7]);
            a[i][j] = b[j][0] * b[j+1][0];
        }

Compiler-based Cache Optimizations
• Compiler-based cache optimization reduces the miss rate without any hardware changes (a classic example is sketched below)
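The excerpt breaks off at this point. As an illustration of the kind of transformation meant here, below is a minimal C sketch of loop interchange, one of the classic compiler-based cache optimizations; the array name x and the 5000×100 dimensions are illustrative, following the well-known textbook example, not taken from these slides.

    #include <stdio.h>

    #define ROWS 5000
    #define COLS 100

    static int x[ROWS][COLS];   /* illustrative row-major array */

    int main(void) {
        /* Before: the inner loop walks down a column, so consecutive
           accesses are COLS ints apart and touch a different cache
           block almost every time. */
        for (int j = 0; j < COLS; j = j+1)
            for (int i = 0; i < ROWS; i = i+1)
                x[i][j] = 2 * x[i][j];

        /* After loop interchange: the inner loop walks along a row,
           so every word of a fetched block is used before the block
           is evicted (spatial locality), cutting the miss rate with
           no hardware change. */
        for (int i = 0; i < ROWS; i = i+1)
            for (int j = 0; j < COLS; j = j+1)
                x[i][j] = 2 * x[i][j];

        printf("%d\n", x[0][0]);   /* keep the result observable */
        return 0;
    }

The two loop nests compute the same values; only the order of memory accesses changes, which is exactly why the compiler can apply the transformation without hardware support.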

