G22.2243-001 High Performance Computer Architecture
Lecture 8: Memory System
March 8, 2006

Outline
• Announcements
– Final Exam: Wednesday, May 3, 5:00 - 6:50pm, 101 WWH
– Lab Assignment 3 out today; due in 3 weeks, on Mar. 29
– Lab Assignment 2 due today
– HW Assignment 3 due today
• Last lecture:
– VLIW architectures
– Memory System
• This lecture: Memory System (Cont'd)
[ Hennessy/Patterson CA:AQA (3rd Edition): Chapter 5 ]

Memory Hierarchy Design (Moving Outside the Processor): Recap

(Review) Cache Organization
• Cache is the name given to the first level of the memory hierarchy, encountered once the address leaves the CPU
– It serves as a temporary place where frequently-used values can be stored
• A cached value retains the same name (its memory address) as in memory (unlike registers)
– This avoids having to go to memory every time the value is needed
• Caches are faster (hence more expensive and limited in size) than DRAM
• Caches store values at the granularity of cache blocks (lines)
– Larger than a single word, for efficiency and spatial locality
– Cache hit if the value is in the cache, else cache miss
• Effect of caches on CPU execution time:
CPU time = (CPU execution clock cycles + Memory stall clock cycles) x Clock cycle time
Memory stall clock cycles = (Reads x Read miss rate x Read miss penalty) + (Writes x Write miss rate x Write miss penalty)
= Memory accesses x Miss rate x Miss penalty

Four Questions for Memory Hierarchy Designers
Q1: Where can a block be placed in the upper level? (Block placement)
– Fully Associative, Set Associative, Direct Mapped
Q2: How is a block found if it is in the upper level? (Block identification)
– Tag per block
Q3: Which block should be replaced on a miss? (Block replacement)
– Random, LRU
Q4: What happens on a write?
(Write strategy)
– Write Back or Write Through (with Write Buffer)

Question 1: Block Placement
• Fully associative: a block can be placed anywhere
• Direct mapped: each block has exactly one place
• Set associative: a block can be placed anywhere within one set
• The range of caches is really a continuum of levels of set associativity
• Most caches today are direct-mapped (1-way), 2-way, or 4-way set associative

Question 2: Block Identification
• Caches have a tag on each block frame that gives the block address
– All tags of the frames where the block may be present are checked in parallel
• A valid bit gives a quick check of whether a block frame contains data
• The organization determines which (subset of) blocks need to be checked
– View the memory address as three fields: Tag | Index (selects the "set") | Block offset (selects the word within the block)
– Larger blocks mean more offset bits; lower associativity means more index bits
– Fully-associative caches: only the tag is checked

Question 3: Block Replacement
• When a new block needs to be brought in (on demand), an existing cache block may need to be freed up
– We only select a victim block within the appropriate "set"
• Three commonly-used schemes:
– Random: easiest to implement
– Least-recently used (LRU)
– First-in, first-out (FIFO): used as an approximation to LRU
• LRU outperforms Random and FIFO on smaller caches
– FIFO outperforms Random
• Differences are not as big for larger caches
– A bigger benefit comes from avoiding misses in the first place

Question 4: Write Strategy
• When is memory updated with the contents of a store?
• Issue: reads dominate cache traffic (writes are typically 10% of accesses)
– Optimization for reads: do tag checking and data transfer in parallel
– Cannot do this for writes (also, only a sub-portion of the block needs updating)
• Two write policies:
– Write through
• Information is written to both the cache and memory
• Simplifies the replacement procedure (the block is always clean)
• Also simplifies data coherency (later in the course)
– Write back
• Information is written only to the cache
• A dirty bit keeps track of which blocks hold data that needs to be synced to memory
• Multiple writes to a block result in fewer writes to memory
• Reduces the memory bandwidth requirement (and hence power)
– Variants: with or without write-allocate (write-allocate is usually used with write back)
• Write stalls in write-through caches are reduced using write buffers

Improving Cache Performance
CPU time = (CPU execution clock cycles + Memory stall clock cycles) x Clock cycle time
Memory stall clock cycles = (Reads x Read miss rate x Read miss penalty) + (Writes x Write miss rate x Write miss penalty)
= Memory accesses x Miss rate x Miss penalty
• The above assumes a 1-cycle cache hit
– Hard to achieve in current-day processors (faster clocks, larger caches)
– More reasonable to also include hit time in the performance equation:
Average memory access time = Hit time + Miss rate x Miss penalty
• The techniques covered fall into four groups (labeled A-D in these slides):
A. Reducing miss penalty: multilevel caches, critical word first, read miss before write miss, merging write buffers, victim caches
B. Reducing miss rate: larger block size, larger cache size, higher associativity, way prediction, compiler optimizations
C. Reducing miss penalty/rate via parallelism: nonblocking caches, hardware prefetching, compiler prefetching
D. Reducing hit time: small/simple caches, avoiding address translation, pipelined cache access, trace caches

A.1. Reducing Miss Penalty via Multilevel Caches
• Idea: have multiple levels of caches
– Tradeoff between size (cache effectiveness) and cost (access time)
• For a 2-level cache:
Average memory access time = Hit time (L1) + Miss rate (L1) x Miss penalty (L1)
Miss penalty (L1) = Hit time (L2) + Miss rate (L2) x Miss penalty (L2)
• Distinguish between two kinds of miss rates:
– Local miss rate = misses in a cache / accesses to that cache = Miss rate (L1) or Miss rate (L2)
– Global miss rate = number of misses / total number of memory accesses
= Miss rate (L1) for the L1 cache, but Miss rate (L1) x Miss rate (L2) for the L2 cache
• Example: 1000 references, 40 misses in the L1 cache and 20 in the L2
– Local miss rates: 4% (L1), 50% (L2) (= 20/40)
– Global miss rates: 4% (L1), 2% (L2)
– Average
memory access time = 1 + 4% x (10 + 50% x 100) = 3.4 cycles
(assuming an L1 hit time of 1 cycle, an L2 hit time of 10 cycles, and an L2 miss penalty of 100 cycles)

Multilevel Caches (cont'd)
• It doesn't make much sense to have an L2 cache smaller than the L1 cache
• L2 needs to be significantly bigger than L1 to have reasonable miss rates
– The cost of a big L2 is smaller than that of a big L1
• Exclusive and cooperative caches
• Typical organization: separate 64 KB L1 caches for instructions and data, backed by a unified L2 cache

A.2. Reducing Miss Penalty via Critical Word First and Early Restart
• Idea: don't wait for the full block to be loaded before restarting the CPU
– Early restart: request the words in a block in order; as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
– Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while the rest of the block is being filled
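The tag/index/offset split described on the Block Identification slide can be sketched in code. This is a minimal sketch assuming an illustrative configuration (32 KB cache, 4-way set associative, 64-byte blocks, 32-bit addresses); these sizes are not from the lecture.

```python
# Split a memory address into tag / index / offset fields, as on the
# "Question 2: Block Identification" slide. The default geometry below
# (32 KB, 4-way, 64-byte blocks) is an illustrative assumption.

def split_address(addr, cache_bytes=32 * 1024, ways=4, block_bytes=64):
    """Return (tag, index, offset) for an address in a set-associative cache."""
    num_sets = cache_bytes // (ways * block_bytes)
    offset_bits = block_bytes.bit_length() - 1       # log2(block size)
    index_bits = num_sets.bit_length() - 1           # log2(number of sets)
    offset = addr & (block_bytes - 1)                # selects byte within the block
    index = (addr >> offset_bits) & (num_sets - 1)   # selects the set
    tag = addr >> (offset_bits + index_bits)         # compared against stored tags
    return tag, index, offset
```

Note how the slide's observations fall out of the arithmetic: larger blocks consume more offset bits, lower associativity means more sets and hence more index bits, and a fully-associative cache (one set) has zero index bits, leaving only the tag.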
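The LRU policy from the Block Replacement slide can be simulated for a single set. The set size and the access sequence used in the usage note are invented for illustration.

```python
# Minimal simulation of least-recently-used (LRU) replacement for one
# cache set, per the "Question 3: Block Replacement" slide. Replacement
# only ever selects a victim within this one set.

from collections import OrderedDict

def lru_hits(accesses, ways=4):
    """Count hits for a sequence of block tags mapping to a single set."""
    frames = OrderedDict()  # ordering tracks recency: oldest entry first
    hits = 0
    for tag in accesses:
        if tag in frames:
            hits += 1
            frames.move_to_end(tag)        # mark as most recently used
        else:
            if len(frames) == ways:
                frames.popitem(last=False)  # evict the least recently used
            frames[tag] = True
    return hits
```

For example, with 4 ways the sequence 1, 2, 3, 4, 1, 5, 1 gives two hits: the fifth access hits on block 1, block 5 then evicts block 2 (the least recently used), and the final access still hits on block 1.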
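The write-back behavior on the Write Strategy slide (stores update only the cache and set a dirty bit; memory sees the data only when a dirty block is evicted) can be illustrated with a toy single-frame model. The class and its names are hypothetical, not from the lecture.

```python
# Toy model of write-back with a dirty bit, per the "Question 4: Write
# Strategy" slide. One block frame; writes_to_memory counts the actual
# memory writes caused by evicting dirty blocks.

class WriteBackFrame:
    def __init__(self):
        self.valid = False
        self.dirty = False
        self.tag = None
        self.writes_to_memory = 0

    def store(self, tag):
        if self.valid and self.tag == tag:
            self.dirty = True               # write hit: cache updated, memory is not
        else:
            if self.valid and self.dirty:
                self.writes_to_memory += 1  # sync the evicted dirty block to memory
            self.valid, self.dirty, self.tag = True, True, tag
```

This makes the slide's point concrete: any number of stores to the same block cost at most one eventual memory write, which is where the bandwidth (and power) savings over write-through come from.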
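The performance equations from the Improving Cache Performance slide can be written as a small calculator. The numeric parameters in the usage note are illustrative assumptions, not values from the lecture.

```python
# The stall-cycle and average-memory-access-time (AMAT) equations from
# the "Improving Cache Performance" slide.

def memory_stall_cycles(accesses, miss_rate, miss_penalty):
    """Memory stall clock cycles = Memory accesses x Miss rate x Miss penalty."""
    return accesses * miss_rate * miss_penalty

def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = Hit time + Miss rate x Miss penalty."""
    return hit_time + miss_rate * miss_penalty
```

For instance, a 1-cycle hit time, a 5% miss rate, and a 100-cycle miss penalty give an AMAT of 6 cycles; the same cache serving 1000 accesses contributes 5000 stall cycles.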
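The multilevel-cache worked example (1000 references, 40 L1 misses, 20 L2 misses) can be checked in code. The L1 hit time (1 cycle), L2 hit time (10 cycles), and L2 miss penalty (100 cycles) are the values implied by the slide's arithmetic rather than stated explicitly.

```python
# Two-level AMAT, per the "A.1. Reducing Miss Penalty via Multilevel
# Caches" slide: the L1 miss penalty is itself an L2 access.

def two_level_amat(hit_l1, miss_rate_l1, hit_l2, miss_rate_l2, penalty_l2):
    miss_penalty_l1 = hit_l2 + miss_rate_l2 * penalty_l2
    return hit_l1 + miss_rate_l1 * miss_penalty_l1

# The slide's example: 1000 references, 40 L1 misses, 20 L2 misses.
local_l1 = 40 / 1000              # 4% local (= global) L1 miss rate
local_l2 = 20 / 40                # 50% local L2 miss rate
global_l2 = local_l1 * local_l2   # 2% global L2 miss rate
```

Plugging in the local miss rates reproduces the slide's result: 1 + 0.04 x (10 + 0.5 x 100) = 3.4 cycles. Note that the local L2 miss rate (50%) looks alarming but is computed over only the 40 accesses that reached L2; the global 2% figure is the fairer measure.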

