U of U CS 7810 - Lecture 11 - Large Cache Design IV

Slide 1: Lecture 11: Large Cache Design IV
• Topics: prefetch, dead blocks, cache networks

Slide 2: Temporal Memory Streaming (Wenisch et al., ISCA'05)
• When a thread incurs a series of misses to blocks (P, Q, R, S; not necessarily contiguous), other threads are likely to incur a similar series of misses
• Each thread maintains its miss log in a circular buffer in memory; the directory entry for P also keeps pointers to multiple log entries for P
• When a thread misses on P, it contacts the directory; the directory provides the log pointers, and the thread receives multiple streams and starts prefetching (a sketch follows the slides)
• Log accesses and prefetches are off the critical path

Slide 3: Spatial Memory Streaming (Somogyi et al., ISCA'06)
• Threads often enter a new region (page) and touch a few arbitrary blocks in that region
• A predictor is indexed with the PC and block offset of the first access to the region; it returns a bit vector indicating which blocks were accessed within that region (sketch below)
• Can even prefetch for regions that have never been touched before!

Slide 4: Feedback Directed Prefetching (Srinath et al., HPCA'07)
• A stream prefetcher has two parameters:
  P, the prefetch distance: how far ahead of the stream head we prefetch
  N, the prefetch degree: how much we advance the head when there is a hit in the stream
• Both parameters can be varied based on prefetch effectiveness (sketch below)
• Accuracy: a bit tracks whether each prefetched block was touched
• Timeliness: was the block touched while it was still in the MSHR?
• Pollution: track recent evictions (in a Bloom filter) and see whether they are re-touched; this also guides the insertion policy

Slide 5: Dead Block Prediction
• Keep track of the number of accesses to a line during its previous residence; the block is deemed dead after that many accesses (Kharbutli and Solihin, IEEE TOC'08; sketch below)
• To reduce noise, an access can instead be counted as a block's move to the MRU position (Liu et al., MICRO 2008)
• Earlier DBPs used a trace of PCs to capture when a block has completed its use
• DBP is used for energy savings, replacement policies, and cache bypassing

Slide 6: Distill Cache (Qureshi, HPCA 2007)
• Half the ways are traditional (a line-organized cache, LOC); when a block is evicted from the LOC, only the touched words are stored in a word-organized cache (WOC) that has many narrow ways (sketch below)
• Incurs a fair bit of complexity (more tags for the WOC, collection of word-touch bits in the L1s, blocks with holes, etc.)
• Does not need a predictor; actions are based on the block's behavior during its current residence
• Useless-word identification is orthogonal to cache compression

Slide 7: Traditional Networks (Huh et al., ICS'05; Beckmann, MICRO'04)
(Figure: example designs for contiguous L2 cache regions)

Slide 8: Explorations for Optimality (Muralimanohar et al., ISCA'07)
(Figure)

Slide 9: 3D Designs (Li et al., ISCA'06)
• D-NUCA: first search within the cylinder, then multicast the search everywhere
• Data is migrated close to the requester, but need not jump across layers

Slide 10: Halo Network (Jin et al., HPCA'07)
• D-NUCA: sets are distributed across columns; ways are distributed across rows

Slide 11: Halo Network
(Figure)

Slide 12: Nahalal (Guz et al., CAL'07)
(Figure)

Slide 13: Nahalal
• A block is initially placed in the requesting core's private bank and is swapped into the shared bank if it is frequently accessed by other cores (sketch below)
• Searches proceed in parallel across all banks
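To make the Temporal Memory Streaming mechanism on slide 2 concrete, here is a minimal Python sketch of the per-thread circular miss log and the directory-side log pointers. The names (CircularMissLog, Directory, on_miss) and the sizes are illustrative assumptions, not structures from the paper; real hardware keeps the logs in memory-backed buffers and performs the log accesses and prefetches off the critical path.

```python
# Minimal sketch of Temporal Memory Streaming bookkeeping.
# All names and sizes are hypothetical, for illustration only.

from collections import defaultdict

LOG_SIZE = 1024        # entries per thread's circular miss log (assumed)
STREAM_LENGTH = 4      # blocks prefetched per retrieved stream (assumed)

class CircularMissLog:
    """Per-thread circular buffer recording the sequence of miss addresses."""
    def __init__(self, size=LOG_SIZE):
        self.buf = [None] * size
        self.head = 0

    def append(self, block):
        idx = self.head
        self.buf[idx] = block
        self.head = (self.head + 1) % len(self.buf)
        return idx                     # log pointer recorded at the directory

    def stream_from(self, idx, length=STREAM_LENGTH):
        """Blocks logged after position idx: the predicted stream."""
        out = []
        for i in range(1, length + 1):
            blk = self.buf[(idx + i) % len(self.buf)]
            if blk is not None:
                out.append(blk)
        return out

class Directory:
    """The directory entry for block P also tracks pointers into miss logs."""
    def __init__(self):
        self.log_ptrs = defaultdict(list)   # block -> [(thread id, log index)]

    def record_miss(self, block, tid, log_idx):
        self.log_ptrs[block].append((tid, log_idx))

    def lookup(self, block):
        return self.log_ptrs[block]

def on_miss(tid, block, logs, directory, prefetch):
    """Log the miss, then prefetch the streams other threads saw after it."""
    idx = logs[tid].append(block)
    directory.record_miss(block, tid, idx)
    for other_tid, other_idx in directory.lookup(block):
        if other_tid != tid:
            for blk in logs[other_tid].stream_from(other_idx):
                prefetch(blk)          # off the critical path in the real design
```

The design point this captures is that the directory entry doubles as a rendezvous: any thread missing on P can discover what other threads accessed immediately after their own misses on P, and replay those streams as prefetches.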
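The Spatial Memory Streaming predictor on slide 3 can be sketched as a table keyed by the (PC, offset) of a region's first access, yielding a bit vector of blocks to prefetch. The structure below is a simplified assumption; the actual design separates training from prediction and is more careful about table sizing.

```python
# Sketch of a Spatial Memory Streaming predictor. Sizes and the
# single-table organization are simplifying assumptions.

BLOCKS_PER_REGION = 64    # e.g., a 4 KB region of 64 B blocks (assumed)

class SMSPredictor:
    def __init__(self):
        # (trigger PC, trigger block offset) -> bit vector of touched blocks
        self.table = {}

    def train(self, pc, offset, touched_bitvec):
        """Record which blocks a region saw, keyed by its first access."""
        self.table[(pc, offset)] = touched_bitvec

    def predict(self, pc, offset):
        """On the first access to a new region, return blocks to prefetch.
        Works even for never-touched regions, because the key is the code
        location (PC, offset), not the region's address."""
        return self.table.get((pc, offset), 0)

def blocks_to_prefetch(bitvec, region_base, block_size=64):
    """Expand the predicted bit vector into block addresses."""
    return [region_base + i * block_size
            for i in range(BLOCKS_PER_REGION)
            if (bitvec >> i) & 1]
```

Keying on the code location rather than the data address is what enables the last bullet on slide 3: a brand-new region can be prefetched as long as the same instruction has opened similar regions before.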
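Slide 4's feedback loop reduces to: measure accuracy, lateness, and pollution over an interval, then scale the prefetch distance P and degree N up or down. The thresholds and bounds in this sketch are illustrative assumptions, not the values from Srinath et al.

```python
# Sketch of feedback-directed adjustment of a stream prefetcher's
# distance and degree. Thresholds and limits are made up for illustration.

class FeedbackStreamPrefetcher:
    def __init__(self):
        self.distance = 16    # P: how far ahead of the stream head we prefetch
        self.degree = 2       # N: how much the head advances on a stream hit
        self.issued = 0       # prefetches issued this interval
        self.useful = 0       # prefetched blocks later touched (accuracy bit)
        self.late = 0         # blocks touched while still in the MSHR
        self.polluting = 0    # evicted blocks re-touched (Bloom-filter hit)

    def end_of_interval(self):
        acc = self.useful / self.issued if self.issued else 0.0
        if acc > 0.75 and self.late > self.issued // 4:
            # Accurate but late: prefetch further ahead, more aggressively.
            self.distance = min(self.distance * 2, 64)
            self.degree = min(self.degree + 1, 4)
        elif acc < 0.40 or self.polluting > self.issued // 4:
            # Inaccurate or polluting: throttle back.
            self.distance = max(self.distance // 2, 4)
            self.degree = max(self.degree - 1, 1)
        # Reset the counters for the next interval.
        self.issued = self.useful = self.late = self.polluting = 0
```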
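A minimal sketch of the access-count flavor of dead block prediction from slide 5: remember how many times a block was touched during its previous residence, and predict it dead once the count repeats. Real designs index a small table hashed by PC and/or address rather than tracking every block, so treat the per-block dictionaries below as an idealization.

```python
# Sketch of counting-based dead block prediction. Per-block dictionaries
# idealize the small hashed tables a real implementation would use.

class DeadBlockPredictor:
    def __init__(self):
        self.threshold = {}    # block -> access count in its last residence
        self.live_count = {}   # block -> accesses so far this residence

    def on_access(self, block):
        self.live_count[block] = self.live_count.get(block, 0) + 1

    def is_dead(self, block):
        """Predicted dead (candidate for early eviction or bypass) once the
        access count matches the previous residence's count."""
        t = self.threshold.get(block)
        return t is not None and self.live_count.get(block, 0) >= t

    def on_evict(self, block):
        """Learn this residence's count as the prediction for the next one."""
        self.threshold[block] = self.live_count.pop(block, 0)
```

The MRU-move variant on slide 5 plugs into the same structure: call on_access only when the block moves into the MRU position, which filters out bursts of back-to-back touches.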
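The distill-cache eviction path on slide 6 amounts to: track per-word touch bits while a line lives in the line-organized cache (LOC), and on eviction copy only the touched words into the word-organized cache (WOC). The sketch below is an idealized model with hypothetical names; it omits the WOC's extra tags, its replacement policy, and the L1-side collection of touch bits.

```python
# Idealized sketch of the distill cache's LOC -> WOC eviction path.
# Structures and names are illustrative, not the paper's design.

WORDS_PER_LINE = 8    # 64 B line of 8 B words (assumed)

class DistillCache:
    def __init__(self):
        self.loc = {}    # line addr -> (list of words, touched-bit vector)
        self.woc = {}    # (line addr, word index) -> word

    def loc_fill(self, line, words):
        self.loc[line] = (words, 0)          # no words touched yet

    def loc_access(self, line, word_idx):
        """Read a word from a resident line and mark it as touched."""
        words, touched = self.loc[line]
        self.loc[line] = (words, touched | (1 << word_idx))
        return words[word_idx]

    def loc_evict(self, line):
        """On eviction, distill the line: keep only the used words."""
        words, touched = self.loc.pop(line)
        for i in range(WORDS_PER_LINE):
            if (touched >> i) & 1:
                self.woc[(line, i)] = words[i]
```

Note how this matches the "no predictor" bullet: the decision about which words survive is driven entirely by the touch bits gathered during the current residence.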
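Nahalal's placement policy (slide 13) can be approximated with a counter per block: the first requester's bank is the initial home, and enough accesses by other cores promote the block into the central shared bank. The promotion threshold and the counter below are assumptions for illustration; the paper's actual migration and eviction handling are more involved.

```python
# Sketch of Nahalal-style placement: private-bank first, promote to the
# shared bank when other cores access the block often enough.
# The threshold and counter are illustrative assumptions.

SHARED = -1               # sentinel for the central shared bank
PROMOTE_THRESHOLD = 4     # foreign accesses before promotion (assumed)

class NahalalPlacement:
    def __init__(self):
        self.home = {}           # block -> owning core's bank, or SHARED
        self.foreign_hits = {}   # block -> accesses by non-owner cores

    def on_access(self, block, core):
        bank = self.home.setdefault(block, core)   # first toucher owns it
        if bank != SHARED and bank != core:
            n = self.foreign_hits.get(block, 0) + 1
            self.foreign_hits[block] = n
            if n >= PROMOTE_THRESHOLD:
                self.home[block] = SHARED          # swap into the shared bank
        return self.home[block]
```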

