Dead-Block Prediction & Dead-Block Correlating Prefetchers

An-Chow Lai
Electrical & Computer Engineering, Purdue University, West Lafayette, IN 47907
laia@ecn.purdue.edu

Cem Fide
Sun Microsystems, 901 San Antonio Rd., Palo Alto, CA 94303
cem.fide@eng.sun.com

Babak Falsafi
Electrical & Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213
babak@ece.cmu.edu
http://www.ece.cmu.edu/~impetus

1063-6897/01 $10.00 © 2001 IEEE

Abstract

Effective data prefetching requires accurate mechanisms to predict both which cache blocks to prefetch and when to prefetch them. This paper proposes Dead-Block Predictors (DBPs), trace-based predictors that accurately identify when an L1 data cache block becomes evictable, or dead. Predicting a dead block significantly enhances prefetching lookahead and opportunity, and enables placing data directly into L1, obviating the need for auxiliary prefetch buffers. This paper also proposes Dead-Block Correlating Prefetchers (DBCPs), which use address correlation to predict which subsequent block to prefetch when a block becomes evictable. A DBCP enables effective data prefetching in a wide spectrum of pointer-intensive, integer, and floating-point applications. We use cycle-accurate simulation of an out-of-order superscalar processor and memory-intensive benchmarks to show that: (1) dead-block prediction enhances prefetching lookahead by at least an order of magnitude compared to previous techniques, (2) a DBP can predict dead blocks with an average coverage of 90%, mispredicting only 4% of the time, (3) a DBCP offers an address prediction coverage of 86%, mispredicting only 3% of the time, and (4) DBCPs improve performance by 62% on average and 282% at best in the benchmarks we studied.

1 Introduction

Increasing processor clock speeds, along with microarchitectural innovation, have led to a tremendous gap between processor and memory performance. Architects have primarily relied on deeper cache hierarchies, where each level trades off faster lookup speed for larger capacity, to reduce this performance gap. Conventional cache hierarchies employ a demand-fetch memory access model, in which data are fetched into higher levels upon processor requests. Unfortunately, the limited capacity in higher cache levels and the simple data placement mechanisms used in conventional hierarchies often result in high miss rates and reduce performance. While superscalar engines with non-blocking caches [19] allow overlapping the miss latency among the higher cache levels, limited available instruction-level parallelism and long access latencies to lower cache levels often expose the miss latency in many important classes of applications.

Many architects have additionally relied on the prefetch memory access model to mitigate the shortcomings of the demand-fetch model. Prefetching helps fetch data in advance to hide the memory latency by predicting future memory requests. While prefetching can be initiated in either hardware [17,5,10,3,15,4,6] or software [9,8,14,12], many researchers and vendors opt for hardware implementations for transparency and because of the availability of runtime information, which can significantly improve prefetching's effectiveness.

Most previous proposals for hardware prefetchers target specific memory access patterns, such as strided accesses [15,4,6] and accesses to linked data structures [17]. While effective for the targeted access patterns, these prefetchers have limited general applicability across a wide spectrum of applications.

There are a number of prefetcher proposals in the literature that target generalized memory access patterns [5,3], including strided accesses and indirect accesses to linked data structures and arrays. These proposals primarily rely on miss address correlation [1] as a technique to predict and prefetch memory addresses. These prefetchers, which we refer to as Miss Correlating Prefetchers (MCPs), record a history of prior L1 cache miss addresses and correlate the history to a subsequent miss to trigger a prefetch.

Unfortunately, MCPs suffer from several key shortcomings. First, L1 cache misses are often clustered, especially in out-of-order engines with high-bandwidth L1 caches, significantly limiting the lookahead and opportunity for timely prefetching. Second, rather than predicting block evictability, these prefetchers place the prefetched data in small associative buffers and look them up either in parallel with L1, thereby increasing L1's critical access path, or upon an L1 miss, thereby increasing the prefetch hit latency. Finally, miss address correlation has not been shown to offer both high prediction accuracy (i.e., correct predictions as a fraction of all predictions) and high coverage (i.e., correct predictions as a fraction of all misses) [5].

This paper proposes the Dead-Block Predictors (DBPs) and the Dead-Block Correlating Prefetchers (DBCPs). A DBP is a novel hardware mechanism that predicts when a block in a data cache becomes evictable. In a recent paper [7], we proposed trace-based predictors that record a trace of shared memory references to predict a last reference to a cache block prior to an invalidation in a multiprocessor. Similarly, a DBP records a trace of memory references that accurately predicts the last reference to a block in an L1 data cache prior to the block's eviction. A DBCP uses address correlation in conjunction with dead-block traces to predict a subsequent address upon a dead-block prediction. Accurate prediction of a block's evictability enables timely prefetching of data directly into an L1 data cache.

We use a cycle-accurate simulation of an aggressive out-of-order superscalar processor and a spectrum of memory-intensive benchmarks to show the following:

[...]

[FIGURE 1. A Miss Correlating Prefetcher. Elements shown: a dynamic stream of memory references (ld/st to blocks A1, B1, C1, C2), the correlation table, and the parallel lookup of prefetched data.]

[...] fully fetch and place data prior to a processor reference: (1) an accurate memory address predictor to predict which data to prefetch, and (2) an accurate predictor of when to prefetch the data. MCPs rely on correlating cache miss addresses to predict both which data to prefetch and when to prefetch it.

Figure 1 depicts the anatomy of an MCP. An MCP uses a miss address predictor and a prefetch buffer. Much like two-level branch predictors, the miss address predictor consists of two storage levels. A history register maintains an encoding of the most recent miss addresses. A correlation table, organized as a [...]
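The two-level structure of an MCP's miss address predictor can be illustrated with a small Python sketch. This is only a toy model under stated assumptions, not the hardware organization from the paper: the class name `MissCorrelatingPrefetcher`, the `history_len` parameter, and the use of a plain dictionary as the correlation table are all hypothetical, and the cache, the prefetch buffer, and timing are not modeled. It consumes a trace of miss addresses and reports accuracy (correct predictions over all predictions) and coverage (correct predictions over all misses) as defined in the introduction.

```python
from collections import deque


class MissCorrelatingPrefetcher:
    """Toy two-level miss address predictor (illustrative assumption only)."""

    def __init__(self, history_len=2):
        # First level: history register holding the most recent miss addresses.
        self.history = deque(maxlen=history_len)
        # Second level: correlation table mapping a history to the miss that
        # followed it last time (a dict here; real tables are finite).
        self.table = {}
        self.pending = None   # address predicted at the previous miss, if any
        self.misses = 0
        self.predictions = 0  # predictions that have been checked
        self.correct = 0

    def on_miss(self, addr):
        """Record one L1 miss; return the address to prefetch, or None."""
        self.misses += 1
        # Score the prediction made at the previous miss.
        if self.pending is not None:
            self.predictions += 1
            if self.pending == addr:
                self.correct += 1
        # Train: the current (full) history was followed by this miss.
        if len(self.history) == self.history.maxlen:
            self.table[tuple(self.history)] = addr
        # Shift the new miss into the history register and look up a successor.
        self.history.append(addr)
        self.pending = self.table.get(tuple(self.history))
        return self.pending

    def accuracy(self):
        # Correct predictions as a fraction of all predictions.
        return self.correct / self.predictions if self.predictions else 0.0

    def coverage(self):
        # Correct predictions as a fraction of all misses.
        return self.correct / self.misses if self.misses else 0.0


if __name__ == "__main__":
    mcp = MissCorrelatingPrefetcher(history_len=2)
    trace = [0xA0, 0xB0, 0xC0] * 3  # hypothetical repeating miss pattern
    for miss_addr in trace:
        mcp.on_miss(miss_addr)
    print(f"accuracy={mcp.accuracy():.2f} coverage={mcp.coverage():.2f}")
```

Note that the sketch issues a prediction only at a miss, so its lookahead is bounded by the distance between consecutive misses. That is the first MCP shortcoming listed in the introduction, and it is the limitation that triggering prefetches on a dead-block prediction is meant to remove.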