Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage
ELEC 525, Spring 2005
Raj Bandyopadhyay, Mandy Liu, Nico Peña

Hypothesis

Some modern processors use a prefetching unit at the front end of the machine to fetch instructions and data into the cache ahead of their use, in order to reduce the latency of going to main memory for accesses that would otherwise miss in the cache. The prefetcher guesses what the next address of access will be; therefore, a good prefetcher should prefetch only those addresses that are actually used and not waste memory bandwidth by prefetching extraneous addresses.

The paper "Prefetching with Markov Predictors" by D. Joseph and D. Grunwald introduces the Markov prefetcher, a correlation-based prefetcher that uses the history and probability of previous memory accesses to build a table of future memory fetches [1]. Figure 1 shows a miss reference stream and how that stream can be depicted as a Markov chain. The predictor proposed by Joseph and Grunwald places the information from the chain into a table that can be used to predict the next memory reference.

Figure 1: An address reference string and the Markov model of the reference pattern [1]

The index into the Markov table is the current address. Each index (or entry) has an associated number of next states, or future addresses, to fetch when the search of the Markov table results in a "hit". These entries also carry data about the probability of each next state being the correct one, in the form of a hit counter per next state. In an ideal Markov prefetcher, there is an unlimited number of next states to fetch and the number of entries in the table is infinite; a real implementation must limit both. In our study, we investigate the tradeoff between the number of next states and the number of entries in the Markov table versus the coverage and accuracy of the prefetches, in order to determine the most efficient configuration for our implementation. We also look for ways to make the Markov prefetcher use less memory bandwidth by limiting the number of next states prefetched to fewer than the number of next states kept in the table. Similar to Joseph and Grunwald, we explore adding a stride prefetcher in front of the Markov prefetcher to filter out patterns that are more easily found by the stride prefetcher, and thus reduce mispredictions from the Markov prefetcher. Our implementation of the stride prefetcher is simplified from the one proposed by Grunwald and shows a significant gain in overall coverage and a reduction in mispredicted prefetches.

Markov Prefetcher Architecture

The Markov table and prefetch logic are inserted off-chip between the L2 cache and main memory (see Figure 2). Upon an L2 cache miss, the Markov table is searched. If the table search results in a "hit", all valid next states (addresses) are loaded into the on-chip Markov prefetch buffer (32 entries in our implementation), which sits at the same level as the L2 cache. The on-chip Markov prefetch buffer is searched on an L1 cache miss at the same time as the L2 cache. If the current address hits in the prefetch buffer, the valid next states in the Markov table are loaded into the prefetch buffer. An address can acquire only one valid next state for each time it is seen in the address stream. Thus, if we have seen an address only once before, we do not send out four prefetches for its next states when we see it again, since the table knows about only one possible next state for that address.
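To make the table organization concrete, the sketch below shows one way an entry with four next-state slots and per-slot hit counters could be represented and updated from the miss reference stream. This is a minimal illustration in C: the structure and function names, the direct-mapped indexing, and the least-hit replacement of a full slot set are our own assumptions, not the exact structures used in [1] or in our simulator.

```c
#include <stdint.h>
#include <string.h>

#define NUM_NEXT   4          /* next states kept per entry (the paper's ideal) */
#define TABLE_SIZE (1 << 16)  /* number of table entries; varied in our study   */

typedef struct {
    uint32_t tag;             /* current (missing) address                      */
    uint32_t next[NUM_NEXT];  /* candidate next addresses                       */
    uint8_t  hits[NUM_NEXT];  /* hit counters: likelihood of each next state    */
    uint8_t  valid[NUM_NEXT]; /* a slot becomes valid only once it is observed  */
} markov_entry_t;

static markov_entry_t table[TABLE_SIZE];

/* Direct-mapped index: low-order bits of the miss address (an assumption). */
static uint32_t index_of(uint32_t addr) { return addr % TABLE_SIZE; }

/* Record that next_addr followed prev_addr in the miss stream. */
void markov_update(uint32_t prev_addr, uint32_t next_addr)
{
    markov_entry_t *e = &table[index_of(prev_addr)];
    if (e->tag != prev_addr) {            /* allocate or overwrite the entry */
        memset(e, 0, sizeof(*e));
        e->tag = prev_addr;
    }
    for (int i = 0; i < NUM_NEXT; i++) {  /* next state already known?       */
        if (e->valid[i] && e->next[i] == next_addr) {
            if (e->hits[i] < 255) e->hits[i]++;
            return;
        }
    }
    for (int i = 0; i < NUM_NEXT; i++) {  /* otherwise take a free slot      */
        if (!e->valid[i]) {
            e->valid[i] = 1;
            e->next[i]  = next_addr;
            e->hits[i]  = 1;
            return;
        }
    }
    int victim = 0;                       /* all slots taken: replace the    */
    for (int i = 1; i < NUM_NEXT; i++)    /* least-hit next state            */
        if (e->hits[i] < e->hits[victim]) victim = i;
    e->next[victim] = next_addr;
    e->hits[victim] = 1;
}

/* On a miss to addr, return up to NUM_NEXT prefetch candidates. */
int markov_lookup(uint32_t addr, uint32_t out[NUM_NEXT])
{
    markov_entry_t *e = &table[index_of(addr)];
    int n = 0;
    if (e->tag != addr) return 0;         /* table miss: nothing to prefetch */
    for (int i = 0; i < NUM_NEXT; i++)
        if (e->valid[i]) out[n++] = e->next[i];
    return n;                             /* only as many as have been seen  */
}
```

Note that markov_lookup returns only the next states that have actually been observed, which matches the behavior described above: an address seen only once yields a single prefetch candidate rather than four.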
Regardless of the results of searching the prefetch buffer, if the address misses in the L2 cache it is forwarded to the Markov prefetcher as well as to main memory. If the address hits in the Markov table, the next states in the table are again loaded into the prefetch buffer, replacing the oldest entries in the prefetch buffer. We use a least recently used (LRU) replacement policy for our Markov prefetcher: each time there is a hit in the prefetch buffer, the LRU status of the corresponding entry in the Markov table is updated. The entries are not stored in address order in the Markov table, so when an entry is evicted the new entry simply overwrites the old one, with no reshuffling or shifting of the remaining entries.

Figure 2: Diagram of memory hierarchy with Markov prefetcher inserted

In our implementation, we use a direct-mapped 8KB L1 instruction cache and a direct-mapped 8KB L1 data cache. Each L1 cache block is 32 bytes with single-cycle latency. Our L2 cache is a direct-mapped, 1MB unified instruction/data cache with a block size of 128 bytes and 12-cycle latency. The Markov paper uses a 4MB L2 cache, but that paper was written in 1997 and used the SPEC95 benchmarks. We changed the size of the L2 cache to 1MB because the data sets of the SPEC2000 benchmarks fit too well in 4MB of cache, leaving too few L2 cache misses to demonstrate our concept. We think this is a reasonable change because most modern (2005) processors do not necessarily have as much as 4MB of L2 cache on chip.

Grunwald and Joseph implemented their Markov prefetcher for L1 cache misses; we implemented ours for L2 cache misses. See Figure 3 for the Markov paper's implementation. From Figure 3 it appears that the L2 cache and main memory are off-chip. Modern processors have the L2 cache on chip, so in our implementation we use L2 cache misses as our miss reference trace. Moving the prefetcher beyond the L2 cache would seem to cause a decrease in accuracy, since the address stream is interleaved with instruction- and data-cache accesses, which may not exhibit the patterns that arise when looking at a single cache as in [1].

Figure 3: Memory hierarchy with Markov prefetchers inserted [1]

The size of the Markov prefetch table in the paper is fixed at 1MB (2^20 bytes). The ideal number of next states, as determined by the Markov paper, is four. If each next state (address) is 4 bytes, then each table entry (an index plus four next states) occupies 2^2 * 5 = 20 bytes. Therefore the number of indices (entries) in the paper's Markov table is 2^20 / (2^2 * 5) ≈ 52,000 (between 2^15 and 2^16), or roughly 50k entries. The paper also simulates using 1, 2, 4, and 8 next states, where the number of table entries varies dynamically while the table size stays fixed at 1MB. In our experiments, we vary our table size, with our
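Returning to the on-chip prefetch buffer described earlier in this section, the sketch below models a 32-entry, fully associative buffer with LRU replacement. It covers only the buffer-side bookkeeping (lookup on an L1 miss and insertion of prefetched addresses); the names and the counter-based LRU stamps are our own illustrative assumptions rather than the exact mechanism of our simulator.

```c
#include <stdint.h>

#define PB_ENTRIES 32   /* on-chip Markov prefetch buffer size in our design */

typedef struct {
    uint32_t addr;      /* block address of a prefetched line                */
    uint8_t  valid;
    uint32_t lru;       /* larger stamp = more recently used                 */
} pb_entry_t;

static pb_entry_t buffer[PB_ENTRIES];
static uint32_t   lru_clock = 0;

/* Search the buffer on an L1 miss (in parallel with the L2 lookup).
 * A hit refreshes the entry's LRU stamp. Returns 1 on hit, 0 on miss. */
int pb_lookup(uint32_t addr)
{
    for (int i = 0; i < PB_ENTRIES; i++) {
        if (buffer[i].valid && buffer[i].addr == addr) {
            buffer[i].lru = ++lru_clock;   /* update LRU status on a hit */
            return 1;
        }
    }
    return 0;
}

/* Insert a prefetched address, evicting the oldest (least recently used)
 * entry when no free slot remains. */
void pb_insert(uint32_t addr)
{
    int victim = 0;
    for (int i = 0; i < PB_ENTRIES; i++) {
        if (!buffer[i].valid) { victim = i; break; }   /* prefer a free slot */
        if (buffer[i].lru < buffer[victim].lru) victim = i;
    }
    buffer[victim].addr  = addr;
    buffer[victim].valid = 1;
    buffer[victim].lru   = ++lru_clock;
}
```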

