Unformatted text preview:

Quantifying Load Stream Behavior Suleyman Sair Timothy Sherwood Brad Calder Department of Computer Science and Engineering University of California San Diego fssair sherwood calderg cs ucsd edu Abstract The increasing performance gap between processors and memory will force future architectures to devote signi cant resources towards removing and hiding memory latency The two major architectural features used to address this growing gap are caches and prefetching In this paper we perform a detailed quanti cation of the cache miss patterns for the Olden benchmarks SPEC 2000 benchmarks and a collection of pointer based applications We classify misses into one of four categories corresponding to the type of access pattern These are next line stride same object additional misses that occur to a recently accessed object or pointer based transitions We then propose and evaluate a hardware pro ling architecture to correctly identify which type of access pattern is being seen This access pattern identi cation could be used to help guide and allocate prefetching resources and provide information to feedback directed optimizations A second goal of this paper is to identify a suite of challenging pointer based benchmarks that can be used to focus the development of new software and hardware prefetching algorithms and identify the challenges in performing prefetching for these applications using new metrics 1 Introduction One of the most important impediments to current and future processor performance is the memory bottleneck Processor clock speeds are increasing at an exponential rate much faster than DRAM access time improvements 21 Cache memory hierarchies and data prefetching are the two primary techniques used in current processors to try and hide or eliminate memory latency Memory latency can be removed if data is found to be in the cache For data not in the cache data prefetching is used to hide the latency by beginning the fetch before it is needed Several models have been proposed for prefetching data to reduce or eliminate load latency These range from inserting compiler based prefetches to pure hardware based data prefetching Compiler based prefetching annotates load instructions or inserts explicit prefetch instructions to bring data into the cache before it is needed to hide the load latency These approaches use locality analysis to insert prefetch instructions showing signi cant improvements 14 Hardware based prefetching approaches are able to dynamically predict address streams and prefetch down them in ways that may be hard to do using compiler analysis In 18 we presented a prefetching architecture that uses a predictor directed stream buffer to prefetch down data miss streams independent of the instruction stream being executed Other hardware schemes attempt to pre compute the computation kernel on a separate thread or co processor to reduce the memory latency 1 6 12 We rst show how to classify load miss streams into different classes based on their miss access patterns We show for a large number of programs what types of accesses are causing misses so that they may be targeted for future research We further show how this classi cation can be done ef ciently in hardware with a high degree of accuracy so that architectural structures such as the caches or prefetching engines can be made access pattern aware We classify these loads into four types of access patterns or streams 1 next line 2 stride 3 same object additional misses to a recently referenced heap object or 4 pointer based misses Out of these four types of cache miss streams pointerbased streams can be the most dif cult to eliminate using existing hardware and software prefetching algorithms To better understand the behavior of these loads and their applications we examine two new metrics The rst metric Object Fan Out is used to quantify the number of pointers in an object that are transitioned and frequently miss in the cache The second metric Pointer Variability quanti es how many pointer transitions are stable versus how many are frequently changing A pointer transition is a load that loads a pointer Pointer variability shows how many times a pointer transition for a given address loads a pointer different from the last pointer that was loaded from that address Programs with low object fan out and pointer variability will be much easier to prefetch in comparison to programs that have high object fan out and variability and we show that a large percentage of misses in real programs fall into this second category An additional goal of this paper is to compare the behavior of Olden SPEC 2000 and set of additional pointerbased benchmarks This is to identify a suite of challenging pointer based benchmarks that can be used to focus the Proceedings of the Eighth International Symposium on High Performance Computer Architecture HPCA 02 1503 0897 02 17 00 2002 IEEE Program burg deltablue dot equake mcf sis vis Description A program that generates a fast tree parser using BURS technology It is commonly used to construct optimal instruction selectors for use in compiler code generation The input used was a grammar that scribes the VAX instruction set architecture A constraint solution system implemented in C It has an abundance of short lived heap objects Dot is taken from the AT T s GraphViz suite It is a tool for automatically making hierarchical layouts of directed graphs Automatic generation of graph drawings has important applications in key technologies such as database design software engineering VLSI and network design and visual interfaces in other domains Equake is from the SPEC 2000 benchmark suite The program simulates the propagation of elastic waves in large highly heterogeneous valleys such as California s San Fernando Valley or the Greater Los Angeles Basin The goal is to recover the time history of the ground motion everywhere within the valley due to a speci c seismic event Computations are performed on an unstructured mesh that locally resolves wavelengths using a nite element method Mcf is from the SPEC 2000 benchmark suite It is a combinatorial optimization algorithm solving a minimum cost network ow problem Synthesis of synchronous and asynchronous circuits It includes a number of capabilities such as state minimization and optimization The program has approximately 172 000 lines of source code and performs a lot of pointer arithmetic VIS Veri cation Interacting with Synthesis is a tool that


View Full Document

CMU CS 15740 - Quantifying Load Stream Behavior

Documents in this Course
leecture

leecture

17 pages

Lecture

Lecture

9 pages

Lecture

Lecture

36 pages

Lecture

Lecture

9 pages

Lecture

Lecture

13 pages

lecture

lecture

25 pages

lect17

lect17

7 pages

Lecture

Lecture

65 pages

Lecture

Lecture

28 pages

lect07

lect07

24 pages

lect07

lect07

12 pages

lect03

lect03

3 pages

lecture

lecture

11 pages

lecture

lecture

20 pages

lecture

lecture

11 pages

Lecture

Lecture

9 pages

Lecture

Lecture

10 pages

Lecture

Lecture

22 pages

Lecture

Lecture

28 pages

Lecture

Lecture

18 pages

lecture

lecture

63 pages

lecture

lecture

13 pages

Lecture

Lecture

36 pages

Lecture

Lecture

18 pages

Lecture

Lecture

17 pages

Lecture

Lecture

12 pages

lecture

lecture

34 pages

lecture

lecture

47 pages

lecture

lecture

7 pages

Lecture

Lecture

18 pages

Lecture

Lecture

7 pages

Lecture

Lecture

21 pages

Lecture

Lecture

10 pages

Lecture

Lecture

39 pages

Lecture

Lecture

11 pages

lect04

lect04

40 pages

Load more
Loading Unlocking...
Login

Join to view Quantifying Load Stream Behavior and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Quantifying Load Stream Behavior and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?