Quantifying Load Stream Behavior

Suleyman Sair    Timothy Sherwood    Brad Calder
Department of Computer Science and Engineering
University of California, San Diego
{ssair, sherwood, calder}[email protected]

Abstract

The increasing performance gap between processors and memory will force future architectures to devote significant resources towards removing and hiding memory latency. The two major architectural features used to address this growing gap are caches and prefetching.

In this paper we perform a detailed quantification of the cache miss patterns for the Olden benchmarks, SPEC 2000 benchmarks, and a collection of pointer-based applications. We classify misses into one of four categories corresponding to the type of access pattern: next-line, stride, same-object (additional misses that occur to a recently accessed object), or pointer-based transitions. We then propose and evaluate a hardware profiling architecture to correctly identify which type of access pattern is being seen. This access pattern identification could be used to help guide and allocate prefetching resources, and to provide information to feedback-directed optimizations.

A second goal of this paper is to identify a suite of challenging pointer-based benchmarks that can be used to focus the development of new software and hardware prefetching algorithms, and to identify, using new metrics, the challenges in performing prefetching for these applications.

1 Introduction

One of the most important impediments to current and future processor performance is the memory bottleneck. Processor clock speeds are increasing at an exponential rate, much faster than DRAM access time improvements [21]. Cache memory hierarchies and data prefetching are the two primary techniques used in current processors to try to hide or eliminate memory latency. Memory latency can be removed if data is found in the cache.
For data not in the cache, data prefetching is used to hide the latency by beginning the fetch before the data is needed.

Several models have been proposed for prefetching data to reduce or eliminate load latency. These range from inserting compiler-based prefetches to pure hardware-based data prefetching. Compiler-based prefetching annotates load instructions or inserts explicit prefetch instructions to bring data into the cache before it is needed, hiding the load latency. These approaches use locality analysis to insert prefetch instructions, showing significant improvements [14]. Hardware-based prefetching approaches are able to dynamically predict address streams and prefetch down them in ways that may be hard to achieve using compiler analysis. In [18], we presented a prefetching architecture that uses a predictor-directed stream buffer to prefetch down data miss streams independent of the instruction stream being executed. Other hardware schemes attempt to pre-compute the computation kernel on a separate thread or co-processor to reduce the memory latency [1, 6, 12].

We first show how to classify load miss streams into different classes based on their miss access patterns. We show, for a large number of programs, what types of accesses are causing misses so that they may be targeted by future research. We further show how this classification can be done efficiently in hardware with a high degree of accuracy, so that architectural structures such as caches or prefetching engines can be made access-pattern aware. We classify these loads into four types of access patterns or streams: (1) next-line, (2) stride, (3) same-object (additional misses to a recently referenced heap object), or (4) pointer-based misses.

Of these four types of cache miss streams, pointer-based streams can be the most difficult to eliminate using existing hardware and software prefetching algorithms. To better understand the behavior of these loads and their applications, we examine two new metrics.
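As an informal illustration of this four-way classification, the sketch below labels a single load's miss stream offline from a trace of its miss addresses. The function name, the 64-byte line size, and the delta-based rules are our assumptions for exposition, not the paper's hardware mechanism; the same-object category is omitted because detecting it requires heap-object bounds that a raw address trace does not carry.

```python
LINE_SIZE = 64  # assumed cache line size in bytes

def classify_stream(addresses):
    """Label one load's miss stream as 'next-line', 'stride', or
    'pointer' from the deltas between successive miss addresses.
    (Illustrative simplification of the paper's classification;
    'same-object' misses are not modeled here.)"""
    deltas = [b - a for a, b in zip(addresses, addresses[1:])]
    if not deltas:
        # Too few misses to classify; fall back to the catch-all class.
        return 'pointer'
    if all(0 < d <= LINE_SIZE for d in deltas):
        # Every miss lands on the same or the immediately following line.
        return 'next-line'
    if all(d == deltas[0] for d in deltas):
        # Constant non-unit delta between misses.
        return 'stride'
    # Irregular deltas: treated as pointer-based.
    return 'pointer'
```

A real implementation would track these deltas per static load PC in hardware; this offline version only shows the intent of each category.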
The first metric, Object Fan Out, is used to quantify the number of pointers in an object that are transitioned and that frequently miss in the cache. The second metric, Pointer Variability, quantifies how many pointer transitions are stable versus how many are frequently changing. A pointer transition is a load that loads a pointer. Pointer variability shows how many times a pointer transition for a given address loads a pointer different from the last pointer that was loaded from that address. Programs with low object fan out and pointer variability will be much easier to prefetch than programs with high object fan out and variability, and we show that a large percentage of misses in real programs fall into this second category.

An additional goal of this paper is to compare the behavior of Olden, SPEC 2000, and a set of additional pointer-based benchmarks. This is to identify a suite of challenging pointer-based benchmarks that can be used to focus the development of new software and hardware prefetching algorithms.

Proceedings of the Eighth International Symposium on High-Performance Computer Architecture (HPCA'02) 1503-0897/02 $17.00 © 2002 IEEE

Table: Program Descriptions

burg: A program that generates a fast tree parser using BURS technology. It is commonly used to construct optimal instruction selectors for use in compiler code generation. The input used was a grammar that describes the VAX instruction set architecture.

deltablue: A constraint solution system implemented in C++. It has an abundance of short-lived heap objects.

dot: Dot is taken from AT&T's GraphViz suite. It is a tool for automatically making hierarchical layouts of directed graphs. Automatic generation of graph drawings has important applications in key technologies such as database design, software engineering, VLSI and network design, and visual interfaces in other domains.

equake: Equake is from the SPEC 2000 benchmark suite.
The program simulates the propagation of elastic waves in large, highly heterogeneous valleys, such as California's San Fernando Valley or the Greater Los Angeles Basin. The goal is to recover the time history of the ground motion everywhere within the valley due to a specific seismic event. Computations are performed on an unstructured mesh that locally resolves wavelengths, using a finite element method.

mcf: Mcf is from the SPEC 2000 benchmark suite. It is a combinatorial optimization algorithm solving a minimum-cost network flow problem.

sis: Synthesis of synchronous and asynchronous circuits. It includes a number of capabilities such as state minimization and optimization. The program has approximately 172,000 lines of source code and
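The Pointer Variability metric defined earlier can also be computed offline from a trace of pointer transitions. The following sketch is our own illustration, not the paper's measurement infrastructure: it takes (load address, loaded pointer value) pairs in program order and reports the fraction of transitions whose loaded pointer differs from the last pointer loaded from the same address.

```python
def pointer_variability(transitions):
    """Fraction of pointer transitions that are 'changing': a load
    address yields a pointer different from the last pointer loaded
    from that same address. First access to an address establishes
    the baseline and is not counted. (Illustrative sketch.)"""
    last = {}            # load address -> last pointer value seen
    changed = total = 0
    for addr, ptr in transitions:
        if addr in last:
            total += 1
            if last[addr] != ptr:
                changed += 1
        last[addr] = ptr
    return changed / total if total else 0.0
```

Under this definition, a program whose data structures are built once and then traversed repeatedly shows variability near zero, while one that rewrites its links between traversals approaches one, matching the intuition that stable transitions are easier to prefetch.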