DOC PREVIEW
CMU CS 15740 - Quantifying Load Stream Behavior

This preview shows page 1-2-3-4 out of 12 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 12 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 12 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 12 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 12 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 12 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Quantifying Load Stream BehaviorSuleyman Sair Timothy Sherwood Brad CalderDepartment of Computer Science and EngineeringUniversity of California, San Diegofssair,sherwood,[email protected] increasing performance gap between processors andmemory will force future architectures to devote significantresources towards removing and hiding memory latency. Thetwo major architectural features used to address this grow-ing gap are caches and prefetching.In this paper we perform a detailed quantification of thecache miss patterns for the Olden benchmarks, SPEC 2000benchmarks, and a collection of pointer based applications.We classify misses into one of four categories correspond-ing to the type of access pattern. These are next-line, stride,same-object (additional misses that occur to a recently ac-cessed object), or pointer-based transitions. We then pro-pose and evaluate a hardware profiling architecture to cor-rectly identify which type of access pattern is being seen.This access pattern identification could be used to help guideand allocate prefetching resources, and provide informationto feedback-directed optimizations.A second goal of this paper is to identify a suite of chal-lenging pointer-based benchmarks that can be used to fo-cus the development of new software and hardware prefetch-ing algorithms, and identify the challenges in performingprefetching for these applications using new metrics.1 IntroductionOne of the most important impediments to current and futureprocessor performance is the memory bottleneck. Proces-sor clock speeds are increasing at an exponential rate, muchfaster than DRAM access time improvements [21]. Cachememory hierarchies and data prefetching are the two primarytechniques used in current processors to try and hide or elim-inate memory latency. Memory latency can be removed ifdata is found to be in the cache. For data not in the cache,data prefetching is used to hide the latency by beginning thefetch before it is needed.Several models have been proposed for prefetching datato reduce or eliminate load latency. These range from in-serting compiler-based prefetches to pure hardware-baseddata prefetching. Compiler-based prefetching annotatesload instructions or inserts explicit prefetch instructions tobring data into the cache before it is needed to hide theload latency. These approaches use locality analysis toinsert prefetch instructions, showing significant improve-ments [14]. Hardware-based prefetching approaches are ableto dynamically predict address streams and prefetch downthem in ways that may be hard to do using compiler anal-ysis. In [18], we presented a prefetching architecture thatuses a predictor-directed stream buffer to prefetch down datamiss streams independent of the instruction stream being ex-ecuted. Other hardware schemes attempt to pre-compute thecomputation kernel on a separate thread or co-processor toreduce the memory latency [1, 6, 12].We first show how to classify load miss streams into dif-ferent classes based on their miss access patterns. We showfor a large number of programs what types of accesses arecausing misses so that they may be targeted for future re-search. We further show how this classification can be doneefficiently in hardware with a high degree of accuracy, sothat architectural structures such as the caches or prefetch-ing engines can be made access pattern aware. We classifythese loads into four types of access patterns or streams – (1)next-line, (2) stride, (3) same-object (additional misses to arecently referenced heap object), or (4) pointer-based misses.Out of these four types of cache miss streams, pointer-based streams can be the most difficult to eliminate usingexisting hardware and software prefetching algorithms. Tobetter understand the behavior of these loads and their appli-cations we examine two new metrics. The first metric, Ob-ject Fan Out, is used to quantify the number of pointers in anobject that are transitioned and frequently miss in the cache.The second metric, Pointer Variability, quantifies how manypointer transitions are stable versus how many are frequentlychanging. A pointer transition is a load that loads a pointer. Pointer variability shows how many times a pointer transi-tion for a given address loads a pointer different from thelast pointer that was loaded from that address. Programs withlow object fan out and pointer variability will be much easierto prefetch, in comparison to programs that have high objectfan out and variability, and we show that a large percentageof misses in real programs fall into this second category.An additional goal of this paper is to compare the be-havior of Olden, SPEC 2000, and set of additional pointer-based benchmarks. This is to identify a suite of challeng-ing pointer-based benchmarks that can be used to focus theProceedings of the Eighth International Symposium on High-Performance Computer Architecture (HPCA’02) 1503-0897/02 $17.00 © 2002 IEEEProgram Descriptionburg A program that generates a fast tree parser using BURStechnology. It is commonly used to construct optimal in-struction selectors for use in compiler code generation.The input used was a grammar that scribes the VAX in-struction set architecture.deltablue A constraint solution system implemented in C++. It hasan abundance of short lived heap objects.dot Dot is taken from the AT&T’s GraphViz suite. It is a toolfor automatically making hierarchical layouts of directedgraphs. Automatic generation of graph drawings has im-portant applications in key technologies such as databasedesign, software engineering, VLSI and network designand visual interfaces in other domains.equake Equake is from the SPEC 2000 benchmark suite. Theprogram simulates the propagation of elastic waves inlarge, highly heterogeneous valleys, such as California’sSan Fernando Valley, or the Greater Los Angeles Basin.The goal is to recover the time history of the ground mo-tion everywhere within the valley due to a specific seis-mic event. Computations are performed on an unstruc-tured mesh that locally resolves wavelengths, using a fi-nite element method.mcf Mcf is from the SPEC 2000 benchmark suite. It isa combinatorial optimization algorithm solving a mini-mum cost network flow problem.sis Synthesis of synchronous and asynchronous circuits. Itincludes a number of capabilities such as state mini-mization and optimization. The program has approxi-mately 172,000 lines of source code and


View Full Document

CMU CS 15740 - Quantifying Load Stream Behavior

Documents in this Course
leecture

leecture

17 pages

Lecture

Lecture

9 pages

Lecture

Lecture

36 pages

Lecture

Lecture

9 pages

Lecture

Lecture

13 pages

lecture

lecture

25 pages

lect17

lect17

7 pages

Lecture

Lecture

65 pages

Lecture

Lecture

28 pages

lect07

lect07

24 pages

lect07

lect07

12 pages

lect03

lect03

3 pages

lecture

lecture

11 pages

lecture

lecture

20 pages

lecture

lecture

11 pages

Lecture

Lecture

9 pages

Lecture

Lecture

10 pages

Lecture

Lecture

22 pages

Lecture

Lecture

28 pages

Lecture

Lecture

18 pages

lecture

lecture

63 pages

lecture

lecture

13 pages

Lecture

Lecture

36 pages

Lecture

Lecture

18 pages

Lecture

Lecture

17 pages

Lecture

Lecture

12 pages

lecture

lecture

34 pages

lecture

lecture

47 pages

lecture

lecture

7 pages

Lecture

Lecture

18 pages

Lecture

Lecture

7 pages

Lecture

Lecture

21 pages

Lecture

Lecture

10 pages

Lecture

Lecture

39 pages

Lecture

Lecture

11 pages

lect04

lect04

40 pages

Load more
Download Quantifying Load Stream Behavior
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Quantifying Load Stream Behavior and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Quantifying Load Stream Behavior 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?