CMU CS 15740 - roth_isca99


Abstract

Current techniques for prefetching linked data structures (LDS) exploit the work available in one loop iteration or recursive call to overlap pointer-chasing latency. Jump-pointers, which provide direct access to non-adjacent nodes, can be used for prefetching when loop and recursive procedure bodies are small and do not have sufficient work to overlap a long latency. This paper describes a framework for jump-pointer prefetching (JPP) that supports four prefetching idioms: queue, full, chain, and root jumping, and three implementations: software-only, hardware-only, and a cooperative software/hardware technique. On a suite of pointer-intensive programs, jump-pointer prefetching reduces memory stall time by 72% for software, 83% for cooperative, and 55% for hardware, producing speedups of 15%, 20%, and 22% respectively.

1 Introduction

Linked data structures (LDS) are common in many applications, and their importance is growing with the spread of object-oriented programming. The popularity of LDS stems from their flexibility, not their performance. LDS access, often referred to as pointer-chasing, entails chains of data-dependent loads that serialize address generation and memory access. In traversing an LDS, these loads often form the program's critical path. Consequently, when they miss in the cache, they can severely limit parallelism and degrade performance.

Prefetching is one way to hide LDS load latency and recover performance. Address-prediction-based techniques can generate addresses in non-serial fashion, prefetch nodes arbitrarily far ahead of their anticipated use, and tolerate long latencies. However, LDS access streams rarely display the high levels of arithmetic regularity required to support accurate address prediction. Recently proposed scheduling-based techniques [11, 16] prefetch nodes serially but attack issue delays that aggravate serialized latencies by issuing LDS loads as soon as their inputs are ready.
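The serialization the paper describes can be seen in an ordinary list traversal: the address of each load is the result of the previous load, so no address can be computed ahead of time. The node layout below is a minimal sketch for illustration, not code from the paper.

```c
#include <stddef.h>
#include <stdlib.h>

/* A minimal list node: each traversal step loads l->next, and that
 * load's address is the result of the previous load, so the loads
 * serialize -- this is the pointer-chasing critical path. */
typedef struct node {
    int          data;
    struct node *next;
} node;

/* Helper for illustration: build a list of n nodes valued 0..n-1. */
node *make_list(int n) {
    node *head = NULL;
    for (int i = n - 1; i >= 0; i--) {
        node *t = malloc(sizeof *t);
        t->data = i;
        t->next = head;
        head = t;
    }
    return head;
}

/* Each iteration's address generation waits on the prior iteration's
 * induction load (l = l->next); a cache miss on any link stalls every
 * subsequent link. */
int sum_list(node *l) {
    int sum = 0;
    for (; l; l = l->next)
        sum += l->data;
    return sum;
}
```

An address predictor would have to guess these node addresses arithmetically, which heap-allocated lists rarely permit; a scheduler can only issue each `l->next` as early as the previous one completes.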
Scheduling methods can pre-calculate LDS addresses accurately, but their pace is dictated by the critical path through the pointer chain. Scheduling methods are inadequate when the amount of work available for overlapping with the critical chain is limited, due to either a tight loop or a slow memory. Handling these situations, which will worsen as the processor/memory speed gap grows, requires a mechanism that can address and prefetch arbitrary LDS nodes.

To illustrate our point, Figure 1(a) shows a list traversal loop (e.g., for (l = list; l; l = l->next) ...) with the long latency of the induction loads (instances of l = l->next) exposed. Scheduling methods hide this latency by issuing the induction load early in the iteration (Figure 1(b)). For short iterations or long latencies (Figure 1(c)), an induction load will stall the next iteration no matter how early within its own iteration it issues. For full efficiency, it must be overlapped with work from multiple iterations.

We present a method for overlapping LDS load latency with the work of multiple iterations via the structured use of jump-pointers. Jump-pointers are used strictly for prefetching. Residing at some or all LDS nodes, they point to nodes that are likely to be accessed in the near future, not ones that are functionally adjacent. As shown in Figure 1(d), jump-pointer prefetching (JPP) overcomes the serial nature of LDS address generation and obtains the address of an otherwise non-adjacent LDS node via a single low-latency lookup. This in turn allows us to overlap the access latencies of multiple nodes, or equivalently, to overlap the latency of one node with multiple iterations. Our general framework combines jump-pointer prefetching with chained prefetching, which uses the pointers available in the original unmodified program.
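A software rendering of the Figure 1(d) idea might look like the sketch below. The `jump` field, the prefetch distance, and the use of the GCC/Clang `__builtin_prefetch` intrinsic are all assumptions for illustration; how jump-pointers are installed and maintained is covered later in the paper.

```c
#include <stddef.h>

/* Node augmented with a jump-pointer used only for prefetching:
 * jump points several nodes ahead in the expected traversal order
 * (field name and distance are illustrative). */
typedef struct jnode {
    int           data;
    struct jnode *next;   /* functional pointer: defines the structure   */
    struct jnode *jump;   /* prefetch-only pointer to a non-adjacent node */
} jnode;

long traverse_with_jpp(jnode *l) {
    long sum = 0;
    for (; l; l = l->next) {
        if (l->jump)
            /* Non-binding prefetch of a node several iterations ahead;
             * its miss latency overlaps the work of those iterations
             * instead of just the current one. */
            __builtin_prefetch(l->jump, /*rw=*/0, /*locality=*/1);
        sum += l->data;   /* per-iteration work */
    }
    return sum;
}
```

The key property is that the prefetch address comes from a single low-latency load of `l->jump`, not from chasing the intervening `next` links, so the latencies of multiple future nodes can be in flight at once.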
We show that jump-pointer prefetching and chained prefetching can be combined in different ways to create four prefetching idioms, which we call queue jumping, full jumping, chain jumping, and root jumping. Since both jump-pointer prefetching and chained prefetching can be implemented in either hardware or software, each idiom can be instantiated in one of three implementations: software, hardware, and cooperative. The cooperative scheme handles jump-pointer prefetching in software and chained prefetching in hardware.

Figure 1. Hiding LDS load latency. (a) Exposed induction load latency can be hidden by (b) scheduling it early in an iteration. (c) This approach is ineffective if a single iteration has insufficient work. (d) Jump-pointers can leverage the work of multiple iterations.

Effective Jump-Pointer Prefetching for Linked Data Structures
Amir Roth and Gurindar S. Sohi
Computer Sciences Department, University of Wisconsin, Madison
{amir, sohi}@cs.wisc.edu

Each idiom/implementation combination has advantages and drawbacks that make it suitable for certain kinds of LDS traversals. We study a set of pointer-intensive benchmarks and attempt to isolate the program features that best guide idiom and implementation selection. Our experiments show that software, cooperative, and hardware prefetching eliminate an average of 72%, 83%, and 55% of the total memory stall time in these programs, translating into speedups of 15%, 20%, and 22% respectively. This is a significant improvement over other known schemes.

The rest of the paper is organized as follows. The next section presents our JPP framework and a benchmark characterization. The three implementations are described in Section 3 and evaluated in Section 4.
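As one reading of the chain-jumping idiom above: a jump-pointer prefetch covers a future node, and a chained prefetch then follows that node's ordinary pointers (here, a payload pointer). The node layout and field names are hypothetical, and this is an all-software rendering; note in the sketch why the cooperative split mentioned above is attractive.

```c
#include <stddef.h>

struct payload { int value; };

/* A list whose nodes carry an out-of-node payload, plus a
 * prefetch-only jump-pointer (layout is illustrative). */
typedef struct cnode {
    struct payload *item;   /* functional pointer followed on every visit */
    struct cnode   *next;
    struct cnode   *jump;   /* prefetch-only */
} cnode;

long visit_all(cnode *l) {
    long sum = 0;
    for (; l; l = l->next) {
        if (l->jump) {
            __builtin_prefetch(l->jump, 0, 1);        /* jump-pointer prefetch */
            /* Chained prefetch off the prefetched node, using a pointer
             * already in the structure. In software, forming this address
             * dereferences l->jump and can itself stall -- one motivation
             * for letting hardware issue the chained part instead. */
            __builtin_prefetch(l->jump->item, 0, 1);
        }
        sum += l->item->value;
    }
    return sum;
}
```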
The last sections discuss related and future work and our conclusions.

2 Jump-pointer Prefetching Framework

Our prefetching framework can be described in terms of two building blocks: jump-pointer prefetches and chained prefetches. Jump-pointers are pointers added to the program's data structures for prefetching purposes only. We say that a jump-pointer resides in a home node and points to a target node. Jump-pointer prefetches prefetch target nodes using the jump-pointer at the home node. For prefetching to succeed, the target of a jump-pointer must point to a node that is likely to be referenced some time after the corresponding home node. Chained prefetches, on the other hand, do not require jump-pointers; they prefetch using the original pointers in the structure. Each of these types of prefetch provides different benefits and has different associated performance costs. Jump-pointer prefetches can prefetch arbitrary LDS nodes, hide arbitrary…
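One simple way to establish the home/target relationship described above is to set jump-pointers during an initial traversal, using a small history queue of recently visited nodes: the node falling out of the queue becomes the home, and the current node becomes its target. This sketch is an illustrative installation pass under assumed names and a fixed distance, not the paper's implementation.

```c
#include <stddef.h>

#define JUMP_DIST 4   /* prefetch distance d (illustrative) */

typedef struct qnode {
    int           data;
    struct qnode *next;
    struct qnode *jump;   /* home node's prefetch-only pointer to its target */
} qnode;

/* A circular buffer holds the last JUMP_DIST nodes visited. When a node
 * falls out of the buffer (the home node), it receives a jump-pointer to
 * the node being visited now (the target), so each jump-pointer spans
 * exactly d links of the traversal order. Assumes jump fields start NULL. */
void install_jump_pointers(qnode *l) {
    qnode *hist[JUMP_DIST] = {0};
    int i = 0;
    for (; l; l = l->next) {
        qnode *home = hist[i];
        if (home)
            home->jump = l;   /* target is d nodes past its home */
        hist[i] = l;
        i = (i + 1) % JUMP_DIST;
    }
}
```

If the structure is later traversed in the same order, each home node's jump-pointer names a node d iterations ahead, which is exactly what a jump-pointer prefetch needs; targets near the end of the list simply have no jump-pointer.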

