Abstract

Current techniques for prefetching linked data structures (LDS) exploit the work available in one loop iteration or recursive call to overlap pointer-chasing latency. Jump-pointers, which provide direct access to non-adjacent nodes, can be used for prefetching when loop and recursive procedure bodies are small and do not have sufficient work to overlap a long latency. This paper describes a framework for jump-pointer prefetching (JPP) that supports four prefetching idioms: queue, full, chain, and root jumping, and three implementations: software-only, hardware-only, and a cooperative software/hardware technique. On a suite of pointer-intensive programs, jump-pointer prefetching reduces memory stall time by 72% for software, 83% for cooperative, and 55% for hardware, producing speedups of 15%, 20%, and 22% respectively.

1 Introduction

Linked data structures (LDS) are common in many applications, and their importance is growing with the spread of object-oriented programming. The popularity of LDS stems from their flexibility, not their performance. LDS access, often referred to as pointer chasing, entails chains of data-dependent loads that serialize address generation and memory access. In traversing an LDS, these loads often form the program's critical path. Consequently, when they miss in the cache, they can severely limit parallelism and degrade performance.

Prefetching is one way to hide LDS load latency and recover performance. Address-prediction-based techniques can generate addresses in non-serial fashion, prefetch nodes arbitrarily far ahead of their anticipated use, and tolerate long latencies. However, LDS access streams rarely display the high levels of arithmetic regularity required to support accurate address prediction. Recently proposed scheduling-based techniques [11, 16] prefetch nodes serially but attack issue delays that aggravate serialized latencies by issuing LDS loads as soon as their inputs are ready.
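The serialization at issue can be made concrete with a minimal C sketch (the node layout and names are illustrative, not taken from the paper's benchmarks): each induction load `n = n->next` consumes the value produced by the previous one, so the address of node i+1 is unknown until node i has been fetched from memory.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative list node; not a structure from the paper. */
typedef struct node {
    int value;
    struct node *next;
} node;

/* Each induction load (n = n->next) is data-dependent on the previous
 * one, so address generation and memory access are fully serialized:
 * a cache miss on one node delays discovery of every later address. */
int sum_list(node *n) {
    int sum = 0;
    for (; n != NULL; n = n->next)
        sum += n->value;
    return sum;
}
```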
Scheduling methods can pre-calculate LDS addresses accurately, but their pace is dictated by the critical path through the pointer chain. Scheduling methods are inadequate when the amount of work available for overlapping with the critical chain is limited, due to either a tight loop or a slow memory. Handling these situations, which will worsen as the processor/memory speed gap grows, requires a mechanism that can address and prefetch arbitrary LDS nodes.

To illustrate our point, Figure 1(a) shows a list traversal loop (e.g., for (l = list; l; l = l->next) ...) with the long latency of the induction loads (instances of l = l->next) exposed. Scheduling methods hide this latency by issuing the induction load early in the iteration (Figure 1(b)). For short iterations or long latencies (Figure 1(c)), an induction load will stall the next iteration no matter how early within its own iteration it issues. For full efficiency, it must be overlapped with work from multiple iterations.

We present a method for overlapping LDS load latency with the work of multiple iterations via the structured use of jump-pointers. Jump-pointers are used strictly for prefetching. Residing at some or all LDS nodes, they point to nodes that are likely to be accessed in the near future, not ones that are functionally adjacent. As shown in Figure 1(d), jump-pointer prefetching (JPP) overcomes the serial nature of LDS address generation and obtains the address of an otherwise non-adjacent LDS node via a single low-latency lookup. This in turn allows us to overlap the access latencies of multiple nodes, or equivalently, to overlap the latency of one node with multiple iterations. Our general framework combines jump-pointer prefetching with chained prefetching, which uses the pointers available in the original, unmodified program.
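A software sketch of how jump-pointers might be installed and used follows. A small circular history queue gives each node a pointer to the node visited a fixed number of iterations later, and the traversal issues a prefetch through that pointer each iteration. The GCC/Clang-style `__builtin_prefetch` and the prefetch distance of 4 are assumptions of this sketch, not parameters taken from the paper.

```c
#include <assert.h>
#include <stddef.h>

#define JUMP_DIST 4  /* illustrative prefetch distance */

typedef struct node {
    int value;
    struct node *next;  /* original program pointer */
    struct node *jump;  /* jump-pointer, used only for prefetching */
} node;

/* Install jump-pointers with a circular history queue: the node
 * visited JUMP_DIST iterations ago (the home node) gets a pointer
 * to the current node (the target node). */
void install_jump_pointers(node *head) {
    node *queue[JUMP_DIST] = {0};
    size_t i = 0;
    for (node *n = head; n != NULL; n = n->next) {
        node *home = queue[i % JUMP_DIST];
        if (home != NULL)
            home->jump = n;  /* home node -> target node */
        queue[i % JUMP_DIST] = n;
        i++;
    }
}

/* Traversal that issues a jump-pointer prefetch each iteration,
 * overlapping one node's miss latency with JUMP_DIST iterations. */
int sum_with_prefetch(node *n) {
    int sum = 0;
    for (; n != NULL; n = n->next) {
        if (n->jump != NULL)
            __builtin_prefetch(n->jump);  /* GCC/Clang builtin */
        sum += n->value;
    }
    return sum;
}
```

The prefetch is purely a hint; correctness of the traversal never depends on the jump-pointers being present or accurate.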
We show that jump-pointer prefetching and chained prefetching can be combined in different ways to create four prefetching idioms, which we call queue jumping, full jumping, chain jumping, and root jumping. Since both jump-pointer prefetching and chained prefetching can be implemented in either hardware or software, each idiom can be instantiated in one of three implementations: software, hardware, and cooperative. The cooperative scheme handles jump-pointer prefetching in software and chained prefetching in hardware.

Figure 1. Hiding LDS load latency. (a) Exposed induction load latency can be hidden by (b) scheduling it early in an iteration. (c) This approach is ineffective if a single iteration has insufficient work. (d) Jump-pointers can leverage the work of multiple iterations. (Legend: compute, stall, LDS induction load, loop iteration, jump-pointer prefetch.)

Effective Jump-Pointer Prefetching for Linked Data Structures
Amir Roth and Gurindar S. Sohi
Computer Sciences Department, University of Wisconsin, Madison
{amir, sohi}@cs.wisc.edu

Each idiom/implementation combination has advantages and drawbacks that make it suitable for certain kinds of LDS traversals. We study a set of pointer-intensive benchmarks and attempt to isolate the program features that best guide idiom and implementation selection. Our experiments show that software, cooperative, and hardware prefetching eliminate an average of 72%, 83%, and 55% of the total memory stall time in these programs, translating into speedups of 15%, 20%, and 22% respectively. This is a significant improvement over other known schemes.

The rest of the paper is organized as follows. The next section presents our JPP framework and a benchmark characterization. The three implementations are described in Section 3 and evaluated in Section 4.
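As a sketch of how one of these idioms, chain jumping, might combine the two building blocks in software: a jump-pointer prefetch fetches a distant target node, and a chained prefetch then follows the target's original `next` pointer, so not every prefetched node needs its own jump-pointer. Everything below is illustrative (node layout, the GCC/Clang `__builtin_prefetch` builtin, and the returned prefetch count, which exists only to make the sketch observable); the paper's hardware variants issue the chained prefetches automatically.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative node: `next` is the original pointer, `jump` is the
 * prefetch-only jump-pointer added to the structure. */
typedef struct node {
    int value;
    struct node *next;
    struct node *jump;
} node;

/* Chain jumping (sketch): prefetch the jump target, then issue a
 * chained prefetch through the target's own `next` pointer. Note the
 * chained prefetch must read t->next, a load that ideally hits in
 * cache because an earlier jump-pointer prefetch covered the target.
 * Returns the number of jump-pointer prefetches issued. */
size_t traverse_chain_jumping(node *n) {
    size_t jump_prefetches = 0;
    for (; n != NULL; n = n->next) {
        node *t = n->jump;
        if (t != NULL) {
            __builtin_prefetch(t);            /* jump-pointer prefetch */
            if (t->next != NULL)
                __builtin_prefetch(t->next);  /* chained prefetch */
            jump_prefetches++;
        }
        /* ... per-node work would go here ... */
    }
    return jump_prefetches;
}
```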
The last sections discuss related and future work and our conclusions.

2 Jump-pointer Prefetching Framework

Our prefetching framework can be described in terms of two building blocks: jump-pointer prefetches and chained prefetches. Jump-pointers are pointers added to the program's data structures for prefetching purposes only. We say that a jump-pointer resides in a home node and points to a target node. Jump-pointer prefetches prefetch target nodes using the jump-pointer at the home node. For prefetching to succeed, the jump-pointer must point to a node that is likely to be referenced some time after the corresponding home node. Chained prefetches, on the other hand, do not require jump-pointers; they prefetch using the original pointers in the structure. Each of these types of prefetch provides different benefits and has different associated performance costs. Jump-pointer prefetches can prefetch arbitrary LDS nodes, hide arbitrary