Effective Jump Pointer Prefetching for Linked Data Structures Amir Roth and Gurindar S Sohi Computer Sciences Department University of Wisconsin Madison amir sohi cs wisc edu Abstract Current techniques for prefetching linked data structures LDS exploit the work available in one loop iteration or recursive call to overlap pointer chasing latency Jumppointers which provide direct access to non adjacent nodes can be used for prefetching when loop and recursive procedure bodies are small and do not have sufficient work to overlap a long latency This paper describes a framework for jump pointer prefetching JPP that supports four prefetching idioms queue full chain and root jumping and three implementations software only hardware only and a cooperative software hardware technique On a suite of pointer intensive programs jumppointer prefetching reduces memory stall time by 72 for software 83 for cooperative and 55 for hardware producing speedups of 15 20 and 22 respectively of the induction loads instances of l l next exposed Scheduling methods hide this latency by issuing the induction load early in the iteration Figure 1 b For short iterations or long latencies Figure 1 c an induction load will stall the next iteration no matter how early within its own iteration it issues For full efficiency it must be overlapped with work from multiple iterations a loop iteration b c d 1 Introduction Linked data structures LDS are common in many applications and their importance is growing with the spread of object oriented programming The popularity of LDS stems from their flexibility not their performance LDS access often referred to as pointer chasing entails chains of data dependent loads that serialize address generation and memory access In traversing an LDS these loads often form the program s critical path Consequently when they miss in the cache they can severely limit parallelism and degrade performance Prefetching is one way to hide LDS load latency and recover performance Address prediction based techniques can generate addresses in non serial fashion prefetch nodes arbitrarily far ahead of their anticipated use and tolerate long latencies However LDS access streams rarely display the high levels of arithmetic regularity required to support accurate address prediction Recently proposed scheduling based techniques 11 16 prefetch nodes serially but attack issue delays that aggravate serialized latencies by issuing LDS loads as soon as their inputs are ready Scheduling methods can pre calculate LDS addresses accurately but their pace is dictated by the critical path through the pointer chain Scheduling methods are inadequate when the amount of work available for overlapping with the critical chain is limited due to either a tight loop or a slow memory Handling these situations which will worsen as the processor memory speed gap grows requires a mechanism that can address and prefetch arbitrary LDS nodes To illustrate our point Figure 1 a shows a list traversal loop e g for l list l l l next with the long latency compute jump pointer prefetch stall LDS induction load Figure 1 Hiding LDS load latency a Exposed induction load latency can be hidden by b scheduling it early in an iteration c This approach is ineffective if a single iteration has insufficient work d Jumppointers can leverage the work of multiple iterations We present a method for overlapping LDS load latency with the work of multiple iterations via the structured use of jump pointers Jump pointers are used strictly for prefetching Residing at some or all LDS nodes they point to nodes that are likely to be accessed in the near future not ones that are functionally adjacent As shown in figure 1 d jump pointer prefetching JPP overcomes the serial nature of LDS address generation and obtains the address of an otherwise non adjacent LDS node via a single low latency lookup This in turn allows us to overlap the access latencies of multiple nodes or equivalently to overlap the latency of one node with multiple iterations Our general framework combines jump pointer prefetching with chained prefetching which uses the pointers available in the original unmodified program We show that jump pointer prefetching and chained prefetching can be combined in different ways to create four prefetching idioms which we call queue jumping full jumping chain jumping and root jumping Since both jump pointer prefetching and chained prefetching can be implemented in either hardware or software each idiom can be instantiated in one of three implementations software hardware and cooperative The cooperative scheme handles jumppointer prefetching in software and chained prefetching in hardware Each idiom implementation combination has advantages and drawbacks that make it suitable for certain kinds of LDS traversals We study a set of pointer intensive benchmarks and attempt to isolate the program features that best guide idiom and implementation selection Our experiments show that software cooperative and hardware prefetching eliminate an average of 72 83 and 55 of the total memory stall time in these programs translating into speedups of 15 20 and 22 respectively This is a significant improvement over other known schemes This rest of the paper is organized as follows The next section presents our JPP framework and a benchmark characterization The three implementations are described in Section 3 and evaluated in Section 4 The last sections discuss related and future work and our conclusions 2 Jump pointer Prefetching Framework Our prefetching framework can be described in terms of two building blocks jump pointer prefetches and chained prefetches Jump pointers are pointers added to the program s data structures for prefetching purposes only We say that a jump pointer resides in a home node and points to a target node Jump pointer prefetches prefetch target nodes using the jump pointer at the home node For prefetching to succeed the target of a jump pointer must point to a node that is likely to be referenced some time after the corresponding home node Chained prefetches on the other hand do not require jump pointers they prefetch using the original pointers in the structure Each of these types of prefetch provides different benefits and has different associated performance costs Jump pointer prefetches can prefetch arbitrary LDS nodes hide arbitrary amounts of latency and allow otherwise serial prefetches to execute in parallel However jump pointers require
View Full Document
Unlocking...