Locality vs. Criticality

Srikanth T. Srinivasan+    Roy Dz-ching Ju*    Alvin R. Lebeck+    Chris Wilkerson*

+Department of Computer Science        *Microprocessor Research Labs
Duke University                        Intel Corporation
{sri, alvy}@cs.duke.edu                {roy.ju, chris.wilkerson}@intel.com

Abstract

Current memory hierarchies exploit locality of references to reduce load latency and thereby improve processor performance. Locality-based schemes aim at reducing the number of cache misses and tend to ignore the nature of misses. This leads to a potential mismatch between load latency requirements and latencies realized using a traditional memory system. To bridge this gap, we partition loads as critical and non-critical. A load that needs to complete early to prevent processor stalls is classified as critical, while a load that can tolerate a long latency is considered non-critical.

In this paper, we investigate if it is worth violating locality to exploit information on criticality to improve processor performance. We present a dynamic critical load classification scheme and show that 40% performance improvements are possible on average, if all critical loads are guaranteed to hit in the L1 cache. We then compare the two properties, locality and criticality, in the context of several cache organization and prefetching schemes. We find that the working set of critical loads is large, and hence practical cache organization schemes based on criticality are unable to reduce the critical load miss ratios enough to produce performance gains. Although criticality-based prefetching can help for some resource-constrained programs, its benefit over locality-based prefetching is small and may not be worth the added complexity.

1 Introduction

The widespread use of cache memory hierarchies [17] is a testament to their success as a cost-effective solution to help alleviate the growing disparity between processor cycle time and DRAM access time.
By using a small, fast, but expensive SRAM cache to satisfy most accesses, cache hierarchies can provide access times close to processor speeds, while cheaper, slower DRAM provides large capacity to avoid I/O delays. Increases in transistor budgets enable most systems today to employ two or three levels in a cache hierarchy, with capacity and access time increasing as one moves from the processor toward main memory.

Caches work by exploiting spatial and temporal locality in a program's access pattern. Spatial locality is exploited by fetching a region of memory, called a cache block, rather than just the accessed data. Caches exploit temporal locality by retaining recently accessed cache blocks. Most cache management schemes try to exploit locality to increase the fraction of memory accesses satisfied by the cache (i.e., the cache hit ratio).

Although increasing the overall number of cache hits is usually desirable, recent research [19] shows that not all memory accesses are equal. In dynamically scheduled processors, the latency of some memory load operations can have a much larger influence on overall performance than other loads. Therefore, it may be possible to improve overall performance by decreasing the latency of these critical loads at the expense of increased latency for non-critical loads.

The goal of this paper is to determine if criticality is a strong enough program property to warrant a change in memory hierarchy management techniques for practical implementations. Specifically, we investigate if practical criticality-based approaches can equal or surpass the performance of existing locality-based techniques. The previous analysis [19] relied on a sophisticated simulator with rollback to determine load criticality. A practical implementation requires hardware support for on-line computation of load criticality. This must be augmented by realistic load latency reduction techniques that can exploit this information.
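For intuition, one plausible minimal model of on-line criticality detection is a PC-indexed table of saturating counters, trained whenever a retiring load has stalled the pipeline (e.g., it reached the head of the reorder buffer before its data returned). The class name, table size, and threshold below are illustrative assumptions, not the hardware design proposed in this paper.

```python
# Sketch of an on-line criticality predictor: a small table of 2-bit
# saturating counters indexed by load PC. (Names, sizes, and the stall
# heuristic are assumptions for illustration only.)

class CriticalityTable:
    def __init__(self, entries=1024, threshold=2):
        self.entries = entries
        self.threshold = threshold
        self.counters = [0] * entries     # one saturating counter per slot

    def _index(self, pc):
        return (pc >> 2) % self.entries   # drop byte offset, map PC to a slot

    def train(self, pc, caused_stall):
        """Update at load retirement: bump the counter if this load
        stalled the processor, decay it otherwise."""
        i = self._index(pc)
        if caused_stall:
            self.counters[i] = min(self.counters[i] + 1, 3)
        elif self.counters[i] > 0:
            self.counters[i] -= 1

    def is_critical(self, pc):
        return self.counters[self._index(pc)] >= self.threshold

table = CriticalityTable()
for _ in range(3):
    table.train(0x4000, caused_stall=True)   # this load repeatedly stalls
table.train(0x4008, caused_stall=False)      # this load never stalls
print(table.is_critical(0x4000))  # True
print(table.is_critical(0x4008))  # False
```

A structure of roughly this shape could drive criticality-aware cache or prefetch decisions with modest hardware cost, since it needs only a counter update at retirement and a table read at issue.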
In this paper, we propose a hardware implementation to determine load criticality and validate that there is potential to improve performance over a traditional cache hierarchy. We then compare locality-based cache organization and prefetching techniques with corresponding criticality-based schemes. Our criticality-based cache organization scheme uses a critical cache that is functionally similar to a conventional victim cache [10], but holds only blocks that were touched by a critical load. We also examine multi-line prefetching [9, 12] based on both locality and criticality. We use SimpleScalar [1] to evaluate the above techniques for a set of SPEC 2000 integer and Olden [16] benchmarks.

1063-6897/01 $10.00 © 2001 IEEE

The primary result from our simulations is that load criticality is not a sufficiently strong property to warrant cache management changes. Although criticality can be used as a filter for making prefetching decisions, it does not appear to be worth the added complexity. In particular, our results reveal the following:

- There is potential to exploit criticality. If all the hardware-identified critical loads could be satisfied by the L1 cache, it would produce an average 40% improvement over a traditional memory hierarchy. This compares to an average 12% improvement if an equal percentage of loads are randomly chosen to hit in the L1 cache.

- Managing cache content based on locality outperforms criticality-based techniques. The working set of critical loads is large, and because of spatial locality between non-critical and critical loads, a locality-based cache provides lower critical load miss ratios than a criticality-based cache.

- Criticality-based prefetching into the L2 cache achieves an average speedup of 4%, compared to 2% for locality-based prefetching, across several benchmarks with resource constraint problems.
However, if resource constraints are not a problem, the most aggressive locality-based prefetching scheme does best.

The remainder of this paper is organized as follows. Section 2 provides background on load criticality, presents our hardware technique for determining load criticality, and evaluates the potential for improved performance. We evaluate cache organization schemes in Section 3, and Section 4 investigates prefetching.
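As a concrete illustration of the critical cache idea mentioned earlier, the sketch below models a small fully associative buffer that, unlike a victim cache, admits a block only when a critical load touches it. The sizes and policies here (8 blocks, 64-byte blocks, LRU replacement) are illustrative assumptions, not the configurations evaluated in this paper.

```python
# Sketch of a "critical cache": victim-cache-like, but filled only with
# blocks touched by critical loads. (Sizes and policies are assumptions.)
from collections import OrderedDict

BLOCK = 64  # bytes per cache block

class CriticalCache:
    def __init__(self, n_blocks=8):
        self.n_blocks = n_blocks
        self.blocks = OrderedDict()       # block tag -> present, in LRU order

    def lookup(self, addr):
        tag = addr // BLOCK
        if tag in self.blocks:
            self.blocks.move_to_end(tag)  # refresh LRU position on a hit
            return True
        return False

    def insert(self, addr, load_is_critical):
        # Unlike a victim cache, admit a block only if a critical load
        # touched it; blocks touched only by non-critical loads are filtered.
        if not load_is_critical:
            return
        tag = addr // BLOCK
        self.blocks[tag] = True
        self.blocks.move_to_end(tag)
        if len(self.blocks) > self.n_blocks:
            self.blocks.popitem(last=False)  # evict the LRU block

cc = CriticalCache()
cc.insert(0x1000, load_is_critical=True)
cc.insert(0x2000, load_is_critical=False)   # filtered out: not critical
print(cc.lookup(0x1000))  # True
print(cc.lookup(0x2000))  # False
```

The admission filter is the only difference from a victim cache; the paper's finding that critical working sets are large suggests why such a small filtered buffer struggles to reduce critical-load miss ratios in practice.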