Unformatted text preview:

Locality vs Criticality Srikanth T Srinivasan Roy Dz ching Ju Alvin R Lebeck Department of Computer Science Microprocessor Research Labs Duke University Intel Corporation sri alvy cs duke edu roy ju chris wilkerson intel com Abstract move from the processor to main memory Caches work by exploiting spatial and temporal locality in a program s access pattern Spatial locality is exploited by fetching a region of memory called a cache b l o c k rather than just the accessed data Caches exploit temporal locality by retaining recently accessed cache blocks Most cache management schemes try to exploit locality to increase the fraction of memory accesses satisfied by the cache i e cache hit ratio Although increasing the overall number of cache hits is usually desirable recent research 19 shows that not all memory accesses are equal In dynamically scheduled processors the latency of some memory load operations can have a much larger influence on overall performance than other loads Therefore it may be possible to improve overall performance by decreasing the latency of these critical loads at the expense of increased latency for non critical loads The goal of this paper is to determine if criticality is a strong enough program property to warrant a change in memory hierarchy management techniques for practical implementations Specifically we investigate if practical criticality based approaches can equal or surpass the performance of existing locality based techniques The previous analysis 19 relied on a sophisticated simulator with rollback to determine load criticality A practical implementation requires hardware support for on line computation of load criticality This must be augmented by realistic load latency reduction techniques that can exploit this information In this paper we propose a hardware implementation to determine load criticality and validate that there is potential to improve performance over a traditional cache hierarchy We then compare locality based cache organization and prefetching techniques with corresponding criticalitybased schemes Our criticality based cache organization scheme uses a critical cache that is functionally similar to a conventional victim cache 10 but holds only blocks that were touched by a critical load We also examine multi line prefetching 9 12 based on both locality and criticality We use Simplescalar 1 to evaluate the above techniques for a set of SPEC 2000 integer and Olden 16 benchmarks Current memory hierarchies exploit locality of references to reduce load latency and thereby improve processor performance Locality based schemes aim at reducing the number of cache misses and tend to ignore the nature of misses This leads to a potential mis match between load latency requirements and latencies realized using a traditional memory system To bridge this gap we partition loads as critical and non critical A load that needs to complete early to prevent processor stalls is classified as critical while a load that can tolerate a long latency is considered non critical In this paper we investigate if it is worth violating locality to exploit information on criticality to improve processor performance We present a dynamic critical load classification scheme and show that 40 performance improvements are possible on average if all critical loads are guaranteed to hit in the LI cache We then compare the two properties locality and criticality in the context of several cache organization and prefetching schemes We find that the working set of critical loads is large and hence practical cache organization schemes based on criticality are unable to reduce the critical load miss ratios enough to produce performance gains Although criticality based prefetching can help for some resource constrained programs its benefit over locality based prefetching is small and may not be worth the added complexity 1 Chris Wilkerson Introduction The wide spread use of cache memory hierarchies 17 is a testament to their success as a cost effective solution to help alleviate the growing disparity between processor cycle time and DRAM access time By using a small fast but expensive SRAM cache to satisfy most accesses cache hierarchies can provide access times close to processor speeds while cheaper slower DRAM provides large capacity to avoid I O delays Increases in transistor budgets enable most systems today to employ two or three levels in a cache hierarchy with increasing capacity and access time as you 132 1063 6897 01 10 00 2001 IEEE 2 1 The primary result from our simulations is that load criticality is not a sufficiently strong property to warrant cache management changes Although criticality can be used as a filter for making prefetching decisions it does not appear to be worth the added complexity In particular our results reveal the following Our previous work on load latency uses sophisticated simulation with rollback to determine how long a load could remain outstanding without degrading performance relative to an ideal memory system where all loads hit in the L1 cache While this is a reasonable approach for limit studies it is unacceptable for practical use However many of the insights from that study can be used to develop a practical implementation for determining load criticality The previous study shows that load criticality is determined to a large extent by the chain of instructions dependent on the load In particular the type of dependent instructions e g mispredicted branch indicates if the load is likely to degrade processor performance Similarly the number of instructions in its dependence chain indicates if a long latency load will stress the processor s finite resources If most instructions issued after a load are dependent on the load the processor may stall unable to find independent instructions to execute in subsequent cycles From these observations we can construct hardware to determine load criticality by monitoring the type of instructions in a load s dependence chain and counting the number of independent instructions issued in cycles immediately after the load In our design a load is classified as critical if it satisfies any of the following criteria 1 The load feeds into a mispredicted branch 2 The load feeds into another load that incurs an LI cache miss or 3 The number of independent instructions issued in an N cycle window following the load is below a threshold Our hardware design is based around the Register Update Unit RUU Load


View Full Document

CMU CS 15740 - Locality vs. Criticality

Documents in this Course
leecture

leecture

17 pages

Lecture

Lecture

9 pages

Lecture

Lecture

36 pages

Lecture

Lecture

9 pages

Lecture

Lecture

13 pages

lecture

lecture

25 pages

lect17

lect17

7 pages

Lecture

Lecture

65 pages

Lecture

Lecture

28 pages

lect07

lect07

24 pages

lect07

lect07

12 pages

lect03

lect03

3 pages

lecture

lecture

11 pages

lecture

lecture

20 pages

lecture

lecture

11 pages

Lecture

Lecture

9 pages

Lecture

Lecture

10 pages

Lecture

Lecture

22 pages

Lecture

Lecture

28 pages

Lecture

Lecture

18 pages

lecture

lecture

63 pages

lecture

lecture

13 pages

Lecture

Lecture

36 pages

Lecture

Lecture

18 pages

Lecture

Lecture

17 pages

Lecture

Lecture

12 pages

lecture

lecture

34 pages

lecture

lecture

47 pages

lecture

lecture

7 pages

Lecture

Lecture

18 pages

Lecture

Lecture

7 pages

Lecture

Lecture

21 pages

Lecture

Lecture

10 pages

Lecture

Lecture

39 pages

Lecture

Lecture

11 pages

lect04

lect04

40 pages

Load more
Loading Unlocking...
Login

Join to view Locality vs. Criticality and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Locality vs. Criticality and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?