Unformatted text preview:

Memory Forwarding Enabling Aggressive Layout Optimizations by Guaranteeing the Safety of Data Relocation Chi Keung Luk Department of Computer Science University of Toronto Toronto Canada M5S 3G4 luk eecg toronto edu Abstract By optimizing data layout at run time we can potentially enhance the performance of caches by actively creating spatial locality facilitating prefetching and avoiding cache conflicts and false sharing Unfortunately it is extremely difficult to guarantee that such optimizations are safe in practice on today s machines since accurately updating all pointers to an object requires perfect alias information which is well beyond the scope of the compiler for languages such as C To overcome this limitation we propose a technique called memory forwarding which effectively adds a new layer of indirection within the memory system whenever necessary to guarantee that data relocation is always safe Because actual forwarding rarely occurs it exists as a safety net the mechanism can be implemented as an exception in modern superscalar processors Our experimental results demonstrate that the aggressive layout optimizations enabled by memory forwarding can result in significant speedups more than twofold in some cases by reducing the number of cache misses improving the effectiveness of prefetching and conserving memory bandwidth 1 Introduction As the gap between processor and memory speeds continues to grow memory latency is becoming a key performance bottleneck In addition there is growing concern that the bandwidth to the offchip memory hierarchy may also limit performance 5 The primary mechanism for reducing both the off chip latency and bandwidth requirements is the on chip cache hierarchy While caches are an important step toward addressing these problems they face several well known limitations For example multi word cache lines can improve performance by prefetching useful data when spatial locality is abundant but they can also waste bandwidth and displace useful data when it is not Caches can suffer pathologically bad behavior due to either mapping conflicts or false sharing 21 34 Prefetching techniques 9 25 31 can potentially hide cache miss latency but only if they can predict access patterns ahead of time and only if there is sufficient memory bandwidth A brute force approach of simply making caches larger is not likely to solve these problems in part because application problem sizes are also increasing rapidly and also because cache sizes are constrained by the requirement of low access latency and by hardware resource limitations In addition recent studies 5 18 have shown that the effectiveness of caches is often low because a significant fraction of cached data is not reused before it is displaced from the cache Hence the first problem to address is managing existing caches more intelligently to improve their effectiveness Todd C Mowry Computer Science Department Carnegie Mellon University Pittsburgh PA 15213 tcm cs cmu edu Cache performance depends on two factors when data items are accessed and where they exist in the address space Therefore software based techniques for improving cache performance typically do one of two things they either restructure the computation or else they restructure the data layout The idea behind restructuring the computation is that given a fixed data layout we would like to manipulate the ordering of accesses such that multiple accesses to the same data item or cache line occur close together in time thereby enhancing locality 8 37 In contrast the idea behind optimizing the data layout is that given that a set of data items are accessed close together in time in the original computation we would like to actively arrange them in the address space such that i we create spatial locality by allocating them at contiguous addresses thereby enhancing the effectiveness of long cache lines and simplifying prefetch address generation ii we avoid cache conflicts by ensuring that they do not reside in separate lines which map into the same cache sets and iii we avoid false sharing by ensuring that items accessed by different processors fall within separate cache lines While both approaches have received considerable attention in the past our focus in this study is on facilitating data layout optimizations There are two possibilities for when we manipulate data layout The first approach which we call static placement is to assign an object to its optimized address when it is created 6 The second approach is to move an object perhaps more than once after it has been allocated we refer to this latter approach as data relocation or simply relocation The advantage of static placement is its simplicity The advantage of relocation however is that it can adapt to dynamic program behavior Previous studies have shown that relocation based optimizations such as copying 23 33 and clustering 11 can offer impressive performance gains In general relocation based data layout optimizations involve the following three steps 1 Guaranteeing Correctness Either the programmer or the compiler must prove that relocating the data will never break the program otherwise the optimization is unsafe 2 Estimating the Cost Benefit Tradeoff The potential optimization should only be performed if the performance benefit is expected to outweigh the overheads involved in relocating the data This estimation could be based on some combination of programmer knowledge static compiler analysis profiling feedback or run time information 3 Generating Relocation Code Additional code must be inserted to perform the actual data relocation at run time Despite the high performance potential of many relocationbased optimizations the key stumbling block which often prevents them from being used in practice is the first step i e guaranteeing correctness To safely move data we must guarantee that any future references to the object will find it at its new location The fundamental problem is that updating the precise set of pointers 1 to a given object requires perfect aliasing information related to that object In general computing such precise information is beyond the capabilities of the compiler 2 and is even quite difficult for the programmer for large programs In the face of uncertainty we must conservatively assume that relocating an object will break the program no matter how unlikely this may seem in reality and therefore the optimization cannot be performed There is one mechanism


View Full Document

CMU CS 15740 - Lecture

Documents in this Course
leecture

leecture

17 pages

Lecture

Lecture

9 pages

Lecture

Lecture

36 pages

Lecture

Lecture

9 pages

Lecture

Lecture

13 pages

lecture

lecture

25 pages

lect17

lect17

7 pages

Lecture

Lecture

65 pages

Lecture

Lecture

28 pages

lect07

lect07

24 pages

lect07

lect07

12 pages

lect03

lect03

3 pages

lecture

lecture

11 pages

lecture

lecture

20 pages

lecture

lecture

11 pages

Lecture

Lecture

9 pages

Lecture

Lecture

10 pages

Lecture

Lecture

22 pages

Lecture

Lecture

28 pages

Lecture

Lecture

18 pages

lecture

lecture

63 pages

lecture

lecture

13 pages

Lecture

Lecture

36 pages

Lecture

Lecture

18 pages

Lecture

Lecture

17 pages

lecture

lecture

34 pages

lecture

lecture

47 pages

lecture

lecture

7 pages

Lecture

Lecture

18 pages

Lecture

Lecture

7 pages

Lecture

Lecture

21 pages

Lecture

Lecture

10 pages

Lecture

Lecture

39 pages

Lecture

Lecture

11 pages

lect04

lect04

40 pages

Load more
Loading Unlocking...
Login

Join to view Lecture and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Lecture and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?