
MATCH: Memory Address Trace CacHe
Rice University – ELEC 525 Final Report
Noah Deneau, Michael Haag, David Leal, and Arthur Nieuwoudt

Abstract — In this paper we explore methods for hiding memory latency. We analyzed the memory traces of several applications and determined that patterns exist with little spatial locality; such patterns perform poorly in the L1 data cache. With increasing memory latency, it is important to capture and store these patterns in order to feed data to wide-issue superscalar processors. To exploit these patterns, we propose a novel architecture, the Memory Address Trace CacHe (MATCH), that captures and stores recurring memory access patterns. The architecture is composed of three primary structures located off the critical path: the Pattern Generator, the Pattern Buffer, and the MATCH Cache. Using SimpleScalar, we simulated five SPEC2000 benchmarks on our architecture to gauge performance relative to standard cache structures. Our results indicate that MATCH significantly improves performance on Ammp, with speedups ranging from 24 to 43 percent. These results demonstrate that MATCH has potential but requires further refinement.

Index Terms — memory traces, trace cache, MATCH, prefetching

I. INTRODUCTION

As companies continue to introduce higher-speed processors into the market, the disparity between transistor switching times and memory access latency increases. To continue the performance improvement defined by Moore’s Law, one must analyze and improve the system as a whole. To this end, steps must be taken to narrow the gap between memory latencies and processor frequencies.

The relative increase in memory access times plays a key role in several major functions of the microprocessor. In order to feed the wide-issue superscalar implementations prevalent today, computer architects must deal with memory latency in both the instruction and data realms. In short, the processor must be able to fetch, decode, and issue enough instructions and access the appropriate data every cycle to utilize all of its available functional units.

In order to combat increasing instruction memory latency, researchers have focused on improving the cache structure of the microprocessor in an effort to hide some of the memory latency. Deeper and more complex cache structures have been implemented and proposed over the past few years. In addition, large issue buffers, more physical registers, and speculative fetch engines have all been examined in order to supply the execution core with the necessary continuous stream of instructions and the data on which those instructions operate.

After effectively creating a continuous supply of instructions to the core, one must next make sure that the instructions can access the necessary data in order to efficiently utilize the processor’s functional units. We continue to see that, while speculation efforts have improved dramatically, the large data memory latency is increasingly the major bottleneck for modern computer systems. While the memory latencies at the supply side of the execution core have been reduced through a wide variety of new techniques, the latency to main memory during the execution of instructions has remained mainly dependent on brute-force improvements, such as improved DRAMs with faster buses and higher clock rates, or caches whose size is limited by increasing wire delay.

The remainder of the paper is organized as follows. In Section II we provide more insight into the motivation behind our memory address trace cache and discuss our hypothesis.
In Section III we present the architecture for MATCH. Section IV focuses on our experimental methods, describing the simulations performed and the critical parameters examined. Section V presents our results, including the performance and analysis of our simulations. We conclude in Section VI with an overview of our key results along with a discussion of the limitations of our study and possible future work.

II. MOTIVATION AND HYPOTHESIS

A. Motivation

To continue enhancing processor performance, the supply of instructions as well as data must scale with faster clocks and wider issue widths. Several studies have focused on increasing the number of instructions supplied to the processor by using alternative fetch and issue policies [1] or trace caches [2]. Data prefetching schemes related to our concept, implemented in hardware [4] and in the compiler [5], have also been investigated. Traditionally, these researchers approached the problem of memory latency by implementing newer and more optimized cache hierarchies. However, these methods fail to address applications with poor spatial locality. In addition, as caches grow larger, longer blocks of contiguous data from memory are sent across the memory bus and occupy space in the cache, even when only a fraction of this data is required. These types of accesses waste cache space and memory bus capacity. In recent years, we have seen that the spatial and temporal locality of programs will no longer be able to hide this problem [3].

Coupled with the above difficulty is the fact that the programs running on today’s machines are changing rapidly. The emergence of more parallelized applications requires more noncontiguous accesses to memory. In these situations, reading larger blocks into the cache will waste precious system resources while providing only marginal benefits. As systems adapt to these more parallelized operations, we will see larger numbers of functional units added to the execution core and more resources devoted to feeding these units with instructions. We feel that the more pressing issue is not the functional units, which are relatively cheap, nor the supplying of instructions, which has seen significant improvements over the past few years, but rather a fast and efficient method to feed data into the execution core from main memory.

In order to reduce some of the memory latencies, we propose utilizing the concept of a trace cache [2] but adapting it to provide traces of data memory access patterns. To be effective, however, we need to adequately understand what patterns, if any, are available in the memory accesses of a wide range of applications. To further motivate the ideas for our project, we created and analyzed memory traces of five
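A minimal, self-contained sketch of this kind of trace analysis is shown below. It is an illustration only, not the tooling used in this study; it assumes a trace containing one hexadecimal data address per line on standard input, and the table size and the number of strides reported are arbitrary choices. The sketch tallies the strides between consecutive accesses: a trace dominated by small constant strides exhibits the spatial locality that a conventional L1 data cache exploits, while a trace dominated by large or irregular strides has the poor spatial locality described above.

    /* Illustrative sketch only: profile the strides in a data-address trace.
     * Assumed input format: one hexadecimal address per line on stdin. */
    #include <stdio.h>

    #define MAX_STRIDES 1024   /* arbitrary table size for this sketch */
    #define TOP 8              /* number of most frequent strides to report */

    struct stride_count { long stride; unsigned long count; };

    int main(void)
    {
        struct stride_count table[MAX_STRIDES];
        int used = 0, have_prev = 0;
        unsigned long long addr, prev = 0;

        while (scanf("%llx", &addr) == 1) {
            if (have_prev) {
                long stride = (long)(addr - prev);
                int i;
                /* Count this stride, adding a new table entry if needed. */
                for (i = 0; i < used; i++)
                    if (table[i].stride == stride) { table[i].count++; break; }
                if (i == used && used < MAX_STRIDES) {
                    table[used].stride = stride;
                    table[used].count = 1;
                    used++;
                }
            }
            prev = addr;
            have_prev = 1;
        }

        /* Report the TOP most frequent strides (simple selection sort). */
        for (int n = 0; n < TOP && n < used; n++) {
            int best = n;
            for (int i = n + 1; i < used; i++)
                if (table[i].count > table[best].count)
                    best = i;
            struct stride_count tmp = table[n];
            table[n] = table[best];
            table[best] = tmp;
            printf("stride %+ld bytes: %lu occurrences\n",
                   table[n].stride, table[n].count);
        }
        return 0;
    }

A profiler of this kind distinguishes, for example, dense streaming accesses (a stride equal to the element size) from the scattered, pointer-chasing accesses that motivate a structure such as MATCH.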

