MATCH: Memory Address Trace CacHe
Rice University ELEC 525 Final Report
Noah Deneau, Michael Haag, David Leal, and Arthur Nieuwoudt

Abstract: In this paper we explore methods for hiding memory latency. We analyzed the memory traces of several applications and determined that patterns exist with little spatial locality; these patterns will therefore perform poorly in the L1 data cache. With increasing memory latency, it is important to capture and store these patterns in order to feed data to wide-issue superscalar processors. To exploit these patterns, we propose a novel architecture, the Memory Address Trace CacHe (MATCH), that captures and stores recurring memory access patterns. The architecture is composed of three primary structures located off the critical path: the Pattern Generator, the Pattern Buffer, and the MATCH Cache. Using SimpleScalar, we simulated five SPEC2000 benchmarks on our architecture to gauge performance relative to standard cache structures. Our results indicate that MATCH significantly improves performance on Ammp, with speedups ranging from 24 to 43 percent. These results demonstrate that MATCH has potential but requires further refinement.

Index Terms: memory traces, trace cache, MATCH, prefetching

I. INTRODUCTION

As companies continue to introduce higher-speed processors into the market, the disparity between transistor switching times and memory access latency increases. To continue the performance improvement defined by Moore's Law, one must analyze and improve the system as a whole in order to obtain the desired result. To this end, steps must be taken to narrow the gap between memory latencies and processor frequencies.

The relative increase in memory access times plays a key role in several major functions of the microprocessor. In order to feed the wide-issue superscalar implementations prevalent today, computer architects must deal with memory latency in both the instruction and data realms. In short, the processor must be able to fetch, decode, and issue enough instructions, and access the appropriate data, every cycle to utilize all of its available functional units.

In order to combat increasing instruction memory latency, researchers have focused on improving the cache structure of the microprocessor in an effort to hide some of the memory latency. Deeper and more complex cache structures have been implemented and proposed over the past few years. In addition, large issue buffers, more physical registers, and speculative fetch engines have all been examined in order to supply the execution core with the necessary continuous stream of instructions and the data on which the instructions operate.

After effectively creating a continuous supply of instructions to the core, one must next make sure that the instructions can access the necessary data in order to efficiently utilize the processor's functional units. We continue to see that while speculation efforts have improved dramatically, the large data memory latency is increasingly the major bottleneck for modern computer systems. While the memory latencies at the supply side of the execution core have been decreased through a wide variety of new techniques, the latency to main memory during the execution of instructions has remained mainly dependent on brute-force improvements, such as improved DRAMs that provide faster buses and higher-clocked chips, or caches that are limited in size by increasing wire delay.

The remainder of the paper is organized as follows. In Section II we provide more insight into the motivation behind our memory address trace cache and discuss our hypothesis. In Section III we present the architecture for MATCH. Section IV focuses on our experimental methods: information is given on the simulations performed and the critical parameters examined. Section V presents our results, including the performance and analysis of our simulations. We conclude in Section VI with an overview of our key results, along with a discussion of the limitations of our study and possible future work.
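To make the idea concrete before the architecture is presented in Section III, the following C sketch illustrates the kind of mechanism the abstract describes: a Pattern Generator that accumulates L1 data-cache miss addresses, a Pattern Buffer that holds the trace under construction, and a MATCH Cache that replays a stored trace as prefetches when its head address recurs. This is a minimal illustration, not the paper's implementation; the structure sizes, the direct-mapped indexing, and all names such as match_observe_miss are assumptions.

```c
/* Illustrative sketch (not the paper's design) of the MATCH idea: record
 * sequences of L1 data-cache miss addresses, and when the head address of
 * a previously captured sequence recurs, replay the rest as prefetches. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define TRACE_LEN   8     /* assumed length of a captured address trace   */
#define MATCH_SETS  64    /* assumed number of entries in the MATCH cache */

typedef struct {
    int      valid;
    uint64_t addr[TRACE_LEN];  /* head address plus the addresses after it */
} match_entry_t;

static match_entry_t match_cache[MATCH_SETS];  /* "MATCH Cache"    */
static uint64_t      pattern_buf[TRACE_LEN];   /* "Pattern Buffer" */
static int           buf_fill = 0;

static unsigned index_of(uint64_t addr) { return (addr >> 3) % MATCH_SETS; }

/* Called off the critical path on every L1 data-cache miss. */
void match_observe_miss(uint64_t miss_addr)
{
    /* Pattern Generator: accumulate misses into the Pattern Buffer. */
    pattern_buf[buf_fill++] = miss_addr;
    if (buf_fill == TRACE_LEN) {
        /* Commit the completed trace, indexed by its head address. */
        match_entry_t *slot = &match_cache[index_of(pattern_buf[0])];
        memcpy(slot->addr, pattern_buf, sizeof pattern_buf);
        slot->valid = 1;
        buf_fill = 0;
    }

    /* MATCH Cache lookup: if this miss heads a stored trace, replay it. */
    match_entry_t *e = &match_cache[index_of(miss_addr)];
    if (e->valid && e->addr[0] == miss_addr) {
        for (int i = 1; i < TRACE_LEN; i++)
            printf("prefetch 0x%llx\n", (unsigned long long)e->addr[i]);
    }
}
```

In hardware, the replay step would hand the remaining addresses to the memory system as prefetch requests rather than printing them, and keeping these structures off the critical path, as the paper proposes, lets this logic run in parallel with normal cache operation.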
II. MOTIVATION AND HYPOTHESIS

A. Motivation

To continue enhancing processor performance, the supply of instructions as well as data must scale with faster clocks and wider issue widths. Several studies have focused on increasing the number of instructions supplied to the processor by using alternative fetch and issue policies [1] or trace caches [2]. Data prefetching schemes related to our concept, implemented in hardware [4] and in the compiler [5], have also been investigated. Traditionally, these researchers approached the problem of memory latency by implementing newer and more optimized cache hierarchies. However, these methods fail to address applications with poor spatial locality. In addition, as caches grow larger, longer blocks of contiguous data from memory are sent across the memory bus and occupy space in the cache even when only a fraction of this data is required. These types of accesses waste cache space and memory bus capacity. In recent years, we have seen that the spatial and temporal locality of programs will no longer be able to hide this problem [3].

Coupled with the above difficulty is the fact that the programs running on today's machines are changing rapidly. The emergence of more parallelized applications requires more noncontiguous accesses to memory. In these situations, reading in larger blocks of data will waste precious system resources while providing only marginal benefits. As systems adapt to these more parallelized operations, we will see a larger number of functional units added to the execution core and more resources devoted to feeding these units with instructions. We feel that the more pressing issue is not the functional units, which are relatively cheap, nor the supply of instructions, which has seen significant improvements over the past few years, but rather a fast and efficient method of feeding data from main memory into the execution core.

In order to reduce some of the memory latencies, we propose utilizing the concept of a trace cache [2] but adapting it to provide traces of data memory access patterns. In order to be effective, however, we need to adequately understand what patterns, if any, are available in the memory accesses of a wide range of applications. To further motivate the ideas for our project, we created and analyzed memory traces of five applications from the SPEC benchmark suite: Ammp, Vpr, Mcf, Equake, and Parser. Each application ...

TABLE I: PATTERN STATISTICS FROM MEMORY TRACES (only the caption and the partial column headers "App", "Ave. Pattern Length", "Ave. Pattern ..." survive in this extract)
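As a rough picture of the kind of trace analysis summarized in Table I, the sketch below scans a recorded address trace for fixed-length windows that recur verbatim later in the trace; such recurrences indicate repeating access patterns even when the addresses have little spatial locality. This is a hypothetical reconstruction, not the authors' analysis tool, and the window size and the example trace are invented for illustration.

```c
/* Hypothetical sketch of a trace analysis in the spirit of Table I: count
 * how often a fixed-length window of addresses recurs verbatim later in
 * the trace, indicating a repeating pattern regardless of spatial locality. */
#include <stdint.h>
#include <stdio.h>

#define WINDOW 4   /* assumed pattern window, in addresses */

/* Returns the number of windows that recur at least once later in trace[]. */
size_t count_recurring_windows(const uint64_t *trace, size_t n)
{
    size_t recurring = 0;
    for (size_t i = 0; i + WINDOW <= n; i++) {
        for (size_t j = i + 1; j + WINDOW <= n; j++) {
            size_t k = 0;
            while (k < WINDOW && trace[i + k] == trace[j + k]) k++;
            if (k == WINDOW) { recurring++; break; }  /* window repeats */
        }
    }
    return recurring;
}

int main(void)
{
    /* Tiny made-up trace: a noncontiguous 4-address pattern that repeats. */
    uint64_t trace[] = { 0x1000, 0x9040, 0x2210, 0x7008,
                         0x5550,
                         0x1000, 0x9040, 0x2210, 0x7008 };
    size_t n = sizeof trace / sizeof trace[0];
    printf("recurring windows: %zu of %zu\n",
           count_recurring_windows(trace, n), n - WINDOW + 1);
    return 0;
}
```

An analysis along these lines, extended to report the average length of the recurring patterns, would produce statistics of the kind Table I's surviving column headers suggest.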

