UW-Madison ECE/CS 752 - Identifying Critical Loads and a study of their Access Partners

Unformatted text preview:

IDENTIFYING CRITICAL LOADS AND A STUDY OF THEIR ACCESS PATTERNSCS752 PROJECT REPORTANURAG GUPTA, VISHAL KATHURIA{ anurag , vishal }@cs.wisc.eduComputer Sciences DepartmentUniversity of Wisconsin, MadisonMadison – 53706, WIDecember 15, 1998ABSTRACT1. INTRODUCTION2. BACKGROUND3. IDENTIFICATION OF THE CRITICAL LOADS3.4 Maturation of Dependency List4. IMPLEMENTATION4.1 Average Address Difference4.2 Average Instruction Gap5. RESULTS AND OBSERVATIONS6. EXPLORING THE DESIGN SPACE6.1 Pin the Critical Data in CacheAdvantagesDoesn't require too many extra transistors to implement.6.2 Store the critical data in a separate bufferSome of the issues involved in this scheme are:6.2.2 Buffer Pollution7. SUMMARY8. FUTURE WORKIDENTIFYING CRITICAL LOADS ANDA STUDY OF THEIR ACCESSPATTERNS CS752 PROJECT REPORTANURAG GUPTA, VISHAL KATHURIA{ anurag , vishal }@cs.wisc.eduComputer Sciences DepartmentUniversity of Wisconsin, MadisonMadison – 53706, WIDecember 15, 1998ABSTRACTThe contribution of this report is an analysis of the access patterns of the loads that are critical to theperformance of an out of order processor (henceforth called the Critical Loads). Our measurements reveal that30%-40% (an in some cases, upto 98%) of the critical load instructions access the same memory location theyaccessed the last time they were executed. Moreover, in more than 80% of the cases, the successive occurrencesof a critical load instruction in the dynamic instruction stream are atleast 512 instructions apart. On the basisof above analysis, we present a memory management technique to decrease the latency of critical loads andspeedup the overall performance of an out of order processor.1. INTRODUCTIONThere has been a tremendous improvement in the processor performance over the last couple ofdecades. Today, most processors are capable of extremely high clock rates with complete out of order executionresulting in high performance. But on the other hand, improvement in the main memory performance has notkept pace with the processor performance. For example, processor performance has been increasing at almost50% per year whereas memory access times have been improving at only 5-10% per year only [2]. As a result,there is a wide gap between the memory and processor performances. A balance should be maintained betweenthe two, which unfortunately has not happened over the years. As a result, the memory access time due to acache miss is likely to become a major bottleneck in the future.Looking at the other face of the coin, as mentioned earlier, most modern processors are capable of outof order execution. They can buffer the instructions that are waiting for their operands from memory. So, if aninstruction is waiting for its data from the main memory, then the processor can execute other independentinstructions and still achieve high performance. So, why bother to speed up the critical instructions?On close examination, we see that although the processor can use dynamic scheduling and buffering toexecute other independent instructions, it is not always possible to find independent instructions to execute. Thiswill result in the processor to stall, thereby reducing performance. Secondly, although the processor may employsophisticated branch prediction mechanisms and allows speculative execution to execute instructions out oforder, it must always commit the instructions in order. Thus, the finite resources of the processors may alsocause it to stall. For example, the reorder buffer and the load/store queues have a finite size, if they are filled upwaiting for an instruction at the head of the list, then the processor must stall. Thirdly, although most processors2employ sophisticated branch prediction schemes, it mispredicts many times too. If a critical load is feeding amispredicted branch, then it will cause the processor to go down the wrong execution path. This will have aneffect that although the processor is kept busy, no useful work is done.Thus, it becomes important to identify such critical loads and provide mechanisms to speed up theirexecution so that they don’t have to wait too long for their operands from memory, thereby allowing instructionsthat depend on them for their operands not to stall and improve processor performance. We present a study of the access patterns of these critical loads (including a study of their spatial andtemporal locality). We also present a hardware scheme that could be employed to decrease the latency of thesecritical loads. The study of the access patterns of these Critical Loads presents some interesting results, whichwe discuss and utilize to propose the hardware schemes.The report is organized as follows. In Section 2, we will discuss what we mean by Critical Loads andhow we can categorize loads as critical. In Section 3, we discuss the mechanism we have employed to identifythese critical loads. Section 4 discusses the implementation details. Section 5 illustrates the results andobservations. We interpret the results obtained and make some observations about the spatial and temporallocality of the critical loads. In Section 6 we propose a hardware scheme based on the study of the accesspatterns of these critical loads. In Section 7 we summarize our work done and point out some of the limitationsof our study and in Section 8, we conclude with a mention of some of the future directions.2. BACKGROUNDIn the execution stream of instructions, there are certain instructions on which other instructions aredependent for their data values. The data values from memory for these instructions should be fetched asquickly as possible so that the instruction does not have to wait too long. These instructions are Critical and theprocessor has a Low Latency Tolerance for them.We categorize loads as Critical which fall in the following two categories [1]:1. The loads on which the branches, especially the ones which are mispredicted depend, and2. The loads on which a large number of succeeding instructions depend.Thus, all loads in an execution stream of a program can be categorized into the above two categories. Ifa branch depends on the operand from the load instruction then it must be computed as soon as possible. Also, if3there are a large number of instructions that directly or indirectly depend on this load, then this load becomescritical and needs to be completed quickly otherwise the processor might not be able to tolerate its


View Full Document

UW-Madison ECE/CS 752 - Identifying Critical Loads and a study of their Access Partners

Download Identifying Critical Loads and a study of their Access Partners
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Identifying Critical Loads and a study of their Access Partners and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Identifying Critical Loads and a study of their Access Partners 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?