Rice ELEC 525 - Towards a More Efficient Trace Cache


Abstract— If the trace cache is not large enough to contain all of the basic blocks of a running application, judicious hit and replacement logic becomes very important. This report proposes a weight-based technique for selecting the victim line in the trace cache during replacement. It also presents judicious line-fill buffer logic that we found to decrease redundancy in the trace cache. We studied performance by simulating these techniques on SimpleScalar. For the SimpleScalar test benchmark applications, a trace cache with the proposed replacement and line-fill buffer logic provided 1-5% better IPC than a trace cache with Least Recently Used (LRU) replacement logic.

1 INTRODUCTION

Superscalar processors are now capable of executing a large number of instructions per cycle. To benefit fully from instruction-level parallelism (ILP) techniques, we must prevent instruction fetch performance from becoming a bottleneck. A number of factors limit the capabilities of current instruction fetch mechanisms, including instruction cache hit rate, branch prediction accuracy, branch throughput, non-contiguous instruction alignment, and fetch unit latency [1]. One possible solution addressing some of these issues is the trace cache proposed by Rotenberg, Bennett, and Smith [1]. The trace cache stores the dynamic instruction stream, making otherwise non-contiguous instructions appear contiguous. It operates by storing up to n instructions per trace cache line and using m branch predictions per cycle. A line is filled when a new address is encountered and typically contains m branch outcomes. The line is sent from the trace cache to the decoder when the same starting address is encountered and the branch predictions match. An implementation of the trace cache fetch mechanism can be seen in Figure 1.
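The lookup just described can be sketched as follows. This is a minimal illustration, not the paper's implementation; `TraceLine` and `trace_cache_lookup` are hypothetical names, and a line hits only when its start address matches the fetch address and its stored branch outcomes agree with the current predictions.

```python
# Minimal sketch of trace-cache lookup (hypothetical structures; n and m
# follow the paper's notation: up to n instructions and m branch outcomes
# per line).
class TraceLine:
    def __init__(self, start_pc, instructions, branch_outcomes):
        self.start_pc = start_pc                # fetch address that starts the trace
        self.instructions = instructions        # up to n instructions
        self.branch_outcomes = branch_outcomes  # up to m recorded outcomes

def trace_cache_lookup(lines, fetch_pc, predictions):
    """Return a line's instructions on a hit: the start address must match
    and the stored branch outcomes must agree with the predictor."""
    for line in lines:
        k = len(line.branch_outcomes)
        if line.start_pc == fetch_pc and line.branch_outcomes == predictions[:k]:
            return line.instructions  # hit: feed the whole trace to the decoder
    return None  # miss: fall back to the instruction cache
```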
To increase the performance of the trace cache, many techniques have been proposed by Rotenberg et al. [1] and others [2, 4]. However, none of the proposed techniques deals explicitly with the hit and replacement logic or the line-fill buffer logic. The following section explains why the hit and replacement logic is important and outlines a possible solution.

2 MOTIVATION AND HYPOTHESIS

By comparing the sizes of SPEC2000 applications with those of previous years, we can easily see a trend toward larger applications. The trace cache gives ideal performance when it is large enough to keep all of the basic blocks of the application. Since applications are very large and trace caches are very limited in size, better logic must be provided to control which basic blocks are placed in the trace cache. The performance of applications will benefit from a trace cache having judicious hit and replacement logic.

Rajnish Kumar, Amit Kumar Saha, Jerry T. Yen
Department of Computer Science and Electrical Engineering
George R. Brown School of Engineering, Rice University
{rajnish, amsaha, yen}@rice.edu

Figure 1: The trace cache fetch mechanism [1]

A simple Least Recently Used (LRU) technique for identifying the victim line in the trace cache may not be very helpful if the trace cache is very small. Many trace cache lines may contain fewer basic blocks than the line can store, wasting trace cache capacity. Similarly, a trace cache line may hold a block sequence with many not-taken branches, which logically forms a contiguous stream. These properties of the lines can be used to decide which victim line in the trace cache should be replaced with a new line. We therefore propose a weight-based technique as the replacement logic.
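The weight-based idea can be illustrated with a small sketch. The weighting below is our own reading of the proposal, not the authors' formula: a line is a cheap victim when it is sparsely filled (wasted capacity) or dominated by not-taken branches (a contiguous stream the instruction cache can supply anyway).

```python
# Illustrative weight-based victim selection (the weighting function is an
# assumption, not the paper's exact formula). Low weight marks a good victim.
def line_weight(num_blocks, num_not_taken, max_blocks):
    fullness = num_blocks / max_blocks               # how much of the line is used
    contiguity = num_not_taken / max(num_blocks, 1)  # share of fall-through branches
    return fullness - contiguity                     # sparse or contiguous => low weight

def pick_victim(lines, max_blocks=3):
    """lines: list of (num_blocks, num_not_taken) pairs; return victim index."""
    weights = [line_weight(b, nt, max_blocks) for b, nt in lines]
    return weights.index(min(weights))
```

Under this sketch, a line holding a single basic block loses to a fully packed line of taken branches, matching the intuition that sparsely filled lines waste capacity.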
The other problem, which becomes more prominent with a small trace cache, is shown in Figure 3 and explained in more detail in the next section: redundancy in the trace cache due to multiple entries for the same basic block. To avoid this redundancy, we propose more judicious line-fill buffer logic. We therefore hypothesize that weighted replacement logic and a modification to the line-fill buffer logic will increase performance. Our hypothesis is based on judiciously selecting the victim line and avoiding redundancy in the trace cache. Using an extremely large trace cache, where no replacement policy is required because blocks always have room to enter the trace cache, and a large number of execution resources, the performance improvement possible with a trace cache can be seen. Figure 2 shows room for an average performance increase of 10-20% over a processor with a limited-size trace cache using an LRU replacement policy. The proposed techniques will benefit most scientific applications, which comprise numerous loops and non-contiguous code. If the application code contains few cyclic calls or loops, an LRU-based policy may be more useful; however, in such applications a trace cache yields only marginal improvements.

3 ARCHITECTURE

3.1 The basic trace cache design

The trace cache was implemented in a manner similar to that depicted in Figure 1, with full associativity. In our implementation, m is 3 and n is 16. The core fetch unit is the same as the fetch unit used in the SimpleScalar simulator [3]. It fetches instructions from only one instruction cache line per cycle. If there is a miss in the instruction cache, it blocks until the miss is resolved. Once the instructions are fetched, they are placed in a dispatch (or decode) queue.
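The redundancy-avoiding line-fill buffer check hypothesized in Section 2 can be sketched as follows. The structures and the all-blocks-already-cached test are our own illustration, with basic blocks identified by hypothetical start addresses.

```python
# Sketch of a redundancy check for the line-fill buffer (an illustration
# of the idea, not the paper's implementation). A completed trace is
# dropped if every one of its basic blocks already appears in the cache.
def is_redundant(new_blocks, cache_lines):
    cached = {block for line in cache_lines for block in line}
    return all(block in cached for block in new_blocks)

def commit_line(cache_lines, new_blocks):
    """Write the filled buffer to the cache unless it adds no new block."""
    if not is_redundant(new_blocks, cache_lines):
        cache_lines.append(list(new_blocks))
    return cache_lines
```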
If the fetch unit encounters a branch, it consults the branch predictor to obtain the correct cache line to access. The trace cache is filled with instructions taken after the commit stage; past research has found that this neither improves nor worsens the hit percentage. The length of the line-fill buffer is limited by either the number of instructions n or the number of basic blocks m, whichever limit is reached first. Once the line-fill buffer is full, the line is flushed to the trace cache. If there is an empty line in the trace

Figure 2: Possible Improvement with Trace Cache. (a) SPEC2000 Benchmarks; (b) SimpleScalar Test Benchmarks. [Panel (a) plots IPC improvement with a 64-entry trace cache for ammp, mcf*, and vpr; y-axis 0 to 2.5.]
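The line-fill buffer limit described above, stopping at n instructions or m basic blocks, whichever comes first, can be sketched as follows; the stream representation is our own assumption.

```python
# Sketch of line-fill buffer termination (n=16, m=3 as in the paper's
# implementation; the (instruction, ends_basic_block) stream format is an
# assumption about how committed instructions arrive).
def fill_line(instr_stream, n=16, m=3):
    """Capture committed instructions until either limit is reached."""
    line, blocks = [], 0
    for instr, ends_block in instr_stream:
        line.append(instr)
        if ends_block:
            blocks += 1
        if len(line) == n or blocks == m:
            break  # buffer is full: flush this line to the trace cache
    return line
```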

