The Block based Trace Cache Bryan Black Bohuslav Rychlik and John Paul Shen Department of Electrical and Computer Engineering Carnegie Mellon University Pittsburgh PA 15213 black bohuslav shen ece cmu edu Abstract 1 1 The trace cache is a recently proposed solution to achieving high instruction fetch bandwidth by buffering and reusing dynamic instruction traces This work presents a new block based trace cache implementation that can achieve higher IPC performance with more efficient storage of traces Instead of explicitly storing instructions of a trace pointers to blocks constituting a trace are stored in a much smaller trace table The block based trace cache renames fetch addresses at the basic block level and stores aligned blocks in a block cache Traces are constructed by accessing the replicated block cache using block pointers from the trace table Performance potential of the blockbased trace cache is quantified and compared with perfect branch prediction and perfect fetch schemes Comparing to the conventional trace cache the block based design can achieve higher IPC with less impact on cycle time Results Using the SPECint95 benchmarks a 16 wide realistic design of a block based trace cache can improve performance 75 over a baseline design and to within 7 of a baseline design with perfect branch prediction With idealized trace prediction it is shown the block based trace cache with an 1K entry block cache achieves the same performance of the conventional trace cache with 32K entries 1 Introduction Most technologists anticipate the continuation of Moore s law of increasing chip density and complexity for another 10 years However existing superscalar techniques for harvesting instruction level parallelism ILP are encountering strong diminishing returns In order to justify building wider superscalar processors new microarchitecture techniques capable of achieving significantly higher IPC average instructions executed per cycle for ordinary programs are essential High Bandwidth Instruction Fetching A critical challenge to achieving high IPC is supplying enough useful instructions for execution High bandwidth instruction fetching must address several problems First the machine must perform multiple branch predictions in every cycle Second the fetch engine must be able to fetch from multiple non contiguous addresses in every cycle reaching beyond the taken branch Third misaligned instructions from the multiple fetch groups must be collapsed into the fetch buffer 1 2 Previous Related Work The three challenges in high bandwidth instruction fetching multiple branch prediction multiple fetch groups and instruction alignment and collapsing have been addressed in previous studies These previous works fall into two general categories those that use an enhanced instruction cache 3 16 18 19 and those that use a trace cache 5 8 14 15 The fundamental difference between them is where in the pipeline instructions are aligned and collapsed Both perform multiple branch prediction the cycle before instruction fetch and update the predictor at completion Techniques that use an enhanced instruction cache support fetch of non contiguous blocks with a multi ported multi banked or multiple copies of the instruction cache This leads to multiple fetch groups that must be aligned and collapsed at fetch time which can increase the fetch latency The conventional trace cache performs all instruction alignment and collapsing at completion time At completion time a fill unit constructs traces of instructions to be stored in the trace cache The conventional trace cache optimizes the fetch latency by shifting the complexity of multiplebranch prediction multiple fetch groups and instruction alignment and collapsing to the completion time This work proposes a block based trace cache that is able to achieve higher IPC performance while requiring less 1063 6897 99 10 00 c 1999 IEEE Next Trace Pred trace id History Hash Trace Cache br hist I Cache trace storage capacity It also facilitates more flexible trace prediction schemes for example partial matching 5 can be easily and naturally implemented Furthermore it has the potential of enabling clean implementation of other dynamic optimizations such as multi path execution dynamic predication and dynamic multithreading The block based trace cache employs a number of novel features Instead of explicitly storing instructions of a trace pointers to blocks constituting a trace are stored in a much smaller trace table The block based trace cache renames fetch addresses at the basic block level and stores aligned blocks in a block cache Traces are constructed by accessing the replicated block cache using block pointers from the trace table Figure 1 and Figure 2 illustrate the basic difference between the conventional and the block based trace caches Both designs perform next trace prediction in the cycle prior to the fetch cycle The conventional design employs a sophisticated path based next trace predictor 8 to produce the next trace identifier which is used in the fetch cycle to access the trace cache Similarly the block based trace cache makes the next trace prediction by accessing the trace table which outputs the identifiers of the blocks in the predicted trace These block identifiers are used in the fetch cycle to access the replicated block cache Section 3 1 3 will show the accessing of the replicated block cache does not increase the fetch latency of the block based design Fetch Buffer Execution Core Fill Unit Completion that of perfect branch prediction and approaching that of perfect fetch if an idealized trace predictor is used Section 7 compares the block based trace cache to the conventional trace cache implementation and demonstrates a much higher IPC when trace storage capacity is limited 2 Fetch Address Renaming The implementation of the block based trace cache is based on the concept of fetch address renaming This section presents the motivation for renaming fetch addresses suggests possible options for the renamed entity and argues that the basic block is a convenient and effective entity for renaming 2 1 Motivation for Renaming Traditionally the effective address of an instruction is used for its fetching Such addresses contain many bits An instruction cache big enough to support the full decoding of these bits is impractical and not necessary Typically a subset of the fetch address bits are used to index into the instruction cache which results in
View Full Document
Unlocking...