The Block-based Trace Cache

Bryan Black, Bohuslav Rychlik, and John Paul Shen
Department of Electrical and Computer Engineering
Carnegie Mellon University
Pittsburgh, PA 15213
{black,bohuslav,shen}@ece.cmu.edu

Abstract

The trace cache is a recently proposed solution for achieving high instruction fetch bandwidth by buffering and reusing dynamic instruction traces. This work presents a new block-based trace cache implementation that can achieve higher IPC performance with more efficient storage of traces. Instead of explicitly storing the instructions of a trace, pointers to the blocks constituting a trace are stored in a much smaller trace table. The block-based trace cache renames fetch addresses at the basic block level and stores aligned blocks in a block cache. Traces are constructed by accessing the replicated block cache using block pointers from the trace table. The performance potential of the block-based trace cache is quantified and compared with perfect branch prediction and perfect fetch schemes. Compared to the conventional trace cache, the block-based design can achieve higher IPC with less impact on cycle time.

Results: Using the SPECint95 benchmarks, a 16-wide realistic design of a block-based trace cache can improve performance 75% over a baseline design, and to within 7% of a baseline design with perfect branch prediction. With idealized trace prediction, it is shown that a block-based trace cache with a 1K-entry block cache achieves the same performance as a conventional trace cache with 32K entries.

1 Introduction

Most technologists anticipate the continuation of Moore's law of increasing chip density and complexity for another 10 years. However, existing superscalar techniques for harvesting instruction-level parallelism (ILP) are encountering strong diminishing returns.
In order to justify building wider superscalar processors, new microarchitecture techniques capable of achieving significantly higher IPC (average instructions executed per cycle) for ordinary programs are essential.

1.1 High Bandwidth Instruction Fetching

A critical challenge to achieving high IPC is supplying enough useful instructions for execution. High-bandwidth instruction fetching must address several problems. First, the machine must perform multiple branch predictions in every cycle. Second, the fetch engine must be able to fetch from multiple non-contiguous addresses in every cycle, reaching beyond the taken branch. Third, misaligned instructions from the multiple fetch groups must be collapsed into the fetch buffer.

1.2 Previous Related Work

The three challenges in high-bandwidth instruction fetching (multiple-branch prediction, multiple fetch groups, and instruction alignment and collapsing) have been addressed in previous studies. These previous works fall into two general categories: those that use an enhanced instruction cache [3][16][18][19] and those that use a trace cache [5][8][14][15]. The fundamental difference between them is where in the pipeline instructions are aligned and collapsed. Both perform multiple-branch prediction the cycle before instruction fetch and update the predictor at completion. Techniques that use an enhanced instruction cache support the fetch of non-contiguous blocks with multi-porting, multi-banking, or multiple copies of the instruction cache. This leads to multiple fetch groups that must be aligned and collapsed at fetch time, which can increase the fetch latency. The conventional trace cache performs all instruction alignment and collapsing at completion time, where a fill unit constructs the traces of instructions to be stored in the trace cache.
The conventional trace cache optimizes the fetch latency by shifting the complexity of multiple-branch prediction, multiple fetch groups, and instruction alignment and collapsing to completion time.

This work proposes a block-based trace cache that is able to achieve higher IPC performance while requiring less trace storage capacity. It also facilitates more flexible trace prediction schemes; for example, partial matching [5] can be easily and naturally implemented. Furthermore, it has the potential of enabling clean implementation of other dynamic optimizations, such as multi-path execution, dynamic predication, and dynamic multithreading.

1063-6897/99/$10.00 (c) 1999 IEEE

The block-based trace cache employs a number of novel features. Instead of explicitly storing the instructions of a trace, pointers to the blocks constituting a trace are stored in a much smaller trace table. The block-based trace cache renames fetch addresses at the basic block level and stores aligned blocks in a block cache. Traces are constructed by accessing the replicated block cache using block pointers from the trace table.

Figure 1 and Figure 2 illustrate the basic difference between the conventional and the block-based trace caches. Both designs perform next trace prediction in the cycle prior to the fetch cycle. The conventional design employs a sophisticated path-based next trace predictor [8] to produce the next trace identifier, which is used in the fetch cycle to access the trace cache. Similarly, the block-based trace cache makes its next trace prediction by accessing the trace table, which outputs the identifiers of the blocks in the predicted trace. These block identifiers are used in the fetch cycle to access the replicated block cache.
Section 3.1.3 will show that accessing the replicated block cache does not increase the fetch latency of the block-based design.

This paper starts with a discussion of fetch address renaming in Section 2, then uses Section 3 to detail the implementation and design trade-offs of the block-based trace cache. Section 4 discusses the simulation methodology for this work, and Section 5 explores the design space quantitatively. The performance potential of the block-based trace cache is quantified and compared with that of perfect branch prediction and perfect fetch schemes in Section 6. It is shown that the block-based trace cache achieves IPC higher than that of perfect branch prediction, and approaching that of perfect fetch, if an idealized trace predictor is used. Section 7 compares the block-based trace cache to the conventional trace cache implementation, and demonstrates a much higher IPC when trace storage capacity is limited.

Figure 1 - The conventional trace cache.

2 Fetch Address Renaming

The implementation of the block-based trace cache