CMU CS 15740 - The Block-based Trace Cache

The Block-based Trace Cache

Bryan Black, Bohuslav Rychlik, and John Paul Shen
Department of Electrical and Computer Engineering
Carnegie Mellon University
Pittsburgh, PA 15213
{black,bohuslav,shen}@ece.cmu.edu

Abstract

The trace cache is a recently proposed solution to achieving high instruction fetch bandwidth by buffering and reusing dynamic instruction traces. This work presents a new block-based trace cache implementation that can achieve higher IPC performance with more efficient storage of traces. Instead of explicitly storing the instructions of a trace, pointers to the blocks constituting a trace are stored in a much smaller trace table. The block-based trace cache renames fetch addresses at the basic block level and stores aligned blocks in a block cache. Traces are constructed by accessing the replicated block cache using block pointers from the trace table. The performance potential of the block-based trace cache is quantified and compared with perfect branch prediction and perfect fetch schemes. Compared to the conventional trace cache, the block-based design can achieve higher IPC with less impact on cycle time.

Results: Using the SPECint95 benchmarks, a 16-wide realistic design of a block-based trace cache can improve performance 75% over a baseline design and to within 7% of a baseline design with perfect branch prediction. With idealized trace prediction, it is shown that the block-based trace cache with a 1K-entry block cache achieves the same performance as the conventional trace cache with 32K entries.

1 Introduction

Most technologists anticipate the continuation of Moore's law of increasing chip density and complexity for another 10 years. However, existing superscalar techniques for harvesting instruction-level parallelism (ILP) are encountering strong diminishing returns.
In order to justify building wider superscalar processors, new microarchitecture techniques capable of achieving significantly higher IPC (average instructions executed per cycle) for ordinary programs are essential.

1.1 High Bandwidth Instruction Fetching

A critical challenge to achieving high IPC is supplying enough useful instructions for execution. High-bandwidth instruction fetching must address several problems. First, the machine must perform multiple branch predictions in every cycle. Second, the fetch engine must be able to fetch from multiple non-contiguous addresses in every cycle, reaching beyond the taken branch. Third, misaligned instructions from the multiple fetch groups must be collapsed into the fetch buffer.

1.2 Previous Related Work

The three challenges in high-bandwidth instruction fetching (multiple-branch prediction, multiple fetch groups, and instruction alignment and collapsing) have been addressed in previous studies. These previous works fall into two general categories: those that use an enhanced instruction cache [3][16][18][19] and those that use a trace cache [5][8][14][15]. The fundamental difference between them is where in the pipeline instructions are aligned and collapsed. Both perform multiple-branch prediction in the cycle before instruction fetch and update the predictor at completion. Techniques that use an enhanced instruction cache support fetching of non-contiguous blocks using a multi-ported, multi-banked, or replicated instruction cache. This leads to multiple fetch groups that must be aligned and collapsed at fetch time, which can increase the fetch latency. The conventional trace cache instead performs all instruction alignment and collapsing at completion time, where a fill unit constructs traces of instructions to be stored in the trace cache.
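As a toy illustration of the fetch-time alignment and collapsing just described (a sketch, not the paper's design; all names, widths, and the instruction encoding are hypothetical), the model below fetches across a taken branch in one "cycle" and collapses the resulting non-contiguous fetch groups into a single fetch buffer:

```python
# Toy model of one high-bandwidth fetch cycle (illustrative assumptions only).
# Instruction memory is a dict: address -> instruction text; a taken branch
# ("br <target>") redirects fetch to a non-contiguous address mid-cycle.

FETCH_WIDTH = 8   # instructions delivered to the fetch buffer per cycle
GROUP_WIDTH = 4   # instructions per fetch group (one cache-line access)

def fetch_group(imem, addr):
    """Fetch one group starting at addr; stop early at a taken branch."""
    group, target = [], None
    for a in range(addr, addr + GROUP_WIDTH):
        inst = imem[a]
        group.append((a, inst))
        if inst.startswith("br"):          # taken branch ends this group
            target = int(inst.split()[1])  # redirect to the branch target
            break
    return group, target

def fetch_cycle(imem, pc):
    """Collapse multiple (possibly non-contiguous) groups into one buffer."""
    buffer = []
    while len(buffer) < FETCH_WIDTH:
        group, target = fetch_group(imem, pc)
        # Misaligned groups are concatenated (collapsed) here at fetch time.
        buffer = (buffer + group)[:FETCH_WIDTH]
        pc = target if target is not None else pc + len(group)
    return buffer
```

The collapsing step in `fetch_cycle` is exactly the work that enhanced-instruction-cache schemes place on the fetch path, and that trace caches move to completion time.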
The conventional trace cache optimizes the fetch latency by shifting the complexity of multiple-branch prediction, multiple fetch groups, and instruction alignment and collapsing to completion time.

This work proposes a block-based trace cache that is able to achieve higher IPC performance while requiring less trace storage capacity. It also facilitates more flexible trace prediction schemes; for example, partial matching [5] can be easily and naturally implemented. Furthermore, it has the potential of enabling clean implementations of other dynamic optimizations, such as multi-path execution, dynamic predication, and dynamic multithreading.

The block-based trace cache employs a number of novel features. Instead of explicitly storing the instructions of a trace, pointers to the blocks constituting a trace are stored in a much smaller trace table. The block-based trace cache renames fetch addresses at the basic block level and stores aligned blocks in a block cache. Traces are constructed by accessing the replicated block cache using block pointers from the trace table.

Figure 1 and Figure 2 illustrate the basic difference between the conventional and the block-based trace caches. Both designs perform next-trace prediction in the cycle prior to the fetch cycle. The conventional design employs a sophisticated path-based next trace predictor [8] to produce the next trace identifier, which is used in the fetch cycle to access the trace cache. Similarly, the block-based trace cache makes the next trace prediction by accessing the trace table, which outputs the identifiers of the blocks in the predicted trace. These block identifiers are used in the fetch cycle to access the replicated block cache.
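A minimal software sketch of this storage organization may help fix the idea (structure only; class and field names are illustrative assumptions, not the paper's interfaces): the trace table holds a few block pointers per entry rather than whole instruction traces, and fetch gathers the predicted blocks from the block cache:

```python
# Illustrative model of block-based trace storage (names are assumptions).

class BlockCache:
    """Stores aligned basic blocks, indexed by block id (renamed address)."""
    def __init__(self):
        self.blocks = {}  # block_id -> list of instructions

    def insert(self, block_id, instructions):
        self.blocks[block_id] = instructions

    def read(self, block_id):
        # In hardware, each block pointer indexes its own replicated copy of
        # the block cache, so all blocks of a trace are read in one cycle.
        return self.blocks[block_id]

class TraceTable:
    """Maps a fetch address to the block ids of the predicted trace."""
    def __init__(self):
        self.entries = {}  # fetch_addr -> tuple of block ids

    def fill(self, fetch_addr, block_ids):
        self.entries[fetch_addr] = tuple(block_ids)

    def predict(self, fetch_addr):
        return self.entries.get(fetch_addr)

def fetch_trace(trace_table, block_cache, fetch_addr):
    """Construct the predicted trace by gathering its blocks."""
    block_ids = trace_table.predict(fetch_addr)
    if block_ids is None:
        return None  # trace table miss: fall back to the instruction cache
    trace = []
    for bid in block_ids:
        trace.extend(block_cache.read(bid))
    return trace
```

Note how a trace table entry is only a short tuple of block pointers, which is why the trace table can be far smaller than a conventional trace cache storing the instructions of every trace, and why shared blocks are stored once rather than replicated across traces.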
Section 3.1.3 will show that accessing the replicated block cache does not increase the fetch latency of the block-based design.

This paper starts with a discussion of fetch address renaming in Section 2, then uses Section 3 to detail the implementation and design trade-offs of the block-based trace cache. Section 4 discusses the simulation methodology for this work, and Section 5 explores the design space quantitatively. The performance potential of the block-based trace cache is quantified and compared with that of perfect branch prediction and perfect fetch schemes in Section 6. It is shown that the block-based trace cache achieves IPC higher than that of perfect branch prediction and approaching that of perfect fetch if an idealized trace predictor is used. Section 7 compares the block-based trace cache to the conventional trace cache implementation, and demonstrates a much higher IPC when trace storage capacity is limited.

[Figure 1 - The conventional trace cache. Diagram labels: next trace predictor, history hash, trace cache (trace_id), I-cache, fetch buffer, execution core, fill unit at completion.]

2 Fetch Address Renaming

The implementation of the block-based trace cache

