AbstractIndirect branch prediction is likely to become increasinglyimportant in the future because indirect branches occurmore frequently in object-oriented programs. With mispre-diction rates of around 25% on current processors, indirectbranches can incur a significant fraction of branch mispre-diction overhead even though they remain less frequent thanthe more predictable conditional branches. We investigate awide range of two-level predictors dedicated exclusively toindirect branches. Starting with predictors that use full-pre-cision addresses and unlimited tables, we progressivelyintroduce hardware constraints and minimize the loss of pre-dictor performance at each step. For programs from theSPECint95 suite as well as a suite of large C++ applica-tions, a two-level predictor achieves a misprediction rate of9.8% with a 1K-entry table and 7.3% with an 8K-entry table,representing more than a threefold improvement over anideal BTB. A hybrid predictor further reduces the mispredic-tion rates to 8.98% (1K) and 5.95% (8K).1. IntroductionCurrent high-performance superscalar processors usebranch prediction to speculatively execute instructionsbeyond an unresolved branch. If the branch is mispredicted,this work is lost, and execution must restart right after thebranch instruction. Newer designs increase instructions issuewidth and pipeline depth, increasing the relative overhead ofmispredicted branches and making accurate branch predic-tion even more critical to performance.Conditional direct branches, whose target is encoded inthe instruction itself, can already be predicted with reportedhit rates of up to 97% ([YP93]). In contrast, indirectbranches, which transfer control to an address stored in aregister, are harder to predict accurately. Unlike conditionalbranches, they can have more than two targets so that predic-tion requires a full 32-bit or 64-bit address rather than just a“taken” or “not taken” bit. Current processors predict indi-rect branches with a branch target buffer (BTB) whichcaches the most recent target address of a branch. Unfortu-nately, BTBs typically have much lower prediction rates thanthe best predictors for conditional branches. For example, anideal (unconstrained) BTB achieves an average prediction hitratio of only 64% on the SPECint95 benchmarks.Though not as common as conditional branches, indirectbranches occur frequently enough to cause substantial over-head. Chang et al. [CHP97] predict a reduction in executiontime of 14% and 5% for the perl and gcc benchmarks on awide-issue superscalar processor with an improved predic-tion mechanism for indirect branches (Target Cache).In C++ and Java programs, indirect branches occur witheven higher frequency (see Table 1). These languagespromote a polymorphic programming style in which latebinding of subroutine invocations is the main instrument formodular code design. Virtual function tables, the implemen-tation of choice for most C++ and Java compilers, execute anindirect branch for every polymorphic call. The C++programs studied here execute an indirect branch asfrequently as once every 50 instructions; other studies[CGZ94] have shown similar results. Some of the C++programs in Table 1 execute only 6 conditional branches forevery indirect branch.Predictated instructions [M+94] further increase theimportance of indirect branch prediction since they removeconditional branches and thus conditional branch misses. Forexample, Intel expects predication to reduce the number ofconditional branches by half for the IA-64 architecture[Intel97]. With indirect branches becoming more frequentrelative to conditional branches, and with indirect branchesbeing mispredicted much more frequently, indirect branchprediction misses can start to dominate the overall branchmisprediction cost. For example, if indirect branches aremispredicted 12 times more frequently (36% vs. 3% missratio), indirect branch misses will dominate conditionalbranch misses as long as indirect branches occur morefrequently than every 12 conditional branches.As the relevance of indirect branches grows, so does theopportunity for more sophisticated prediction mechanisms.Accurate Indirect Branch PredictionKarel Driesen and Urs HölzleDepartment of Computer ScienceUniversity of CaliforniaSanta Barbara, CA 93106Reference:Karel Driesen and Urs Hölzle. Accurate Indirect BranchPrediction. ISCA ‘98 Conference Proceedings, pp. 167-178, Barcelona, July 19982In the next decade, uniprocessors may reach one billion tran-sistors, with 48 million transistors dedicated to branch predic-tion ([P+97]).In this study, we explore the design space of predictionmechanisms that are exclusively dedicated to indirectbranches. Since the link between misprediction rate andexecution overhead has been demonstrated in [CHP97], wefocus on the minimization of branch misprediction rate.Initially, we assume unlimited hardware resources so thatresults are not obscured by implementation artifacts such asinterference in tagless tables. We then progressively intro-duce hardware constraints, each of which causes a new typeof interference and corresponding performance loss. Werepeat this process until we obtain implementable predictors.Finally, the practical predictors are pairwise combined into ahybrid predictor, further improving prediction accuracy.2. BenchmarksOur benchmark suite (see Table 1) consists of large object-oriented C++ applications that range from 8,000 to over75,000 non-blank lines of C++ code each., and beta, aaSunSoft version 1.3bJava High-level Class Modifierchardware description language compilerdSUIF 1.0eFresco X11R6 librarycompiler for the Beta programming language ([MMN93]),written in Beta. We also measured the SPECint95 benchmarksuite with the exception of compress which executes only 590branches during a complete run. Together, the benchmarksrepresent over 500,000 non-comment source lines.All C and C++ programs except self1 were compiled withGNU gcc 2.7.2 (options -O2 -msupersparc plus static linking)and run under the shade instruction-level simulator [CK93] toobtain traces of all indirect branches. Procedure returns wereexcluded because they can be predicted accurately with areturn address stack [KE91]. All programs were run tocompletion or until six million indirect branches wereexecuted.2 In jhm and self we excluded the initializationphases by skipping the first 5 and 6 million indirect branches,respectively.For each benchmark, the tables list the number of
or
We will never post anything without your permission.
Don't have an account? Sign up