MICROPROCESSOR REPORT

New Algorithm Improves Branch Prediction
Better Accuracy Required for Highly Superscalar Designs

by Linley Gwennap

Intel's P6 processor (see 090202.PDF) is the first to use a two-level branch-prediction algorithm to improve accuracy. This algorithm, first published by Tse-Yu Yeh and Yale Patt, has the potential to push accuracy well beyond the 90% level achieved by the best processors today. As future processors look to improve performance by increasing the issue rate and/or extending the pipeline depth, the two-level algorithm is likely to become more common.

Branch prediction has been a problem for CPU designers since the advent of pipelining. A pipelined processor must fetch the next instruction before the current one has executed. If the current instruction is a conditional branch, the processor must decide whether to fetch from the target address, assuming the branch will be taken, or from the next sequential address, assuming the branch will not be taken. An incorrect guess causes the pipeline to stall until it is refilled with valid instructions; this delay is called the mispredicted-branch penalty.

Processors with a simple five-stage pipeline typically have a two-cycle branch penalty. For a four-way superscalar design, however, this could mean a loss of eight instructions. If the pipeline is extended, the branch penalty usually increases, resulting in the loss of even more instructions. Since programs typically encounter branches every 4-6 instructions, inaccurate branch prediction causes a severe performance degradation in highly superscalar or deeply pipelined designs.

Initial efforts at branch prediction used simple algorithms based on the direction of the branch. Among commercial microprocessors, the MIPS R6000 pioneered the use of compiler hints to direct branch prediction. Digital's 21064 was the first microprocessor to store branch history information, with the P6 leading the way to two-level prediction. This article reviews these earlier algorithms before
explaining the new two-level method in more detail.

Simple Hardware Can Achieve 65%

For scalar processors with relatively short pipelines, branch prediction is less of a concern. In fact, for processors with a branch delay slot, the branch penalty can be as little as one cycle. The default prediction method for simple pipelined designs is to assume that branches are not taken, always fetching sequential instructions. The 486 and most embedded processors use this scheme because of its simplicity and low cost.

It turns out, however, that conditional branches are taken more often than not. Most programs make heavy use of loops, which repeatedly branch to the same address. Simulations show that conditional branches are taken about 60% of the time in the SPECint89 suite and more often in scientific code such as the SPECfp89 benchmarks [1]. Thus, a simple optimization is to always predict branches to be taken.

A better algorithm takes into account the direction of the branch. Backward branches typically complete loop iterations and thus are taken as much as 80% of the time or more. Forward branches are more difficult to predict but tend to be not taken more often than taken. Thus, by simply looking at the direction of the branch (usually available as the sign bit of the offset), a processor can predict backward branches taken and forward branches not taken. This BTFN algorithm succeeds about 65% of the time for SPECint89. MicroSparc-2 and most PA-RISC processors use BTFN.

With appropriate instruction-set hooks, the compiler can improve branch-prediction accuracy. Because it has access to the source code, a good compiler can recognize code sequences that are likely to branch, such as loops, and those that are unlikely to branch, such as exception checking. Current MIPS and PowerPC chips, among others, implement special branch instructions that encode the compiler's prediction in a single bit.

Compilers can take further advantage of these predicted-branch instructions by
using a technique called profiling, or feedback-directed compilation. After the program is initially compiled, it is run using test data to determine the typical direction of each branch; the program is then recompiled to adjust the branch-prediction bits. According to IBM, its compilers achieve 75% accuracy on SPECint92 using this technique.

Dynamic Prediction Uses History

The previous algorithms are classified as static schemes, because any particular branch is always predicted in the same way whenever it is encountered. To achieve greater accuracy, dynamic algorithms take into account run-time information. The processor learns from its mistakes and changes its predictions to match the behavior of each particular branch.

A dynamic algorithm keeps a record of previous branch behavior, allowing it to improve its predictions over time. A simple scheme published by James Smith in 1981 [2] maintains a single history bit for each branch. When a branch is encountered, it is predicted to go the same way it did the previous time, as indicated by the bit. This technique can push accuracy to 80%.

Vol. 9, No. 4, March 27, 1995 © 1995 MicroDesign Resources

[Figure 1 (state diagram): states ST (3) and WT (2) predict taken; states WNT (1) and SNT (0) predict not taken; the arcs between states are labeled Taken and Not Taken.]

Figure 1. In the two-bit Smith algorithm, the two history bits implement a state machine with four possible states: strongly taken (ST), weakly taken (WT), weakly not taken (WNT), and strongly not taken (SNT). In ST and WT, future branches are predicted taken; in WNT and SNT, branches are predicted not taken.

As a practical matter, there are two ways to implement this scheme. The history bits can be kept in the instruction cache, for example, one per every four instructions. When instructions are fetched from the cache, the history bit comes along. If the bit is set, that group of instructions contains a predicted-taken branch, and the fetch stream is redirected. In this example, the storage overhead would be less than 1% of the cache area.
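
The one-bit and two-bit Smith schemes described above can be sketched in a few lines of Python. This is an illustration only, not any processor's actual logic. The one-bit version follows the text directly: predict whatever the branch did last time. For the two-bit version, the state numbering (ST = 3 down to SNT = 0) comes from Figure 1, but the transitions are an assumption: a plain saturating counter that steps up on a taken branch and down on a not-taken branch. The function names and the sample loop pattern are hypothetical.

```python
# Minimal sketch of Smith-style history predictors (illustrative only).

# State encoding taken from Figure 1.
SNT, WNT, WT, ST = 0, 1, 2, 3


def one_bit_mispredictions(outcomes, last_taken=True):
    """Count misses when each branch is predicted to repeat its last outcome."""
    misses = 0
    for taken in outcomes:
        if taken != last_taken:
            misses += 1
        last_taken = taken  # the single history bit records the latest outcome
    return misses


def two_bit_next_state(state, taken):
    """Assumed transition rule: saturating increment/decrement."""
    return min(state + 1, ST) if taken else max(state - 1, SNT)


def two_bit_mispredictions(outcomes, state=ST):
    """Count misses for the two-bit scheme: ST and WT predict taken."""
    misses = 0
    for taken in outcomes:
        predicted_taken = state >= WT
        if predicted_taken != taken:
            misses += 1
        state = two_bit_next_state(state, taken)
    return misses


# Hypothetical loop-closing branch: taken nine times, then not taken on
# loop exit, with the whole loop executed three times.
loop_pattern = ([True] * 9 + [False]) * 3

print(one_bit_mispredictions(loop_pattern))  # 5
print(two_bit_mispredictions(loop_pattern))  # 3
```

Under these assumptions, the one-bit predictor misses on each loop exit and again on the first iteration of the next pass, while the two-bit counter only drops from ST to WT on the exit and so misses once per pass.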
Although this cache-based method, used by Digital's Alpha, AMD's K5, and other processors, provides dynamic prediction with minimal cost, it has some drawbacks. Some groups of instructions will not contain a branch, wasting the history bit. Groups with multiple branches create interference, as the history of one branch overwrites that of another in the same group.

Processors such as Pentium store the history bits in a separate branch history table (BHT), assigning one entry per branch. By avoiding the interference and unused bits of