NYU CSCI-GA 2243 - Branch Prediction Scoreboarding

Unformatted text preview:

2/10/2006 119G22.2243-001High Performance Computer ArchitectureLecture 4Branch PredictionScoreboardingFeb. 8, 20062/10/2006 120Outline• Announcements– HW Assignment 1 due now– Lab Assignment 1 due back february 15th• Reducing the impact of Hazards[ Hennessy/Patterson CA:AQA (3rd Edition): Appendix A, Chapter 3]2/10/2006 121Recap: Hazards1) StructuralTwo instructions need the same resource– Dual ported memories (separated instruction and data memories)– Dedicated adder for NPC calculation2) Data HazardInstruction depends on result of prior instruction still in the pipeline– RAW, WAW and WAR hazards– Forwarding (registers and results)– Scoreboarding3) Control hazardPipelining of branches & other instructions that change the PC– 3-stall architecture– 1-stall architecture–Static Schemes• Predict branch taken • Predict branch not taken• Delay slot–Dynamic schemes• Branch prediction buffers• Branch target buffers• Return address stacks2/10/2006 122Dynamic Branch Prediction (1): Branch Prediction (History) Buffer• Small memory indexed by the low-order bits of the branch instruction–Stores a single bit of information: T or NT• Starts off as T, flips whenever a branch behaves opposite to prediction– Benefits for larger pipelines, more complex branches• Problems with this simple scheme– Prediction value may not correspond to branch being considered• Cannot avoid this: Branch Prediction Buffer serves as a cache without tags– Does not do a good job of predicting “mostly-taken branches”• E.g, a loop: for (i=0; i<10; i++) { … }• Repeated executions of the loop will result in 2 incorrect predictions– Last iteration flips T to NT– First iteration flips NT to T• So, prediction accuracy of 80%• Can we do better?2/10/2006 123Dynamic Branch Prediction (2):2-bit Prediction Schemes•Store 2 bits of information in branch history buffer• How does this do on our loop example?–1misprediction per iteration if we start off in the (11) state–1 misprediction per iteration (plus 2 mispredictions the first time) if we start in (00) state• Generalization: n-bit saturating counters– Increment if taken, decrement if not– Predict T if value more than half, else NT– Not too much of a win over 2-bit counters2/10/2006 124Prediction Accuracy of 2-bit Prediction Schemes• SPEC89 benchmarks using a 4096-entry 2-bit prediction buffer(Pan, So, and Rameh [1992])•Is hit-rate in the cache an issue?2/10/2006 125Dynamic Branch Prediction (3):Correlating Branch Predictors• 2-bit predictor uses only the recent behavior of a single branch to predict its future behavior• Is branch direction affected by more “global” properties?• Behavior of b3 is correlated with that of b1 and b2– if both b1 and b2 are NT, b3 will be T• Can (how do) we predict such branches?if (aa == 2)aa = 0;if (bb == 2)bb = 0;if (aa != bb){ … }assuming aa and bb are in R1 and R2DSUBUI R3, R1, #2BNEZ R3, L1 ; branch b1DADD R1, R0, R0L1: DSUBUI R3, R2, #2BNEZ R3, L2 ; branch b2DADD R2, R0, R0L2: DSUBU R3, R1, R2BEQZ R3, L3 ; branch b32/10/2006 126A (1,1) Correlating PredictorBehavior of a 1-bit predictor for repeated executions of above with values of d=2, 0, 2, 0,…1-bit predictor that uses 1 bit of correlation– X/Y: X if last branch was NT, Y if last branch was Tif (d == 0)d = 1;if (d == 1){ … }BNEZ R1, L1 ; branch b1DADDUI R1, R0, #1L1: DADDUI R3, R1, #-1BNEZ R3, L2 ; branch b2...L2:noyesyes=1?nonoyes=0?TT2NTT1NTNT0b2b1diniNTNTTNTNTT0TNTTb2newNTTNTb2pTNTTb1newTNTTb1actTNT2NTT0TNT2b2actb1pdiniNT/TNTNT/TT/NTNTT/NT0NT/TNT/TNT/Tb2nNT/TNT/TNT/NTb2pT/NTT/NTT/NTb1nTNTTb1TT/NT2NTT/NT0TNT/NT2b2b1p2/10/2006 127(m,n) Correlating Predictors• Use behavior of the last m branches to choose from among 2mn-bit predictors (for a single branch)– Makes the predictor context-sensitive• Yields improved prediction accuracy for small hardware cost– History of last m branches can be kept as a shift register• Each bit records whether corresponding branch was T/NT– Branch prediction buffer can then be indexed by concatenating the lower-order bits of address with the m-bit history•Variant: gshare– History and branch address bits are xor-edA (2,2) Correlating Predictor2/10/2006 128Prediction Accuracy of Correlating PredictorsIs this a fair comparison?Metric: Number of bits4096 entries (2-bit): 8K1024 entries (2,2): 22×2×1024= 8K2/10/2006 129Tournament Predictors• Adaptively combine local and global predictors– Make more effective use of large numbers of prediction bits• Generalization: Multilevel branch predictors– Several levels of branch prediction tables• Combination of local (with different bits/branch) and global (different m,n)– Selector algorithm that chooses which predictor to apply• A selector should be adaptive (“learn” from its mistakes) • Saturating counters as in local predictors– Increment/decrement based on which predictor is correct (when the other is incorrect)0|1: Predictor 1 was wrong, Predictor 2 was correct2/10/2006 130Performance of Tournament Predictors• Misprediction rate on SPEC89 as total number of bits is increased– “optimal” correlating predictor chosen at each point2/10/2006 131Examples of Branch Predictors in Use• Pentium– 2-bit local predictor• direct jump from (00) to (11) state• Pentium MMX, Pentium Pro, Celeron, Pentium II– (4,2) correlating predictor• Pentium III– 2-level adaptive tournament predictor• details are sketchy– 512-entry Branch Target Buffer (BTB)(next)• Pentium IV– 4096-entry BTB– Execution trace cache (later)• Alpha 21264 used a tournament predictor• Selector–Uses 4K 2-bit counters indexed by the local branch address to select between local and global predictor• Global predictor–4Kentries, indexed by history of last 12 branches– Each entry is a 2-bit predictor– (12,2) correlating predictor• Local predictor is itself 2-level– Top-level: 1K 10-bit history of (local) branch outcomes• Detects patterns– History used to index a 1K 3-bitsaturating counter table2/10/2006 132Branch Target Buffer (BTB)• In the classic 5-stage pipeline, an instruction is identified as a branch only in the ID stage– Branch prediction buffer can help decide whether to fetch from target address (fast pipeline), or fall-through– However, IF still ends up fetching a (possibly) useless instruction• So, even


View Full Document

NYU CSCI-GA 2243 - Branch Prediction Scoreboarding

Download Branch Prediction Scoreboarding
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Branch Prediction Scoreboarding and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Branch Prediction Scoreboarding 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?