Page 1CS252/CullerLec 18. 14/2/02CS252Graduate Computer ArchitectureLecture 18: Branch Prediction + analysis resources => ILPApril 2, 2002Prof. David E. CullerComputer Science 252Spring 2002CS252/CullerLec 18. 24/2/02Today’s Big Idea• Reactive: past actions cause system to adapt use– do what you did before better– ex: caches– TCP windows– URL completion, ...• Proactive: uses past actions to predict future actions– optimize speculatively, anticipate what you are about to do– branch prediction– long cache blocks– ???CS252/CullerLec 18. 34/2/02Review: Case for Branch Prediction when Issue N instructions per clock cycle1. Branches will arrive up to n times faster in an n-issue processor 2. Amdahl’s Law => relative impact of the control stalls will be larger with the lower potential CPI in an n-issue processor conversely, need branch prediction to ‘see’ potential parallelismCS252/CullerLec 18. 44/2/02Review: 7 Branch Prediction Schemes1. 1-bit Branch-Prediction Buffer2. 2-bit Branch-Prediction Buffer3. Correlating Branch Prediction Buffer4. Tournament Branch Predictor5. Branch Target Buffer6. Integrated Instruction Fetch Units7. Return Address PredictorsCS252/CullerLec 18. 54/2/02Review: Dynamic Branch Prediction• Performance = ƒ(accuracy, cost of misprediction)• Branch History Table: Lower bits of PC address index table of 1-bit values– Says whether or not branch taken last time– No address check (saves HW, but may not be right branch)• Problem: in a loop, 1-bit BHT will cause 2 mispredictions (avg is 9 iterations before exit):– End of loop case, when it exits instead of looping as before– First time through loop on next time through code, when it predicts exit instead of looping– Only 80% accuracy even if loop 90% of the timeCS252/CullerLec 18. 64/2/02• Better Solution: 2-bit scheme where change prediction only if get misprediction twice:• Red: stop, not taken• Green: go, taken• Adds hysteresis to decision making processReview: Dynamic Branch Prediction(Jim Smith, 1981)TTNTPredict TakenPredict Not TakenPredict TakenPredict Not TakenTNTTNTNTPage 2CS252/CullerLec 18. 74/2/02Consider 3 Scenarios• Branch for loop test• Check for error or exception• Alternating taken / not-taken– example?• Your worst-case prediction scenarioCS252/CullerLec 18. 84/2/02Correlating BranchesIdea: taken/not taken of recently executed branches is related to behavior of next branch (as well as the history of that branch behavior)– Then behavior of recent branches selects between, say, 4 predictions of next branch, updating just that prediction • (2,2) predictor: 2-bit global, 2-bit localBranch address (4 bits)2-bits per branch local predictorsPredictionPrediction2-bit recent global branch history(01 = not taken then taken)CS252/CullerLec 18. 94/2/020%1%5%6% 6%11%4%6%5%1%0%2%4%6%8%10%12%14%16%18%20%4,096 entries: 2-bits per entry Unlimited entries: 2-bits/entry 1,024 entries (2,2)Accuracy of Different Schemes(Figure 3.15, p. 206)4096 Entries 2-bit BHTUnlimited Entries 2-bit BHT1024 Entries (2,2) BHT0%18%Frequency of MispredictionsWhat’s missing in this picture?CS252/CullerLec 18. 104/2/02Re-evaluating Correlation• Several of the SPEC benchmarks have less than a dozen branches responsible for 90% of taken branches:program branch % static # = 90%compress 14% 236 13eqntott 25% 494 5gcc 15% 9531 2020mpeg 10% 5598 532real gcc 13% 17361 3214• Real programs + OS more like gcc• Small benefits beyond benchmarks for correlation? problems with branch aliases?CS252/CullerLec 18. 114/2/02BHT Accuracy• Mispredict because either:– Wrong guess for that branch– Got branch history of wrong branch when index the table• 4096 entry table programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%• For SPEC92,4096 about as good as infinite tableCS252/CullerLec 18. 124/2/02Tournament Predictors• Motivation for correlating branch predictors is 2-bit predictor failed on important branches; by adding global information, performance improved• Tournament predictors: use 2 predictors, 1 based on global information and 1 based on local information, and combine with a selector• Hopes to select right predictor for right branch (or right context of branch)Page 3CS252/CullerLec 18. 134/2/02Dynamically finding structure in Spaghetti?CS252/CullerLec 18. 144/2/02Tournament Predictor in Alpha 21264• 4K 2-bit counters to choose from among a global predictor and a local predictor• Global predictor also has 4K entries and is indexed by the history of the last 12 branches; each entry in the global predictor is a standard 2-bit predictor– 12-bit pattern: ith bit 0 => ith prior branch not taken; ith bit 1 => ith prior branch taken; • Local predictor consists of a 2-level predictor: – Top level a local history table consisting of 1024 10-bit entries; each 10-bit entry corresponds to the most recent 10 branch outcomes for the entry. 10-bit history allows patterns 10 branches to be discovered and predicted. – Next level Selected entry from the local history table is used to index a table of 1K entries consisting a 3-bit saturating counters, which provide the local prediction• Total size: 4K*2 + 4K*2 + 1K*10 + 1K*3 = 29K bits!(~180,000 transistors)CS252/CullerLec 18. 154/2/02% of predictions from local predictor in Tournament Prediction Scheme98%100%94%90%55%76%72%63%37%69%0% 20% 40% 60% 80% 100%nasa7matrix300tomcatvdoducspicefppppgccespressoeqntottliCS252/CullerLec 18. 164/2/0294%96%98%98%97%100%70%82%77%82%84%99%88%86%88%86%95%99%0% 20% 40% 60% 80% 100%gccespressolifppppdoductomcatvBranch prediction accuracyProfile-based2-bit counterTournamentAccuracy of Branch Prediction• Profile: branch profile from last execution(static in that in encoded in instruction, but profile)fig 3.40CS252/CullerLec 18. 174/2/02Accuracy v. Size (SPEC89)0%1%2%3%4%5%6%7%8%9%10%0 8 16 24 32 40 48 56 64 72 80 88 96 104 112 120 128Total predictor size (Kbits)LocalCorrelatingTournamentCS252/CullerLec 18. 184/2/02Need Address at Same Time as Prediction• Branch Target Buffer (BTB): Address of branch index to get prediction AND branch address (if taken)– Note: must check for branch match now, since can’t use wrong branch address (Figure 3.19, 3.20)Branch PC Predicted PC=?PC of instructionFETCHExtra prediction statebitsYes: instruction is branch and use predicted PC as next PCNo: branch not predicted, proceed normally(Next PC =
View Full Document