CS252 Graduate Computer Architecture
Lecture 18: Branch Prediction + analysis resources => ILP
April 2, 2002
Prof. David E. Culler
Computer Science 252, Spring 2002
CS252/Culler, Lec 18.1, 4/2/02

Today’s Big Idea
• Reactive: past actions cause system to adapt use
  – do what you did before better
  – ex: caches
  – TCP windows
  – URL completion, ...
• Proactive: uses past actions to predict future actions
  – optimize speculatively, anticipate what you are about to do
  – branch prediction
  – long cache blocks
  – ???

Review: Case for Branch Prediction when Issue N instructions per clock cycle
1. Branches will arrive up to n times faster in an n-issue processor
2. Amdahl’s Law => relative impact of the control stalls will be larger with the lower potential CPI in an n-issue processor
Conversely, need branch prediction to ‘see’ potential parallelism

Review: 7 Branch Prediction Schemes
1. 1-bit Branch-Prediction Buffer
2. 2-bit Branch-Prediction Buffer
3. Correlating Branch Prediction Buffer
4. Tournament Branch Predictor
5. Branch Target Buffer
6. Integrated Instruction Fetch Units
7. Return Address Predictors

Review: Dynamic Branch Prediction
• Performance = ƒ(accuracy, cost of misprediction)
• Branch History Table: Lower bits of PC address index table of 1-bit values
  – Says whether or not branch taken last time
  – No address check (saves HW, but may not be right branch)
• Problem: in a loop, 1-bit BHT will cause 2 mispredictions (avg is 9 iterations before exit):
  – End of loop case, when it exits instead of looping as before
  – First time through loop on next time through code, when it predicts exit instead of looping
  – Only 80% accuracy even if loop 90% of the time

Review: Dynamic Branch Prediction (Jim Smith, 1981)
• Better Solution: 2-bit scheme where change prediction only if get misprediction twice:
[Figure: state diagram with two Predict Taken states and two Predict Not Taken states; transitions labeled T (taken) and NT (not taken)]
• Red: stop, not taken
• Green: go, taken
• Adds hysteresis to decision making process

Consider 3 Scenarios
• Branch for loop test
• Check for error or exception
• Alternating taken / not-taken
  – example?
• Your worst-case prediction scenario

Correlating Branches
Idea: taken/not taken of recently executed branches is related to behavior of next branch (as well as the history of that branch behavior)
  – Then behavior of recent branches selects between, say, 4 predictions of next branch, updating just that prediction
• (2,2) predictor: 2-bit global, 2-bit local
[Figure: branch address (4 bits) and 2-bit recent global branch history (01 = not taken then taken) select a prediction from the 2-bits per branch local predictors]

Accuracy of Different Schemes (Figure 3.15, p. 206)
[Figure: bar chart of Frequency of Mispredictions (axis 0% to 20%) comparing 4096 Entries 2-bit BHT, Unlimited Entries 2-bit BHT, and 1024 Entries (2,2) BHT]
What’s missing in this picture?

Re-evaluating Correlation
• Several of the SPEC benchmarks have less than a dozen branches responsible for 90% of taken branches:

program     branch %    static    # = 90%
compress    14%         236       13
eqntott     25%         494       5
gcc         15%         9531      2020
mpeg        10%         5598      532
real gcc    13%         17361     3214

• Real programs + OS more like gcc
• Small benefits beyond benchmarks for correlation? problems with branch aliases?

BHT Accuracy
• Mispredict because either:
  – Wrong guess for that branch
  – Got branch history of wrong branch when index the table
• 4096 entry table programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
• For SPEC92, 4096 about as good as infinite table

Tournament Predictors
• Motivation for correlating branch predictors is 2-bit predictor failed on important branches; by adding global information, performance improved
• Tournament predictors: use 2 predictors, 1 based on global information and 1 based on local information, and combine with a selector
• Hopes to select right predictor for right branch (or right context of branch)

Dynamically finding structure in Spaghetti?

Tournament Predictor in Alpha 21264
• 4K 2-bit counters to choose from among a global predictor and a local predictor
• Global predictor also has 4K entries and is indexed by the history of the last 12 branches; each entry in the global predictor is a standard 2-bit predictor
  – 12-bit pattern: ith bit 0 => ith prior branch not taken; ith bit 1 => ith prior branch taken;
• Local predictor consists of a 2-level predictor:
  – Top level a local history table consisting of 1024 10-bit entries; each 10-bit entry corresponds to the most recent 10
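
The 1-bit versus 2-bit comparison from the "Review: Dynamic Branch Prediction" slides can be checked with a small simulation. This is only a sketch under stated assumptions: it models the 2-bit scheme as a saturating counter (a common implementation, not necessarily the exact state machine drawn on the slide), starts the counter at "strongly taken", and uses a hypothetical loop branch that is taken 9 times and then falls through once. The names predictor_accuracy and loop_branch are illustrative, not from the lecture.

```python
def predictor_accuracy(outcomes, bits):
    """Fraction of correct predictions for one n-bit saturating counter."""
    top = (1 << bits) - 1        # highest counter value: 1 for 1-bit, 3 for 2-bit
    threshold = (top + 1) // 2   # predict taken when counter >= threshold
    counter = top                # assumption: counter starts at "strongly taken"
    correct = 0
    for taken in outcomes:
        predicted_taken = counter >= threshold
        correct += int(predicted_taken == taken)
        # Move toward taken or not taken, saturating at the ends.
        counter = min(counter + 1, top) if taken else max(counter - 1, 0)
    return correct / len(outcomes)


# Hypothetical loop branch: taken 9 times, then falls through once, repeated.
loop_branch = ([True] * 9 + [False]) * 100

one_bit = predictor_accuracy(loop_branch, bits=1)
two_bit = predictor_accuracy(loop_branch, bits=2)
print(f"1-bit: {one_bit:.3f}  2-bit: {two_bit:.3f}")
```

Under these assumptions the 1-bit counter mispredicts twice per pass through the loop (at the exit and again on re-entry), giving roughly 80% accuracy, while the 2-bit counter mispredicts only at the exit, giving 90%.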
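
The "Alternating taken / not-taken" scenario and the idea on the "Correlating Branches" slide can be illustrated together. The sketch below is an assumption-laden simplification: it tracks a single branch, so the recent-history register holds only that branch's own outcomes, and it omits the branch-address index that a real (2,2) predictor also uses. Setting history_bits=0 reduces it to one plain 2-bit counter. The function name and the starting counter values are hypothetical.

```python
def correlating_accuracy(outcomes, history_bits=2):
    """Accuracy of 2-bit counters selected by the last `history_bits` outcomes."""
    counters = [2] * (1 << history_bits)  # assumption: all start "weakly taken"
    mask = (1 << history_bits) - 1
    history = 0                           # most recent outcome in the low bit
    correct = 0
    for taken in outcomes:
        predicted_taken = counters[history] >= 2
        correct += int(predicted_taken == taken)
        # Update only the counter that made this prediction.
        if taken:
            counters[history] = min(counters[history] + 1, 3)
        else:
            counters[history] = max(counters[history] - 1, 0)
        # Shift the new outcome into the history register.
        history = ((history << 1) | int(taken)) & mask
    return correct / len(outcomes)


# Branch that strictly alternates taken / not taken.
alternating = [True, False] * 500

plain_two_bit = correlating_accuracy(alternating, history_bits=0)
with_history = correlating_accuracy(alternating, history_bits=2)
print(f"2-bit counter alone: {plain_two_bit:.3f}")
print(f"with 2 bits of history: {with_history:.3f}")
```

With these starting values the lone 2-bit counter is right only half the time on the alternating pattern, while the history-indexed counters mispredict once and then follow the pattern.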