Berkeley COMPSCI 252 - Lecture 17

CS252/Patterson, Lec 17.1, 3/23/01

CS252 Graduate Computer Architecture
Lecture 17: ILP and Dynamic Execution #2: Branch Prediction, Multiple Issue
March 23, 2001
Prof. David A. Patterson
Computer Science 252
Spring 2001

Review Tomasulo
• Reservation stations: implicit register renaming to larger set of registers + buffering source operands
  – Prevents registers as bottleneck
  – Avoids WAR, WAW hazards of Scoreboard
  – Allows loop unrolling in HW
• Not limited to basic blocks (integer unit gets ahead, beyond branches)
• Today, helps cache misses as well
  – Don’t stall for L1 Data cache miss (insufficient ILP for L2 miss?)
• Lasting Contributions
  – Dynamic scheduling
  – Register renaming
  – Load/store disambiguation
• 360/91 descendants are Pentium III; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264

Tomasulo Algorithm and Branch Prediction
• 360/91 predicted branches, but did not speculate: pipeline stopped until the branch was resolved
  – No speculation; only instructions that can complete
• Speculation with Reorder Buffer allows execution past branch, and then discard if branch fails
  – Just need to hold instructions in buffer until branch can commit

Case for Branch Prediction when Issue N instructions per clock cycle
• Branches will arrive up to n times faster in an n-issue processor
• Amdahl’s Law => relative impact of the control stalls will be larger with the lower potential CPI in an n-issue processor

7 Branch Prediction Schemes
• 1-bit Branch-Prediction Buffer
• 2-bit Branch-Prediction Buffer
• Correlating Branch Prediction Buffer
• Tournament Branch Predictor
• Branch Target Buffer
• Integrated Instruction Fetch Units
• Return Address Predictors

Dynamic Branch Prediction
• Performance = ƒ(accuracy, cost of misprediction)
• Branch History Table: lower bits of PC address index table of 1-bit values
  – Says whether or not branch taken last time
  – No address check (saves HW, but may not be right branch)
• Problem: in a loop, 1-bit BHT will cause 2 mispredictions (avg is 9 iterations before exit):
  – End of loop case, when it exits instead of looping as before
  – First time through loop on next time through code, when it predicts exit instead of looping
  – Only 80% accuracy even if loop 90% of the time

Dynamic Branch Prediction (Jim Smith, 1981)
• Solution: 2-bit scheme where change prediction only if get misprediction twice: (Figure 3.7, p. 249)
• Red: stop, not taken
• Green: go, taken
• Adds hysteresis to decision making process
[Figure: 2-bit predictor state diagram with two "Predict Taken" states and two "Predict Not Taken" states, connected by T (taken) and NT (not taken) transitions]

Correlating Branches
Idea: taken/not taken of recently executed branches is related to behavior of next branch (as well as the history of that branch behavior)
  – Then behavior of recent branches selects between, say, 4 predictions of next branch, updating just that prediction
• (2,2) predictor: 2-bit global, 2-bit local
[Figure: branch address (4 bits) and 2-bit global branch history (01 = not taken then taken) select among 2-bits-per-branch local predictors to produce the prediction]

Accuracy of Different Schemes (Figure 3.15, p. 257)
[Figure: bar chart of frequency of mispredictions (axis 0% to 18%) comparing a 4,096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1,024-entry (2,2) BHT]

Re-evaluating Correlation
• Several of the SPEC benchmarks have fewer than a dozen branches responsible for 90% of taken branches:

  program     branch %   static   # = 90%
  compress    14%        236      13
  eqntott     25%        494      5
  gcc         15%        9531     2020
  mpeg        10%        5598     532
  real gcc    13%        17361    3214

• Real programs + OS more like gcc
• Small benefits beyond benchmarks for correlation? Problems with branch aliases?

Predicated Execution
• Avoid branch prediction by turning branches into conditionally executed instructions: if (x) then A = B op C else NOP
  – If false, then neither store result nor cause exception
  – Expanded ISA of Alpha, MIPS, PowerPC, SPARC have conditional move; PA-RISC can annul any following instr.
  – IA-64: 64 1-bit condition fields selected, so conditional execution of any instruction
  – This transformation is called “if-conversion”
• Drawbacks to conditional instructions
  – Still takes a clock even if “annulled”
  – Stall if condition evaluated late
  – Complex conditions reduce effectiveness; condition becomes known late in pipeline
[Figure: instruction A = B op C guarded by condition x]

BHT Accuracy
• Mispredict because either:
  – Wrong guess for that branch
  – Got branch history of wrong branch when indexing the table
• 4096-entry table: programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
• For SPEC92, 4096 entries about as good as infinite table

Administratrivia
• Project meetings on Wednesday
  – Lots of interesting projects
  – A few a little behind, need to catch up soon and meet again
• Spring Break next week
• When return, 3rd (last) Homework on Ch 3

Tournament Predictors
• Motivation for correlating branch predictors is that the 2-bit predictor failed on important branches; by adding global information, performance improved
• Tournament predictors: use 2 predictors, 1 based on global information and 1 based on local information, and combine with a selector
• Hopes to select right predictor for right branch

Tournament Predictor in Alpha 21264
• 4K 2-bit counters to choose from among a global predictor and a local predictor
• Global predictor also has 4K entries and is indexed by the history of the last 12 branches; each entry in the global predictor is a standard 2-bit predictor
  – 12-bit pattern: ith bit 0 => ith prior branch not taken; ith bit 1 => ith prior branch taken
• Local predictor consists of a 2-level predictor:
  – Top level: a local history table consisting of 1024 10-bit entries; each 10-bit entry corresponds to the most recent 10 branch outcomes for the entry. 10-bit history allows patterns of up to 10 branches to be discovered and predicted.
  – Next level: selected entry from the local history table is used to index a table of 1K entries consisting of 3-bit saturating counters, which provide the local prediction
• Total size: 4K*2 + 4K*2 + 1K*10 + 1K*3 = 29K bits
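
The total-size arithmetic for the Alpha 21264 tournament predictor can be checked with a short sketch. The table labels and variable names below are illustrative choices, not names from the lecture; the sizes are the ones listed on the slide, and the unit is assumed to be bits.

```python
# Storage budget of the tournament predictor described above, in bits.
# Table labels are illustrative; sizes come from the slide.
K = 1024

tables = {
    "selector (4K x 2-bit counters)": 4 * K * 2,
    "global predictor (4K x 2-bit)": 4 * K * 2,
    "local history table (1K x 10-bit)": 1 * K * 10,
    "local counters (1K x 3-bit)": 1 * K * 3,
}

total_bits = sum(tables.values())
for name, bits in tables.items():
    print(f"{name}: {bits // K}K bits")
print(f"total: {total_bits // K}K bits")

# Index widths implied by the table sizes.
global_index_bits = (4 * K).bit_length() - 1  # log2(4K): 12-branch global history
local_index_bits = (1 * K).bit_length() - 1   # log2(1K): 10-bit local history
```

The index widths line up with the slide: a 4K-entry global table needs the 12-bit global history as its index, and a 1K-entry counter table is indexed by one 10-bit local history entry.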
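
The 1-bit versus 2-bit loop behavior from the Dynamic Branch Prediction slides can also be reproduced with a small simulation. This is a minimal sketch under stated assumptions: the branch is a loop-closing branch taken 9 times and then not taken once per visit, the 2-bit scheme is modeled as a saturating counter (one common realization of a scheme that flips its prediction only after two mispredictions), and the initial predictor states are chosen arbitrarily. All function names are hypothetical.

```python
def loop_outcomes(iterations_taken=9, visits=100):
    """Outcomes of a loop-closing branch: taken N times, then not taken once per visit."""
    pattern = [True] * iterations_taken + [False]
    return pattern * visits


def accuracy_1bit(outcomes):
    """1-bit scheme: predict whatever the branch did last time."""
    last = True  # assumed initial state: predict taken
    correct = 0
    for taken in outcomes:
        correct += (last == taken)
        last = taken
    return correct / len(outcomes)


def accuracy_2bit(outcomes):
    """2-bit saturating counter: values 0-1 predict not taken, 2-3 predict taken."""
    counter = 3  # assumed initial state: strongly taken
    correct = 0
    for taken in outcomes:
        correct += ((counter >= 2) == taken)
        # Move one step toward the observed outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct / len(outcomes)


outcomes = loop_outcomes()
acc1 = accuracy_1bit(outcomes)
acc2 = accuracy_2bit(outcomes)
print(f"1-bit: {acc1:.1%}, 2-bit: {acc2:.1%}")
```

Under these assumptions the 1-bit scheme mispredicts twice per loop visit (at the exit and again on re-entry), giving roughly 80% accuracy, while the 2-bit counter mispredicts only at the exit, giving about 90%.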

