U of U CS 6810 - Static ILP, Branch prediction


Slide 1: Lecture 7: Static ILP, Branch prediction
• Topics: static ILP wrap-up, bimodal, global, local branch prediction (Sections 2.2-2.6)

Slide 2: Predication
• A branch within a loop can be problematic to schedule
• Control dependences are a problem because of the need to re-fetch on a mispredict
• For short loop bodies, control dependences can be converted to data dependences by using predicated/conditional instructions

Slide 3: Predicated or Conditional Instructions
• The instruction has an additional operand that determines whether the instr completes or gets converted into a no-op
• Example: lwc R1, 0(R2), R3 (load-word-conditional) will load the word at address (R2) into R1 if R3 is non-zero; if R3 is zero, the instruction becomes a no-op
• Replaces a control dependence with a data dependence (branches disappear); may need register copies for the condition or for values used by both directions

Branching version:
    if (R1 == 0)
        R2 = R2 + R4
    else
        R6 = R3 + R5
        R4 = R2 + R3

Predicated version:
    R7 = !R1 ; R8 = R2 ;
    R2 = R2 + R4 (predicated on R7)
    R6 = R3 + R5 (predicated on R1)
    R4 = R8 + R3 (predicated on R1)

Slide 4: Complications
• Each instruction has one more input operand: more register ports/bypassing
• If the branch condition is not known, the instruction stalls (remember, these are in-order processors)
• Some implementations allow the instruction to continue without the branch condition and squash/complete later in the pipeline (wasted work)
• Increases register pressure, activity on functional units
• Does not help if the br-condition takes a while to evaluate

Slide 5: Support for Speculation
• In general, when we re-order instructions, register renaming can ensure we do not violate register data dependences
• However, we need hardware support
    - to ensure that an exception is raised at the correct point
    - to ensure that we do not violate memory dependences
[Figure: instruction sequence st, br, ld]

Slide 6: Detecting Exceptions
• Some exceptions require that the program be terminated (memory protection violation), while other exceptions require execution to resume (page faults)
• For a speculative instruction, in the latter case, servicing the exception only implies potential performance loss
• In the former case, you want to defer servicing the exception until you are sure the instruction is not speculative
• Note that a speculative instruction needs a special opcode to indicate that it is speculative

Slide 7: Program-Terminate Exceptions
• When a speculative instruction experiences an exception, instead of servicing it, it writes a special NotAThing value (NAT) in the destination register
• If a non-speculative instruction reads a NAT, it flags the exception and the program terminates (it may not be desirable that the error is caused by an array access, but the segfault happens two procedures later)
• Alternatively, an instruction (the sentinel) in the speculative instruction’s original location checks the register value and initiates recovery

Slide 8: Memory Dependence Detection
• If a load is moved before a preceding store, we must ensure that the store writes to a non-conflicting address, else, the load has to re-execute
• When the speculative load issues, it stores its address in a table (Advanced Load Address Table in the IA-64)
• If a store finds its address in the ALAT, it indicates that a violation occurred for that address
• A special instruction (the sentinel) in the load’s original location checks to see if the address had a violation and re-executes the load if necessary

Slide 9: Dynamic Vs. Static ILP
• Static ILP:
    + The compiler finds parallelism → no extra hw → higher clock speeds and lower power
    + Compiler knows what is next → better global schedule
    - Compiler can not react to dynamic events (cache misses)
    - Can not re-order instructions unless you provide hardware and extra instructions to detect violations (eats into the low complexity/power argument)
    - Static branch prediction is poor → even statically scheduled processors use hardware branch predictors
    - Building an optimizing compiler is easier said than done
• A comparison of the Alpha, Pentium 4, and Itanium (statically scheduled IA-64 architecture) shows that the Itanium is not much better in terms of performance, clock speed or power

Slide 10: Control Hazards
• In the 5-stage in-order processor: assume always taken or assume always not taken; if the branch goes the other way, squash mis-fetched instructions (momentarily, forget about branch delay slots)
• Modern in-order and out-of-order processors: dynamic branch prediction; instead of a default not-taken assumption, either predict not-taken, or predict taken-to-X, or predict taken-to-Y
• Branch predictor: a cache of recent branch outcomes

Slide 11: Pipeline without Branch Predictor
[Figure: pipeline diagram with labels IF (br), PC, Reg Read, Compare, Br-target, PC + 4]
• In the 5-stage pipeline, a branch completes in two cycles
• If the branch went the wrong way, one incorrect instr is fetched
• One stall cycle per incorrect branch

Slide 12: Pipeline with Branch Predictor
[Figure: pipeline diagram with labels IF (br), PC, Reg Read, Compare, Br-target, Branch Predictor]
• In the 5-stage pipeline, a branch completes in two cycles
• If the branch went the wrong way, one incorrect instr is fetched
• One stall cycle per incorrect branch

Slide 13: Branch Mispredict Penalty
• Assume: no data or structural hazards; only control hazards; every 5th instruction is a branch; branch predictor accuracy is 90%
• Slowdown = 1 / (1 + stalls per instruction)
• Stalls per instruction = % branches x % mispreds x penalty
                         = 20% x 10% x 1
                         = 0.02
• Slowdown = 1/1.02; if penalty = 20, slowdown = 1/1.4

Slide 14: 1-Bit Bimodal Prediction
• For each branch, keep track of what happened last time and use that outcome as the prediction
• What are prediction accuracies for branches 1 and 2 below:

    while (1) {
        for (i=0;i<10;i++) {    branch-1
            …
        }
        for (j=0;j<20;j++) {    branch-2
            …
        }
    }

Slide 15: 2-Bit Bimodal Prediction
• For each branch, maintain a 2-bit saturating counter:
    if the branch is taken: counter = min(3,counter+1)
    if the branch is not taken: counter = max(0,counter-1)
• If (counter >= 2), predict taken, else predict not taken
• Advantage: a few atypical branches will not influence the prediction (a better measure of “the common case”)
• Especially useful when multiple branches share the same counter (some bits of the branch PC are used to index into the branch predictor)
• Can be easily extended to N-bits (in most processors, N=2)

Slide 16: Correlating Predictors
• Basic branch prediction: maintain a 2-bit saturating counter for each entry (or use 10 branch PC bits to index into one of 1024 counters); captures the recent “common case” for each branch
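As a rough check of the if-conversion example on the Predicated or Conditional Instructions slide, the sketch below models both versions in Python and compares the registers the original code can see (R1 to R6). The function names, the dictionary-based register file, and the `pred_add` helper are illustrative assumptions, not part of the lecture; only the instruction sequence itself comes from the slide.

```python
# Registers visible to the original (branching) code; R7 and R8 are
# temporaries introduced only by the predicated version.
ARCH_REGS = ("R1", "R2", "R3", "R4", "R5", "R6")


def branchy(regs):
    """Branching version from the slide, run on a copy of the register file."""
    r = dict(regs)
    if r["R1"] == 0:
        r["R2"] = r["R2"] + r["R4"]
    else:
        r["R6"] = r["R3"] + r["R5"]
        r["R4"] = r["R2"] + r["R3"]
    return r


def predicated(regs):
    """Predicated version from the slide: no branch, every instruction issues."""
    r = dict(regs)
    r["R7"] = int(r["R1"] == 0)  # R7 = !R1
    r["R8"] = r["R2"]            # R8 = R2 (register copy)

    def pred_add(dst, a, b, pred):
        # Hypothetical helper: the add completes only if the predicate
        # register is non-zero; otherwise it behaves as a no-op.
        if r[pred] != 0:
            r[dst] = r[a] + r[b]

    pred_add("R2", "R2", "R4", "R7")  # R2 = R2 + R4 (predicated on R7)
    pred_add("R6", "R3", "R5", "R1")  # R6 = R3 + R5 (predicated on R1)
    pred_add("R4", "R8", "R3", "R1")  # R4 = R8 + R3 (predicated on R1)
    return r


def visible(regs):
    """Project a register file onto the registers the original code uses."""
    return {name: regs[name] for name in ARCH_REGS}


if __name__ == "__main__":
    for r1 in (0, 5):
        start = {"R1": r1, "R2": 10, "R3": 3, "R4": 4, "R5": 7, "R6": 0}
        print(r1, visible(branchy(start)) == visible(predicated(start)))
```

Both values of R1 should leave R1 to R6 in the same state in the two versions, which is the sense in which the control dependence has been replaced by data dependences on R7 and R1.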

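The arithmetic on the Branch Mispredict Penalty slide can be reproduced with a few lines of Python. This is a minimal sketch that assumes, as the slide does, an ideal pipeline with a base CPI of 1 and no stalls other than branch mispredicts; the function and parameter names are made up for illustration.

```python
def stalls_per_instruction(branch_fraction, mispredict_rate, penalty_cycles):
    """Average stall cycles added to each instruction by mispredicted branches."""
    return branch_fraction * mispredict_rate * penalty_cycles


def slowdown(branch_fraction, mispredict_rate, penalty_cycles):
    """Performance relative to an ideal pipeline, assuming a base CPI of 1."""
    stalls = stalls_per_instruction(branch_fraction, mispredict_rate, penalty_cycles)
    return 1.0 / (1.0 + stalls)


if __name__ == "__main__":
    # Every 5th instruction is a branch (20%); 90% accuracy means 10% mispredicts.
    for penalty in (1, 20):
        print(penalty,
              stalls_per_instruction(0.20, 0.10, penalty),
              slowdown(0.20, 0.10, penalty))
```

With a penalty of 1 cycle this gives 0.02 stalls per instruction (slowdown 1/1.02); with a penalty of 20 cycles it gives 0.4 stalls per instruction (slowdown 1/1.4), matching the slide.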

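To experiment with the question on the 1-Bit Bimodal Prediction slide, the sketch below simulates a 1-bit predictor and the 2-bit saturating counter from the 2-Bit Bimodal Prediction slide on a loop-closing branch. It rests on assumptions the slides do not spell out: branch-1 and branch-2 are treated as bottom-of-loop back-edges (taken on every iteration except the last), each branch has its own private predictor entry (no aliasing), and accuracy is measured after a short warm-up. The class and function names are illustrative.

```python
def loop_branch_outcomes(trip_count):
    """Assumed back-edge behaviour: taken trip_count - 1 times, then not taken."""
    return [True] * (trip_count - 1) + [False]


class OneBitPredictor:
    """Predicts whatever the branch did last time."""

    def __init__(self):
        self.last_taken = False

    def predict(self):
        return self.last_taken

    def update(self, taken):
        self.last_taken = taken


class TwoBitPredictor:
    """2-bit saturating counter; predicts taken when counter >= 2."""

    def __init__(self):
        self.counter = 0

    def predict(self):
        return self.counter >= 2

    def update(self, taken):
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)


def steady_state_accuracy(predictor, trip_count, warmup_rounds=5, measured_rounds=100):
    """Fraction of correct predictions over repeated executions of one loop."""
    pattern = loop_branch_outcomes(trip_count)
    correct = 0
    total = 0
    for round_index in range(warmup_rounds + measured_rounds):
        for taken in pattern:
            hit = predictor.predict() == taken
            predictor.update(taken)
            if round_index >= warmup_rounds:  # skip the warm-up rounds
                total += 1
                correct += int(hit)
    return correct / total


if __name__ == "__main__":
    for name, trips in (("branch-1", 10), ("branch-2", 20)):
        print(name,
              steady_state_accuracy(OneBitPredictor(), trips),
              steady_state_accuracy(TwoBitPredictor(), trips))
```

Under these assumptions the 1-bit predictor mispredicts twice per pass through each loop (the first and the last iteration), giving 80% for branch-1 and 90% for branch-2, while the 2-bit counter mispredicts only the loop exit, giving 90% and 95%. A different reading of where branch-1 and branch-2 sit in the loop would change the outcome pattern and therefore these numbers.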