U of U CS 6810 - Static ILP, Branch prediction

Slide 1: Lecture 6: Static ILP, Branch Prediction

• Topics: static ILP wrap-up; bimodal, global, and local branch prediction (Sections 3.2-3.3)
• No class on Thursday, 2nd Feb
• Move the final from in-class to finals week?
• Homework 2 due next Tuesday

Slide 2: Superscalar Pipelines

        Integer pipeline          FP pipeline
Loop:   L.D    F0,0(R1)
        L.D    F6,-8(R1)
        L.D    F10,-16(R1)        ADD.D  F4,F0,F2
        L.D    F14,-24(R1)        ADD.D  F8,F6,F2
        L.D    F18,-32(R1)        ADD.D  F12,F10,F2
        S.D    F4,0(R1)           ADD.D  F16,F14,F2
        S.D    F8,-8(R1)          ADD.D  F20,F18,F2
        S.D    F12,-16(R1)
        DADDUI R1,R1,#-40
        S.D    F16,16(R1)
        BNE    R1,R2,Loop
        S.D    F20,8(R1)

• The loop must be unrolled by a degree of 5 to eliminate stalls
• The compiler may specify instructions that can be issued as one packet
• The compiler may specify a fixed number of instructions in each packet: Very Large Instruction Word (VLIW)

Slide 3: Software Pipeline?!

(figure: successive iterations of the loop below overlapped in time — the L.D, ADD.D, and S.D of different iterations, plus the DADDUI and BNE, execute concurrently)

Loop:   L.D    F0,0(R1)
        ADD.D  F4,F0,F2
        S.D    F4,0(R1)
        DADDUI R1,R1,#-8
        BNE    R1,R2,Loop

Slide 4: Software Pipeline

(figure: the same overlap re-grouped — each new iteration contains the S.D of original iteration i, the ADD.D of original iteration i+1, and the L.D of original iteration i+2)

Slide 5: Software Pipelining

Original loop:                      Software-pipelined loop:
Loop:   L.D    F0,0(R1)             Loop:   S.D    F4,16(R1)
        ADD.D  F4,F0,F2                     ADD.D  F4,F0,F2
        S.D    F4,0(R1)                     L.D    F0,0(R1)
        DADDUI R1,R1,#-8                    DADDUI R1,R1,#-8
        BNE    R1,R2,Loop                   BNE    R1,R2,Loop

• Advantages: achieves nearly the same effect as loop unrolling, but without the code expansion
  – an unrolled loop may have inefficiencies at the start and end of each iteration, while a software-pipelined loop is almost always in steady state
  – a software-pipelined loop can also be unrolled to reduce loop overhead
• Disadvantages: does not reduce loop overhead, may require more registers
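The software-pipelining transformation on the slides above can be sketched at the source level. The following Python sketch (the function name is mine, not from the lecture) performs the same reordering as the transformed MIPS loop: each steady-state iteration stores the result of iteration i, does the add for iteration i+1, and loads for iteration i+2, with a prologue and epilogue to fill and drain the pipeline:

```python
def add_scalar_pipelined(x, s):
    """Software-pipelined form of `for i: x[i] = x[i] + s` (needs len(x) >= 2)."""
    n = len(x)
    # prologue: fill the pipeline (load + add for iter 0, load for iter 1)
    loaded = x[0]
    summed = loaded + s
    loaded = x[1]
    # steady state: the three operations come from three different original
    # iterations, so none depends on another within the same new iteration
    for i in range(n - 2):
        x[i] = summed        # S.D   -- store the result of iteration i
        summed = loaded + s  # ADD.D -- add for iteration i + 1
        loaded = x[i + 2]    # L.D   -- load for iteration i + 2
    # epilogue: drain the pipeline
    x[n - 2] = summed
    x[n - 1] = loaded + s
```

This mirrors the S.D / ADD.D / L.D ordering of the software-pipelined loop body, with the prologue and epilogue playing the role of the start-up and wind-down code the compiler must emit.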
Slide 6: Predication

• A branch within a loop can be problematic to schedule
• Control dependences are a problem because of the need to re-fetch on a mispredict
• For short loop bodies, control dependences can be converted to data dependences by using predicated/conditional instructions

Slide 7: Predicated or Conditional Instructions

• The instruction has an additional operand that determines whether the instruction completes or gets converted into a no-op
• Example: lwc R1, 0(R2), R3 (load-word-conditional) loads the word at address (R2) into R1 if R3 is non-zero; if R3 is zero, the instruction becomes a no-op
• Replaces a control dependence with a data dependence (the branches disappear); may need register copies for the condition or for values used by both directions

Branching code:
    if (R1 == 0)
        R2 = R2 + R4
    else {
        R6 = R3 + R5
        R4 = R2 + R3
    }

Predicated code:
    R7 = !R1 ; R8 = R2
    R2 = R2 + R4    (predicated on R7)
    R6 = R3 + R5    (predicated on R1)
    R4 = R8 + R3    (predicated on R1)

Slide 8: Complications

• Each instruction has one more input operand – requires more register ports/bypassing
• If the branch condition is not known, the instruction stalls (remember, these are in-order processors)
• Some implementations allow the instruction to continue without the branch condition and squash/complete it later in the pipeline – wasted work
• Increases register pressure and activity on the functional units
• Does not help if the branch condition takes a while to evaluate

Slide 9: Support for Speculation

• In general, when we re-order instructions, register renaming can ensure that we do not violate register data dependences
• However, we need hardware support to ensure that an exception is raised at the correct point and that we do not violate memory dependences

(figure: a load hoisted above a preceding store and branch)

Slide 10: Detecting Exceptions

• Some exceptions require that the program be terminated (e.g., a memory protection violation), while other exceptions require execution to resume (e.g., page faults)
• For a speculative instruction, in the latter case, servicing the exception only implies a potential performance loss; in the former case, you want to defer servicing the exception until you are sure the instruction is not speculative
• Note that a speculative instruction needs a special opcode to indicate that it is speculative

Slide 11: Program-Terminate Exceptions

• When a speculative instruction experiences an exception, instead of servicing it, it writes a special NotAThing (NAT) value into the destination register
• If a non-speculative instruction reads a NAT, it flags the exception and the program terminates (this may be undesirable: the error is caused by an array access, but the segfault happens two procedures later)
• Alternatively, an instruction (the sentinel) in the speculative instruction's original location checks the register value and initiates recovery

Slide 12: Memory Dependence Detection

• If a load is moved before a preceding store, we must ensure that the store writes to a non-conflicting address; otherwise, the load has to re-execute
• When the speculative load issues, it stores its address in a table (the Advanced Load Address Table, ALAT, in the IA-64)
• If a store finds its address in the ALAT, it indicates that a violation occurred for that address
• A special instruction (the sentinel) in the load's original location checks whether the address had a violation and re-executes the load if necessary
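The if-conversion example on the predication slide can be mimicked in software. In this Python sketch (a model of the idea, not of any real ISA; both function names are mine), every "instruction" executes and its predicate decides whether the result is kept; R8 preserves the old value of R2 because the straight-line sequence overwrites R2 before the else-path add would read it:

```python
def branchy(R1, R2, R3, R4, R5, R6):
    # the original control-dependent code
    if R1 == 0:
        R2 = R2 + R4
    else:
        R6 = R3 + R5
        R4 = R2 + R3
    return R2, R4, R6

def predicated(R1, R2, R3, R4, R5, R6):
    # straight-line version: no branches, only data dependences
    R7 = (R1 == 0)                    # R7 = !R1
    R8 = R2                           # register copy of the old R2
    R2 = R2 + R4 if R7 else R2        # predicated on R7 (no-op if false)
    R6 = R3 + R5 if not R7 else R6    # predicated on R1
    R4 = R8 + R3 if not R7 else R4    # predicated on R1
    return R2, R4, R6
```

Both functions compute the same results on either path; the predicated version simply trades the branch for extra (possibly wasted) work, which is exactly the trade-off the Complications slide describes.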
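The ALAT mechanism on the memory-dependence slide can be modeled as a set of recorded addresses. This is a deliberate simplification (the real IA-64 ALAT is a hardware table also indexed by the load's destination register; the class and method names here are mine):

```python
class ToyALAT:
    """Minimal model of an Advanced Load Address Table."""

    def __init__(self):
        self.addresses = set()

    def advanced_load(self, addr):
        # a speculatively hoisted load records its address
        self.addresses.add(addr)

    def store(self, addr):
        # a store that hits a recorded address invalidates the entry:
        # the hoisted load may have read stale data
        self.addresses.discard(addr)

    def check(self, addr):
        # the sentinel check in the load's original position:
        #   True  -> no conflicting store, the speculation was safe
        #   False -> a violation occurred, re-execute the load
        return addr in self.addresses
```

A store to a different address leaves the entry intact and the sentinel check passes; a store to the recorded address removes the entry, so the check fails and the load is re-executed.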
Slide 13: Dynamic vs. Static ILP

• Static ILP:
  + The compiler finds the parallelism → no extra hardware → higher clock speeds and lower power
  + The compiler knows what comes next → better global schedule
  – The compiler cannot react to dynamic events (e.g., cache misses)
  – Cannot re-order instructions unless hardware and extra instructions are provided to detect violations (this eats into the low-complexity/low-power argument)
  – Static branch prediction is poor → even statically scheduled processors use hardware branch predictors
  – Building an optimizing compiler is easier said than done
• A comparison of the Alpha, Pentium 4, and Itanium (the statically scheduled IA-64 architecture) shows that the Itanium is not much better in terms of performance, clock speed, or