U of U CS 6810 - ILP Basics and Branch Prediction

Unformatted text preview:

Page 1 1 CS6810 School of Computing University of Utah ILP Basics & Branch Prediction Today’s topics: Compiler hazard mitigation loop unrolling SW pipelining Branch Prediction 2 CS6810 School of Computing University of Utah ILP • Parallelism  independent enough  e.g. avoid stalls » control – correctly predict decision • or use branch delay slots via proper scheduling » data – forwarding or instruction scheduling » structural – duplicate resources • or avoid conflict via scheduling  hmm – scheduling looks like the key • What schedules?  compiler » knows pipeline and latencies » and source code • note: programmers can help by writing clean code » can’t know some run time status however • e.g. how data dependent conditions resolve  HW » needs to pitch in where the compiler can’tPage 2 3 CS6810 School of Computing University of Utah Basic Block Problems • Avg. dynamic branch frequency = 15%-25%   branch every 3-6 instructions » ILP is going to be hard to find • Focus on loops  major part of the execution time in the common case » Amdahl’s shouts in our ears here » optimize this equation over loops • Some loops are easy (basically vector ops) » known vector size & immediate value known to compiler » 10 cycles: L.D, RAW stall, ADD.D, 2 RAW stalls, S.D, DADDUI, RAW stall, BNE, branch delay stall 4 CS6810 School of Computing University of Utah Smarter Schedule • 6 cycles but still 1 stall  need larger loop body in order to have a chance  consider » DADDUI, BNE are loop overhead – 40% of total » rest are the actual work Loop: L.D F0, 0(R1) stall ADD.D F4, F0, F2 stall stall S.D F4, 0(R1) DADDUI R1, R1,# -8 stall BNE R1, R2, Loop stall Loop: L.D F0, 0(R1) DADDUI R1, R1,# -8 ADD.D F4, F0, F2 stall BNE R1, R2, Loop S.D F4, 8(R1) Note key issue – S.D  L.D potential for RAW stall if target is same memory addr. e.g. disambiguation problem. No intervening R1 mod and 0!=8 so no problemPage 3 5 CS6810 School of Computing University of Utah Loop Unrolling  Bigger Basic Block • Basic idea  take n loop bodies and catentate them » can’t use the same target registers or Wax stalls are a problem • increased register pressure limits value of n » adjust termination code » adjust offset values • only possible if value is immediate or a known constant in a register • Next idea  schedule instructions to avoid existing stalls » a common case is shuffle rather than catenate 6 CS6810 School of Computing University of Utah 4x Unroll Loop: L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) L.D F6, -8(R1) ADD.D F8, F6, F2 S.D F8, -8(R1) L.D F10,-16(R1) ADD.D F12, F10, F2 S.D F12, -16(R1) L.D F14, -24(R1) ADD.D F16, F14, F2 S.D F16, -24(R1) DADDUI R1, R1, #-32 BNE R1,R2, Loop Simple Unroll: 12 work instructions, 2 overhead instructions How many cycles per loop?Page 4 7 CS6810 School of Computing University of Utah 5x Unroll Loop: L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) L.D F6, -8(R1) ADD.D F8, F6, F2 S.D F8, -8(R1) L.D F10,-16(R1) ADD.D F12, F10, F2 S.D F12, -16(R1) L.D F14, -24(R1) ADD.D F16, F14, F2 S.D F16, -24(R1) DADDUI R1, R1, #-32 BNE R1,R2, Loop Simple Unroll How many cycles per loop? 4x(1 post L.D. stall + 2 post ADD.D stalls) = 12 + 1 post DADDUI and 1 post BNE stall 14 total + 14 instructions  crap still only 50% efficient Loop: L.D F0, 0(R1) L.D F6, -8(R1) L.D F10,-16(R1) L.D F14, -24(R1) ADD.D F4, F0, F2 ADD.D F8, F6, F2 ADD.D F12, F10, F2 ADD.D F16, F14, F2 S.D F4, 0(R1) S.D F8, -8(R1) DADDUI R1, R1, # -32 S.D F12, 16(R1) BNE R1,R2, Loop S.D F16, 8(R1) Schedule – mostly a shuffle no stalls 14 instructions & 4 loops 3.5 cycles per iteration = 2.857x speedup over 10 cycle original loop 1.7x speedup over scheduled unrolled loop 8 CS6810 School of Computing University of Utah Software Pipelining • Similar to loop unrolling but shuffle first  often referred to as symbolic loop unrolling  register/name management can be tricky » but same idea – create a single loop body  add this to an already unrolled loopPage 5 9 CS6810 School of Computing University of Utah Benefit Idea 10 CS6810 School of Computing University of Utah Dependency Tactic Synopsis • Consider when scheduling and unrolling  data/RAW » unrolling can provide more independent instructions • up to register availability limit » schedule to remove RAW stalls  name/Wax » rename to use different target registers » removes WAx stalls  control » the tricky part: scheduling across branches • simple in this example since there were no loop carried dependencies • easy when iteration count and offset values are known constants » much harder when things aren’t vector opsPage 6 11 CS6810 School of Computing University of Utah Control Dependence Worries • Conditional branches  instructions before the branch are “uncontrolled”  instructions after the branch are “controlled” • Scheduling constraints  must preserve controlled and uncontrolled nature of the original instructions  note: control over multiple branches is transitive • Simple in-order pipelines  instruction order is preserved » so compiler can handle the schedule • except for branch direction and memory latency uncertainties  out of order completion of EX stage introduces


View Full Document

U of U CS 6810 - ILP Basics and Branch Prediction

Documents in this Course
Caches

Caches

13 pages

Pipelines

Pipelines

14 pages

Load more
Download ILP Basics and Branch Prediction
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view ILP Basics and Branch Prediction and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view ILP Basics and Branch Prediction 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?