Page 1 1 CS6810 School of Computing University of Utah ILP Basics & Branch Prediction Today’s topics: Compiler hazard mitigation loop unrolling SW pipelining Branch Prediction 2 CS6810 School of Computing University of Utah ILP • Parallelism independent enough e.g. avoid stalls » control – correctly predict decision • or use branch delay slots via proper scheduling » data – forwarding or instruction scheduling » structural – duplicate resources • or avoid conflict via scheduling hmm – scheduling looks like the key • What schedules? compiler » knows pipeline and latencies » and source code • note: programmers can help by writing clean code » can’t know some run time status however • e.g. how data dependent conditions resolve HW » needs to pitch in where the compiler can’tPage 2 3 CS6810 School of Computing University of Utah Basic Block Problems • Avg. dynamic branch frequency = 15%-25% branch every 3-6 instructions » ILP is going to be hard to find • Focus on loops major part of the execution time in the common case » Amdahl’s shouts in our ears here » optimize this equation over loops • Some loops are easy (basically vector ops) » known vector size & immediate value known to compiler » 10 cycles: L.D, RAW stall, ADD.D, 2 RAW stalls, S.D, DADDUI, RAW stall, BNE, branch delay stall 4 CS6810 School of Computing University of Utah Smarter Schedule • 6 cycles but still 1 stall need larger loop body in order to have a chance consider » DADDUI, BNE are loop overhead – 40% of total » rest are the actual work Loop: L.D F0, 0(R1) stall ADD.D F4, F0, F2 stall stall S.D F4, 0(R1) DADDUI R1, R1,# -8 stall BNE R1, R2, Loop stall Loop: L.D F0, 0(R1) DADDUI R1, R1,# -8 ADD.D F4, F0, F2 stall BNE R1, R2, Loop S.D F4, 8(R1) Note key issue – S.D L.D potential for RAW stall if target is same memory addr. e.g. disambiguation problem. No intervening R1 mod and 0!=8 so no problemPage 3 5 CS6810 School of Computing University of Utah Loop Unrolling Bigger Basic Block • Basic idea take n loop bodies and catentate them » can’t use the same target registers or Wax stalls are a problem • increased register pressure limits value of n » adjust termination code » adjust offset values • only possible if value is immediate or a known constant in a register • Next idea schedule instructions to avoid existing stalls » a common case is shuffle rather than catenate 6 CS6810 School of Computing University of Utah 4x Unroll Loop: L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) L.D F6, -8(R1) ADD.D F8, F6, F2 S.D F8, -8(R1) L.D F10,-16(R1) ADD.D F12, F10, F2 S.D F12, -16(R1) L.D F14, -24(R1) ADD.D F16, F14, F2 S.D F16, -24(R1) DADDUI R1, R1, #-32 BNE R1,R2, Loop Simple Unroll: 12 work instructions, 2 overhead instructions How many cycles per loop?Page 4 7 CS6810 School of Computing University of Utah 5x Unroll Loop: L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) L.D F6, -8(R1) ADD.D F8, F6, F2 S.D F8, -8(R1) L.D F10,-16(R1) ADD.D F12, F10, F2 S.D F12, -16(R1) L.D F14, -24(R1) ADD.D F16, F14, F2 S.D F16, -24(R1) DADDUI R1, R1, #-32 BNE R1,R2, Loop Simple Unroll How many cycles per loop? 4x(1 post L.D. stall + 2 post ADD.D stalls) = 12 + 1 post DADDUI and 1 post BNE stall 14 total + 14 instructions crap still only 50% efficient Loop: L.D F0, 0(R1) L.D F6, -8(R1) L.D F10,-16(R1) L.D F14, -24(R1) ADD.D F4, F0, F2 ADD.D F8, F6, F2 ADD.D F12, F10, F2 ADD.D F16, F14, F2 S.D F4, 0(R1) S.D F8, -8(R1) DADDUI R1, R1, # -32 S.D F12, 16(R1) BNE R1,R2, Loop S.D F16, 8(R1) Schedule – mostly a shuffle no stalls 14 instructions & 4 loops 3.5 cycles per iteration = 2.857x speedup over 10 cycle original loop 1.7x speedup over scheduled unrolled loop 8 CS6810 School of Computing University of Utah Software Pipelining • Similar to loop unrolling but shuffle first often referred to as symbolic loop unrolling register/name management can be tricky » but same idea – create a single loop body add this to an already unrolled loopPage 5 9 CS6810 School of Computing University of Utah Benefit Idea 10 CS6810 School of Computing University of Utah Dependency Tactic Synopsis • Consider when scheduling and unrolling data/RAW » unrolling can provide more independent instructions • up to register availability limit » schedule to remove RAW stalls name/Wax » rename to use different target registers » removes WAx stalls control » the tricky part: scheduling across branches • simple in this example since there were no loop carried dependencies • easy when iteration count and offset values are known constants » much harder when things aren’t vector opsPage 6 11 CS6810 School of Computing University of Utah Control Dependence Worries • Conditional branches instructions before the branch are “uncontrolled” instructions after the branch are “controlled” • Scheduling constraints must preserve controlled and uncontrolled nature of the original instructions note: control over multiple branches is transitive • Simple in-order pipelines instruction order is preserved » so compiler can handle the schedule • except for branch direction and memory latency uncertainties out of order completion of EX stage introduces
View Full Document