Static Code SchedulingCode SchedulingWhy Must the Compiler Schedule?ExampleSlide 5Code Scheduling StrategyScheduling ApproachesBranch SchedulingRecall from Architecture…Control HazardsData DependencesData HazardsSlide 13Multi-cycle InstructionsAvoiding data hazardsExample: Without SchedulingBasic Block Dependence DAGSExample – Build the DAGCreating a scheduleOperation PriorityComputing PrioritiesExample – Determine Height and CPExample – List SchedulingScheduling vs. Register AllocationRegister RenamingVLIWSample VLIW codeMulti-Issue Scheduling ExampleEarliest Latest SetsList Scheduling AlgorithmImproving Basic Block SchedulingStatic Code SchedulingCS 671April 1, 20082CS 671 – Spring 2008Code SchedulingScheduling or reordering instructions to improve performance and/or guarantee correctness•Important for dynamically-scheduled architectures•Crucial (assumed!) for statically-scheduled architectures, e.g. VLIW or EPICTakes into account anticipated latencies•Machine-specific, performed later in the optimization passHow does this contrast with our earlier exploration of code motion?3CS 671 – Spring 2008Many machines are pipelined and expose some aspects of pipelining to the user (compiler)Examples:•Branch delay slots!•Memory-access delays•Multi-cycle operationsSome machines don’t have scheduling hardwareWhy Must the Compiler Schedule?4CS 671 – Spring 2008ExampleAssume loads take 2 cycles and branches have a delay slot.____cyclesinstruction start timer2 [r1]r3 [r1+4]r4 r2 + r3r5 r2 + 1goto L1nop5CS 671 – Spring 2008ExampleAssume loads take 2 cycles and branches have a delay slot.____cyclesinstruction start timer2 [r1]r3 [r1+4]r5 r2 + 1goto L1r4 r2 + r36CS 671 – Spring 2008Code Scheduling StrategyGet resources operating in parallel•Integer data path•Integer multiply / divide hardware•FP adder, multiplier, dividerMethod•Fill with computations that do not require result or same hardware resourcesDrawbacks•Highly hardware dependentStart OpUse OpTry to fill7CS 671 – Spring 2008Scheduling ApproachesLocalBranch schedulingBasic-block schedulingGlobalCross-block schedulingSoftware pipeliningTrace schedulingPercolation scheduling8CS 671 – Spring 2008Branch SchedulingTwo problems: Branches often take some number of cycles to completeCan be a delay between a compare b and its associated branchA compiler will try to fill these slots with valid instructions (rather than nop)Delay slots – present in PA-RISC, SPARC, MIPSCondition delay – PowerPC, Pentium9CS 671 – Spring 2008Recall from Architecture…IF – Instruction FetchID – Instruction DecodeEX – ExecuteMA – Memory accessWB – Write backIFIFIFIDIDIDEXEXEXMAMAMAWBWBWB10CS 671 – Spring 2008Control HazardsIFIFID---EX---MA--- ---WBIF ID EX MA WBIF ID EX MA WBTaken BranchInstr + 1Branch TargetBranch Target + 111CS 671 – Spring 2008Data DependencesIf two operations access the same register, they are dependentTypes of data dependencesFlowOutput Antir1 = r2 + r3r4 = r1 * 6r1 = r2 + r3r1 = r4 * 6r1 = r2 + r3r2 = r5 * 612CS 671 – Spring 2008Data HazardsIFIFIDIDEXEXMAMA WBWBlw R1,0(R2)add R3,R1,R4stallMemory latency: data not ready13CS 671 – Spring 2008Data HazardsIFIFIDIDEX EX MAMA WBWBaddf R3,R1,R2addf R3,R3,R4stallEX EXAssumes floating point ops take 2 execute cyclesInstruction latency: execute takes > 1 cycle14CS 671 – Spring 2008Multi-cycle Instructions•Scheduling is particularly important for multi-cycle operations•Alpha instructions > 1 cycle latency (partial list)mull (32-bit integer multiply) 8mulq (64-bit integer multiply) 16addt (fp add) 4mult (fp multiply) 4divs (fp single-precision divide) 10divt (fp double-precision divide) 2315CS 671 – Spring 2008Avoiding data hazards•Move loads earlier and stores later (assuming this does not violate correctness) •Other stalls may require more sophisticated re-ordering, i.e. ((a+b)+c)+d becomes (a+b)+(c+d) •How can we do this in a systematic way??16CS 671 – Spring 2008Example: Without SchedulingStart TimeCodelw r1, wadd r1,r1,r1lw r2,xmult r1,r1,r2lw r2,ymult r1,r1,r2lw r2,zmult r1,r1,r2sw r1, aAssume:• memory instrs take 3 cycles• mult takes 2 cycles (to haveresult in register)• rest take 1 cycle____cycles17CS 671 – Spring 2008Basic Block Dependence DAGSNodes - instructionsEdges - dependence between I1 and I2•When we cannot determine whether there is a dependence, we must assume there is onea) lw R2, (R1)b) lw R3, (R1) 4c) R4 R2 + R3d) R5 R2 - 1a bd c22218CS 671 – Spring 2008Example – Build the DAGCodea lw r1, wb add r1,r1,r1c load r2,xd mult r1,r1,r2e load r2,yf mult r1,r1,r2g load r2,zh mult r1,r1,r2i sw r1, aAssume: memory instrs = 3 mult = 2 (to have result in register) rest = 1 cycle19CS 671 – Spring 2008Creating a schedule•Create a DAG of dependences•Determine priority•Schedule instructions with–Ready operands–Highest priority•Heuristics: If multiple possibilities, fall back on other priority functions20CS 671 – Spring 2008Operation PriorityPriority – Need a mechanism to decide which ops to schedule first (when you have choices)Common priority functions•Height – Distance from exit node–Give priority to amount of work left to do•Slackness – inversely proportional to slack–Give priority to ops on the critical path•Register use – priority to nodes with more source operands and fewer destination operands–Reduces number of live registers •Uncover – high priority to nodes with many children–Frees up more nodes•Original order – when all else fails21CS 671 – Spring 2008Computing PrioritiesHeight(n) =•exec(n) if n is a leaf•max(height(m)) + exec(n) for m, where m is a successor of nCritical path(s) = path through the dependence DAG with longest latency22CS 671 – Spring 2008Example – Determine Height and CPCodea lw r1, wb add r1,r1,r1c lw r2,xd mult r1,r1,r2e lw r2,yf mult r1,r1,r2g lw r2,zh mult r1,r1,r2i sw r1, aAssume: memory instrs = 3 mult = 2 = (to have result in register) rest = 1 cycleCritical path: _______23CS 671 – Spring 2008Example – List SchedulingCodea lw r1, wb add r1,r1,r1c lw r2,xd mult r1,r1,r2e lw r2,yf mult r1,r1,r2g lw r2,zh mult r1,r1,r2i sw r1, astartSchedule_____cycles24CS 671 – Spring 2008Scheduling vs. Register AllocationCodea lw r1 (r12)b lw r2 (r12+4)c r1 r1+r2d stw (r12) r1e lw r1 (r12+8)f lw
View Full Document