Administrivia

  HW 2 on pipelining due Thursday
  Continue reading Chapter 2 of H&P
  First exam scheduled for Thursday, March 5, on Units 1-3

CMSC 411 Computer Systems Architecture
Lecture 8: Instruction Level Parallelism
Alan Sussman
als@cs.umd.edu

CMSC 411 - 8 (from Patterson)

Outline

  ILP
  Compiler techniques to increase ILP
  Loop Unrolling
  Static Branch Prediction
  Dynamic Branch Prediction
  Overcoming Data Hazards with Dynamic Scheduling
  (Start) Tomasulo Algorithm
  Conclusion

Control Dependencies

  Every instruction is control dependent on some set of branches and, in general, these control dependencies must be preserved to preserve program order.

      if p1 { S1; }
      if p2 { S2; }

  S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.

Control Dependence Ignored

  Control dependence need not be preserved: we are willing to execute instructions that should not have been executed, thereby violating the control dependences, if we can do so without affecting the correctness of the program.
  Instead, the 2 properties critical to program correctness are
    1) exception behavior and
    2) data flow

Exception Behavior

  Preserving exception behavior: any changes in instruction execution order must not change how exceptions are raised in the program (no new exceptions).
  Example:

            DADDU  R2,R3,R4
            BEQZ   R2,L1
            LW     R1,0(R2)
      L1:

  (Assume branches not delayed)
  Problem with moving LW before BEQZ?

Data Flow

  Data flow: the actual flow of data values among instructions that produce results and those that consume them.
  Branches make the flow dynamic: they determine which instruction is the supplier of data.
  Example:

            DADDU  R1,R2,R3
            BEQZ   R4,L
            DSUBU  R1,R5,R6
      L:    OR     R7,R1,R8

  Does OR depend on DADDU or DSUBU? Must preserve data flow on execution.
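The data-flow example can be traced with a small sketch. This is only an illustration, not part of the original slides: the function name run_example and the register values are assumptions, and the model simply executes the four instructions in program order to show that the supplier of R1 to OR depends on the branch outcome.

```python
MASK64 = (1 << 64) - 1  # DADDU/DSUBU wrap around at 64 bits


def run_example(regs):
    """Run DADDU / BEQZ / DSUBU / OR in program order.

    regs maps register names to unsigned integers (hypothetical inputs).
    Returns (value written to R7, name of the instruction that supplied R1).
    """
    r = dict(regs)
    r["R1"] = (r["R2"] + r["R3"]) & MASK64       # DADDU R1,R2,R3
    supplier = "DADDU"
    if r["R4"] != 0:                              # BEQZ R4,L falls through when R4 != 0
        r["R1"] = (r["R5"] - r["R6"]) & MASK64   # DSUBU R1,R5,R6
        supplier = "DSUBU"
    r["R7"] = r["R1"] | r["R8"]                   # L: OR R7,R1,R8
    return r["R7"], supplier


base = {"R2": 1, "R3": 2, "R5": 10, "R6": 4, "R8": 8}
taken = run_example({**base, "R4": 0})       # branch taken, DSUBU skipped
not_taken = run_example({**base, "R4": 1})   # branch not taken, DSUBU runs
print(taken, not_taken)
```

With the branch taken, OR reads the DADDU result; otherwise it reads the DSUBU result. Any reordering has to keep that choice intact, which is what preserving data flow means here.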
Software Techniques: Example

  This code adds a scalar to a vector:

      for (i = 1000; i > 0; i = i - 1)
          x[i] = x[i] + s;

  Assume the following latencies for all examples. Ignore delayed branches in these examples.

      Instruction producing result   Instruction using result   Latency in cycles   Stalls between, in cycles
      FP ALU op                      Another FP ALU op          4                   3
      FP ALU op                      Store double               3                   2
      Load double                    FP ALU op                  1                   1
      Load double                    Store double               1                   0
      Integer op                     Integer op                 1                   0

FP Loop: Where are the Hazards?

  First translate into MIPS code. To simplify, assume 8 is the lowest address.

      Loop: L.D     F0,0(R1)     ;F0 = vector element
            ADD.D   F4,F0,F2     ;add scalar from F2
            S.D     0(R1),F4     ;store result
            DADDUI  R1,R1,-8     ;decrement pointer 8B (DW)
            BNEZ    R1,Loop      ;branch R1 != zero

FP Loop Showing Stalls

      1 Loop: L.D     F0,0(R1)     ;F0 = vector element
      2       stall
      3       ADD.D   F4,F0,F2     ;add scalar in F2
      4       stall
      5       stall
      6       S.D     0(R1),F4     ;store result
      7       DADDUI  R1,R1,-8     ;decrement pointer 8B (DW)
      8       stall                ;assumes can't forward to branch
      9       BNEZ    R1,Loop      ;branch R1 != zero

  9 clock cycles (plus branch delay). Rewrite code to minimize stalls?

Revised FP Loop Minimizing Stalls

      1 Loop: L.D     F0,0(R1)
      2       DADDUI  R1,R1,-8
      3       ADD.D   F4,F0,F2
      4       stall
      5       stall
      6       S.D     8(R1),F4     ;altered offset when moving DADDUI
      7       BNEZ    R1,Loop

  Swap DADDUI and S.D by changing the address of S.D.
  7 clock cycles, but just 3 for execution (L.D, ADD.D, S.D) and 4 for loop overhead. How to make it faster?

Unroll Loop Four Times (straightforward way)
      1  Loop: L.D     F0,0(R1)
      3        ADD.D   F4,F0,F2
      6        S.D     0(R1),F4       ;drop DADDUI & BNEZ
      7        L.D     F6,-8(R1)
      9        ADD.D   F8,F6,F2
      12       S.D     -8(R1),F8      ;drop DADDUI & BNEZ
      13       L.D     F10,-16(R1)
      15       ADD.D   F12,F10,F2
      18       S.D     -16(R1),F12    ;drop DADDUI & BNEZ
      19       L.D     F14,-24(R1)
      21       ADD.D   F16,F14,F2
      24       S.D     -24(R1),F16
      25       DADDUI  R1,R1,-32      ;alter to 4*8
      27       BNEZ    R1,Loop

  The numbers are issue cycles: each L.D is followed by a 1-cycle stall and each ADD.D by a 2-cycle stall.
  27 clock cycles, or 6.75 per iteration.
  (Assumes the iteration count is a multiple of 4, so that R1 starts as a multiple of 32.)
  Rewrite loop to minimize stalls?

Unrolled Loop Detail

  Do not usually know the upper bound of the loop.
  Suppose it is n, and we would like to unroll the loop to make k copies of the body.
  Instead of a single unrolled loop, we generate a pair of consecutive loops:
    1st executes (n mod k) times and has a body that is the original loop
    2nd is the unrolled body surrounded by an outer loop that iterates (n/k) times
  For large values of n, most of the execution time will be spent in the unrolled loop.

Unrolled Loop That Minimizes Stalls

      1  Loop: L.D     F0,0(R1)
      2        L.D     F6,-8(R1)
      3        L.D     F10,-16(R1)
      4        L.D     F14,-24(R1)
      5        ADD.D   F4,F0,F2
      6        ADD.D   F8,F6,F2
      7        ADD.D   F12,F10,F2
      8        ADD.D   F16,F14,F2
      9        S.D     0(R1),F4
      10       S.D     -8(R1),F8
      11       S.D     -16(R1),F12
      12       DADDUI  R1,R1,-32
      13       S.D     8(R1),F16      ;8 - 32 = -24
      14       BNEZ    R1,Loop

5 Loop Unrolling Decisions

  Requires understanding how one instruction depends on another and how the instructions can be changed or reordered given the dependences:
  1. Determine that loop unrolling is useful by finding that the loop iterations were independent (except for maintenance code)
  2. Use different registers to avoid unnecessary constraints forced by using the same registers for different computations
  3. Eliminate the extra test and branch instructions and adjust the loop termination and iteration code
  4. Determine that loads and …
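The pair-of-loops scheme from "Unrolled Loop Detail" can be sketched in a high-level language. This is a minimal illustration under stated assumptions, not the slides' code: the name add_scalar_unrolled and the constant K are hypothetical, the loop walks upward through a Python list rather than downward through memory, and unrolling in Python shows only the structure, not a speedup.

```python
K = 4  # number of copies of the body (assumed unroll factor)


def add_scalar_unrolled(x, s):
    """Add s to every element of x in place using a 4-way unrolled loop."""
    n = len(x)
    i = 0
    # 1st loop: the original body, executed n mod K times.
    for _ in range(n % K):
        x[i] += s
        i += 1
    # 2nd loop: the unrolled body, executed n // K times.
    # Only one index update and one loop test per four elements.
    for _ in range(n // K):
        x[i] += s
        x[i + 1] += s
        x[i + 2] += s
        x[i + 3] += s
        i += K
    return x


print(add_scalar_unrolled([0.0, 1.0, 2.0, 3.0, 4.0, 5.0], 1.5))
```

Running the leftover iterations first means the unrolled loop always sees a count that is an exact multiple of K, which is the same condition the assembly version relies on for its BNEZ test.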
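The cycle counts quoted for the four versions of the loop can be checked with a small issue-cycle model. This is a sketch under assumptions, not a pipeline simulator: the names STALLS, cycles, and the instruction kinds are hypothetical, producer/consumer pairs not listed in the latency table are assumed to need no stall, and loop-carried dependences and the branch delay are ignored.

```python
# Stall cycles between a producer and a dependent consumer (from the latency table).
STALLS = {
    ("fp", "fp"): 3,
    ("fp", "store"): 2,
    ("load", "fp"): 1,
    ("load", "store"): 0,
    ("int", "int"): 0,
    ("int", "branch"): 1,  # the slides assume no forwarding to the branch
}


def cycles(program):
    """Return the issue cycle of the last instruction (single issue, in order).

    program is a list of (kind, dest, sources); dest is None for stores and branches.
    """
    last_writer = {}  # register -> (issue cycle, kind) of its most recent producer
    cycle = 0
    for kind, dest, sources in program:
        issue = cycle + 1
        for reg in sources:
            if reg in last_writer:
                produced_at, producer_kind = last_writer[reg]
                gap = STALLS.get((producer_kind, kind), 0)  # unlisted pairs: assume 0
                issue = max(issue, produced_at + 1 + gap)
        if dest is not None:
            last_writer[dest] = (issue, kind)
        cycle = issue
    return cycle


LOAD = ("load", "F0", ["R1"])
ADD = ("fp", "F4", ["F0", "F2"])
STORE = ("store", None, ["F4", "R1"])
DADDUI = ("int", "R1", ["R1"])
BNEZ = ("branch", None, ["R1"])

original = [LOAD, ADD, STORE, DADDUI, BNEZ]
scheduled = [LOAD, DADDUI, ADD, STORE, BNEZ]

pairs = [("F0", "F4"), ("F6", "F8"), ("F10", "F12"), ("F14", "F16")]
loads = [("load", ld, ["R1"]) for ld, _ in pairs]
adds = [("fp", res, [ld, "F2"]) for ld, res in pairs]
stores = [("store", None, [res, "R1"]) for _, res in pairs]

# Straightforward unrolling: four copies of load/add/store, one DADDUI and BNEZ.
unrolled = [ins for trio in zip(loads, adds, stores) for ins in trio] + [DADDUI, BNEZ]
# Scheduled unrolling: loads, then adds, then stores, with the last store after DADDUI.
unrolled_scheduled = loads + adds + stores[:3] + [DADDUI, stores[3], BNEZ]

for name, prog in [("original", original), ("scheduled", scheduled),
                   ("unrolled", unrolled), ("unrolled_scheduled", unrolled_scheduled)]:
    print(name, cycles(prog))
```

Under these assumptions the model gives 9, 7, 27, and 14 cycles. The last figure covers four elements, so the scheduled unrolled loop costs 3.5 cycles per element.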