Unformatted text preview:

CPE 631 ILP Dynamic Exploitation Electrical and Computer Engineering University of Alabama in Huntsville Aleksandar Milenkovi milenka ece uah edu http www ece uah edu milenka Outline Instruction Level Parallelism ILP Recap Data Dependencies Extended MIPS Pipeline and Hazards Dynamic scheduling with a scoreboard AM LaCASA 2 ILP Concepts and Challenges ILP Instruction Level Parallelism overlap execution of unrelated instructions Techniques that increase amount of parallelism exploited among instructions AM LaCASA reduce impact of data and control hazards increase processor ability to exploit parallelism Pipeline CPI Ideal pipeline CPI Structural stalls RAW stalls WAR stalls WAW stalls Control stalls Reducing each of the terms of the right hand side minimize CPI and thus increase instruction throughput 3 Two approaches to exploit parallelism Dynamic techniques largely depend on hardware to locate the parallelism Static techniques relay on software AM LaCASA 4 Techniques to exploit parallelism Technique Section in the textbook AM LaCASA Reduces Forwarding and bypassing Section A 2 Data hazard DH stalls Delayed branches A 2 Control hazard stalls Basic dynamic scheduling A 8 DH stalls RAW Dynamic scheduling with register renaming 3 2 WAR and WAW stalls Dynamic branch prediction 3 4 CH stalls Issuing multiple instruction per cycle 3 6 Ideal CPI Speculation 3 7 Data and control stalls Dynamic memory disambiguation 3 2 3 7 RAW stalls w memory Loop Unrolling 4 1 CH stalls Basic compiler pipeline scheduling A 2 4 1 DH stalls Compiler dependence analysis 4 4 Ideal CPI DH stalls Software pipelining and trace scheduling 4 3 Ideal CPI and DH stalls Compiler speculation 4 4 Ideal CPI and D CH stalls 5 Where to look for ILP Amount of parallelism available within a basic block BB straight line code sequence of instructions with no branches in except to the entry and no branches out except at the exit Example Gcc Gnu C Compiler 17 control transfer AM LaCASA 5 or 6 instructions 1 branch Dependencies amount of parallelism in a basic block is likely to be much less than 5 look beyond single block to get more instruction level parallelism Simplest and most common way to increase amount of parallelism among instruction is to exploit parallelism among iterations of a loop Loop Level Parallelism for i 1 i 1000 i x i x i s Vector Processing see Appendix G 6 Definition Data Dependencies Data dependence instruction j is data dependent on instruction i if either of the following holds AM LaCASA Instruction i produces a result used by instruction j or Instruction j is data dependent on instruction k and instruction k is data dependent on instruction i If dependent cannot execute in parallel Try to schedule to avoid hazards Easy to determine for registers fixed names Hard for memory memory disambiguation Does 100 R4 20 R6 From different loop iterations does 20 R6 20 R6 7 Examples of Data Dependencies Loop element LD D F0 0 R1 F0 array ADD D SD D DADUI F4 F0 F2 0 R1 F4 R1 R1 8 add scalar in F2 store result and decrement BNE R1 R2 Loop branch if R1 R2 pointer AM LaCASA 8 Definition Name Dependencies Two instructions use same name register or memory location but don t exchange data AM LaCASA Antidependence WAR if a hazard for HW Instruction j writes a register or memory location that instruction i reads from and instruction i is executed first Output dependence WAW if a hazard for HW Instruction i and instruction j write the same register or memory location ordering between instructions must be preserved If dependent can t execute in parallel Renaming to remove data dependencies Again Name Dependencies are Hard for Memory Accesses Does 100 R4 20 R6 From different loop iterations does 20 R6 20 R6 9 Where are the name dependencies 1 Loop L D 2 ADD D 3 S D 4 L D 5 ADD D 6 S D 7 L D 8 ADD D 9 S D 10 L D 11 ADD D 12 S D 13 SUBUI 14 BNEZ 15 NOP AM LaCASA F0 0 R1 F4 F0 F2 0 R1 F4 F0 8 R1 F4 F0 F2 8 R1 F4 F0 16 R1 F4 F0 F2 16 R1 F4 F0 24 R1 F4 F0 F2 24 R1 F4 R1 R1 32 R1 LOOP drop DSUBUI BNEZ drop DSUBUI BNEZ drop DSUBUI BNEZ alter to 4 8 How can remove them 10 Where are the name dependencies 1 Loop L D 2 ADD D 3 S D 4 L D 5 ADD D 6 S D 7 L D 8 ADD D 9 S D 10 L D 11 ADD D 12 S D 13 DSUBUI 14 BNEZ 15 NOP F0 0 R1 F4 F0 F2 0 R1 F4 F6 8 R1 F8 F6 F2 8 R1 F8 F10 16 R1 F12 F10 F2 16 R1 F12 F14 24 R1 F16 F14 F2 24 R1 F16 R1 R1 32 R1 LOOP drop DSUBUI BNEZ drop DSUBUI BNEZ drop DSUBUI BNEZ alter to 4 8 AM LaCASA The Orginal register renaming 11 Definition Control Dependencies Example if p1 S1 if p2 S2 S1 is control dependent on p1 and S2 is control dependent on p2 but not on p1 Two constraints on control dependences An instruction that is control dep on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch An instruction that is not control dep on a branch cannot be moved to after the branch so that its execution is controlled by the branch DADDU R5 R6 R7 AM LaCASA L ADD R1 R2 R3 BEQZ R4 L SUB R1 R5 R6 OR R7 R1 R8 12 Dynamically Scheduled Pipelines Overcoming Data Hazards with Dynamic Scheduling Why in HW at run time Works when can t know real dependence at compile time Simpler compiler Code for one machine runs well on another Example DIV D F0 F2 F4 ADD D F10 F0 F8 SUB D F12 F8 F12 SUB D cannot execute because the dependence of ADD D on DIV D causes the pipeline to stall yet SUBD is not data dependent on anything AM LaCASA Key idea Allow instructions behind stall to proceed 14 Overcoming Data Hazards with Dynamic Scheduling cont d Enables out of order execution out of order completion Out of order execution divides ID stage AM LaCASA 1 Issue decode instructions check for structural hazards 2 Read operands wait until no data hazards then read operands Scoreboarding technique for allowing instructions to execute out of order when there are sufficient resources and no data dependencies CDC 6600 1963 15 Scoreboarding Implications Out of order completion WAR WAW hazards DIV D F0 F2 F4 ADD D F10 F0 F8 SUB D F8 F8 F12 Solutions for WAR AM LaCASA DIV D F0 F2 F4 ADD D F10 F0 F8 SUB D F10 F8 F12 Queue both the operation and copies of its operands Read registers only during Read Operands stage For WAW must detect hazard stall until other completes Need to have multiple instructions in execution phase multiple execution units or pipelined execution units Scoreboard keeps track of dependencies state …


View Full Document

UAH CPE 631 - Instruction Level Parallelism

Loading Unlocking...
Login

Join to view Instruction Level Parallelism and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Instruction Level Parallelism and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?