Unformatted text preview:

CPE 631 Lecture 17 Exploiting ILP with SW Approaches Aleksandar Milenkovi milenka ece uah edu Electrical and Computer Engineering University of Alabama in Huntsville CPE 631 AM Outline Basic Pipeline Scheduling and Loop Unrolling Multiple Issue Superscalar VLIW Software Pipelining HW Speculation Extensions to Tomasulo s Algorithm 14 01 19 UAH CPE631 2 CPE 631 AM ILP Concepts and Challenges ILP Instruction Level Parallelism overlap execution of unrelated instructions Techniques that increase amount of parallelism exploited among instructions reduce impact of data and control hazards increase processor ability to exploit parallelism Pipeline CPI Ideal pipeline CPI Structural stalls RAW stalls WAR stalls WAW stalls Control stalls Reducing each of the terms of the right hand side minimize CPI and thus increase instruction throughput 14 01 19 UAH CPE631 3 CPE 631 AM Basic Pipeline Scheduling Example Simple loop Assumptions for i 1 i 1000 i x i x i s Instruction producing result FP ALU op FP ALU op Load double Load double Integer op Instruction using result Another FP ALU op Store double FP ALU op Store double Integer op Latency in clock cycles 3 2 1 0 0 R1 points to the last element in the array for simplicity we assume that x 0 is at the address 0 Loop L D F0 0 R1 F0 array el ADD D F4 F0 F2 add scalar in F2 S D 0 R1 F4 store result SUBI R1 R1 8 decrement pointer BNEZ R1 Loop branch 14 01 19 UAH CPE631 4 CPE 631 AM Executing FP Loop 1 Loop 2 3 4 5 6 7 8 9 10 Instruction producing result FP ALU op FP ALU op Load double Load double Integer op 14 01 19 LD Stall ADDD Stall Stall SD SUBI Stall BNEZ Stall F0 0 R1 F4 F0 F2 10 clocks per iteration 5 stalls Rewrite code to minimize stalls 0 R1 F4 R1 R1 8 R1 Loop Instruction using result Another FP ALU op Store double FP ALU op Store double Integer op Latency in clock cycles 3 2 1 0 0 UAH CPE631 5 CPE Revised FP loop to 631 AM 1 Loop LD F0 0 R1 2 SUBI R1 R1 8 3 ADDD F4 F0 F2 4 Stall 5 BNEZ R1 Loop 6 SD 8 R1 F4 minimise stalls Swap BNEZ and SD by changing address of SD SUBI is moved up delayed branch altered and interch SUBI 6 clocks per iteration 1 stall but only 3 instructions do the actual work processing the array LD ADDD SD Unroll loop 4 times to improve potential for instr scheduling Instruction producing result FP ALU op FP ALU op Load double Load double Integer op 14 01 19 Instruction using result Another FP ALU op Store double FP ALU op Store double Integer op UAH CPE631 Latency in clock cycles 3 2 1 0 0 6 CPE Unrolled Loop that Minimise Stalls 631 AM Loop LD F0 0 R1 This loop will run 14 cycles LD F6 8 R1 no stalls per iteration or 14 4 3 5 for each element LD F10 16 R1 LD F14 24 R1 Assumptions that make this ADDD F4 F0 F2 possible ADDD F8 F6 F2 move LDs before SDs ADDD F12 F10 F2 move SD after SUBI and BNEZ ADDD F16 F14 F2 use different registers SD 0 R1 F4 SD 8 R1 F8 When is it safe for compiler SUBI R1 R1 32 to do such changes SD 16 R1 F12 BNEZ R1 Loop SD 8 R1 F4 14 01 19 UAH CPE631 7 CPE 631 AM Multiple Issue Processors Two variations Superscalar varying no instructions cycle 1 to 8 scheduled by compiler or by HW Tomasulo IBM PowerPC Sun UltraSparc DEC Alpha HP 8000 Very Long Instruction Words V LIW fixed number of instructions 4 16 scheduled by the compiler put ops into wide templates Crusoe VLIW processor www transmeta com Intel Architecture 64 IA 64 64 bit address Style Explicitly Parallel Instruction Computer EPIC Anticipated success lead to use of Instructions Per Clock cycle IPC vs CPI 14 01 19 UAH CPE631 8 CPE 631 AM Superscalar MIPS Superscalar MIPS 2 instructions 1 FP 1 anything else Fetch 64 bits clock cycle Int on left FP on right Can only issue 2nd instruction if 1st instruction issues More ports for FP registers to do FP load FP op in a pair 10 5 I IF ID Ex Mem WB FP IF ID Ex Mem WB I IF ID Ex Mem WB FP IF ID Ex Mem WB IF ID Ex Mem WB IF ID Ex Mem WB I FP Instr 14 01 19 Time clocks Note FP operations extend EX cycle UAH CPE631 9 CPE 631 AM Loop Unrolling in Superscalar Integer Instr 1 Loop FP Instr LD F0 0 R1 2 LD F6 8 R1 3 LD F10 16 R1 ADDD F4 F0 F2 4 LD F14 24 R1 ADDD F8 F6 F2 5 LD F18 32 R1 ADDD F12 F10 F2 6 SD 0 R1 F4 ADDD F16 F14 F2 7 SD 8 R1 F8 ADDD F20 F18 F2 8 SD 16 R1 F12 9 SUBI R1 R1 40 10 SD 16 R1 F16 11 BNEZ R1 Loop 12 SD 8 R1 F20 14 01 19 UAH CPE631 Unrolled 5 times to avoid delays This loop will run 12 cycles no stalls per iteration or 12 5 2 4 for each element of the array 10 CPE 631 AM The VLIW Approach Ii VLIWs use multiple independent functional units VLIWs package the multiple operations into one very long instruction Compiler is responsible to choose instructions to be issued simultaneously IF Ii 1 ID E E E W IF ID E E E Instr 14 01 19 Time clocks W UAH CPE631 11 CPE 631 AM Loop Unrolling in VLIW Mem Ref1 Mem Ref 2 1 LD F2 0 R1 LD F6 8 R1 2 LD F10 16 R1 LD F14 24 R1 3 LD F18 32 R1 LD F22 40 R1 FP1 FP2 ADDD F4 F0 F2 ADDD F8 F0 F6 4 LD F26 48 R1 ADDD F12 F0 F10 ADDD F16 F0 F14 5 ADDD F20 F0 F18 ADDD F24 F0 F22 Int Branch 6 SD 0 R1 F4 SD 8 R1 F8 ADDD F28 F0 F26 7 SD 16 R1 F12 SD 24 R1 F16 SUBI R1 R1 56 8 SD 24 R1 F20 SD 16 R1 F24 BNEZ R1 Loop 9 SD 8 R1 F28 Unrolled 7 times to avoid delays 7 results in 9 clocks or 1 3 clocks per each element 1 8X Average 2 5 ops per clock 50 efficiency Note Need more registers in VLIW 15 vs 6 in SS 14 01 19 UAH CPE631 12 CPE 631 AM Multiple Issue Challenges While Integer FP split is simple for the HW get CPI of 0 5 only for programs with Exactly 50 FP operations No hazards If more instructions issue at same time greater difficulty of decode and issue Even 2 scalar examine 2 opcodes 6 register specifiers decide if 1 or 2 instructions can issue VLIW tradeoff instruction space for simple decoding The long instruction …


View Full Document

UAH CPE 631 - Lecture Notes

Loading Unlocking...
Login

Join to view Lecture Notes and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Lecture Notes and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?