Unformatted text preview:

CPE 631 Lecture 14 Exploiting ILP with SW Approaches 2 Aleksandar Milenkovi milenka ece uah edu Electrical and Computer Engineering University of Alabama in Huntsville CPE 631 AM Outline Basic Pipeline Scheduling and Loop Unrolling Multiple Issue Superscalar VLIW Software Pipelining 15 01 19 UAH CPE631 2 CPE 631 AM ILP Concepts and Challenges ILP Instruction Level Parallelism overlap execution of unrelated instructions Techniques that increase amount of parallelism exploited among instructions reduce impact of data and control hazards increase processor ability to exploit parallelism Pipeline CPI Ideal pipeline CPI Structural stalls RAW stalls WAR stalls WAW stalls Control stalls Reducing each of the terms of the right hand side minimize CPI and thus increase instruction throughput 15 01 19 UAH CPE631 3 CPE 631 AM Basic Pipeline Scheduling Example Simple loop Assumptions for i 1 i 1000 i x i x i s Instruction producing result FP ALU op FP ALU op Load double Load double Integer op Instruction using result Another FP ALU op Store double FP ALU op Store double Integer op Latency in clock cycles 3 2 1 0 0 R1 points to the last element in the array for simplicity we assume that x 0 is at the address 0 Loop L D F0 0 R1 F0 array el ADD D F4 F0 F2 add scalar in F2 S D 0 R1 F4 store result SUBI R1 R1 8 decrement pointer BNEZ R1 Loop branch 15 01 19 UAH CPE631 4 CPE 631 AM Executing FP Loop 1 Loop 2 3 4 5 6 7 8 9 10 Instruction producing result FP ALU op FP ALU op Load double Load double Integer op 15 01 19 LD Stall ADDD Stall Stall SD SUBI Stall BNEZ Stall F0 0 R1 F4 F0 F2 10 clocks per iteration 5 stalls Rewrite code to minimize stalls 0 R1 F4 R1 R1 8 R1 Loop Instruction using result Another FP ALU op Store double FP ALU op Store double Integer op Latency in clock cycles 3 2 1 0 0 UAH CPE631 5 CPE Revised FP loop to 631 AM 1 Loop LD F0 0 R1 2 SUBI R1 R1 8 3 ADDD F4 F0 F2 4 Stall 5 BNEZ R1 Loop 6 SD 8 R1 F4 minimise stalls Swap BNEZ and SD by changing address of SD SUBI is moved up delayed branch altered and interch SUBI 6 clocks per iteration 1 stall but only 3 instructions do the actual work processing the array LD ADDD SD Unroll loop 4 times to improve potential for instr scheduling Instruction producing result FP ALU op FP ALU op Load double Load double Integer op 15 01 19 Instruction using result Another FP ALU op Store double FP ALU op Store double Integer op UAH CPE631 Latency in clock cycles 3 2 1 0 0 6 CPE 631 AM LD Unrolled Loop 1 cycle stall F0 0 R1 2 cycles stall ADDD F4 F0 F2 SD 0 R1 F4 drop SUBI BNEZ LD F0 8 R1 ADDD F4 F0 F2 SD 8 R1 F4 drop SUBI BNEZ LD F0 16 R1 ADDD F4 F0 F2 SD 16 R1 F4 drop SUBI BNEZ LD F0 24 R1 ADDD F4 F0 F2 SD 24 R1 F4 SUBI R1 R1 32 BNEZ R1 Loop 15 01 19 UAH CPE631 This loop will run 28 cc 14 stalls per iteration each LD has one stall each ADDD 2 SUBI 1 BNEZ 1 plus 14 instruction issue cycles or 28 4 7 for each element of the array even slower than the scheduled version Rewrite loop to minimize stalls 7 CPE 631 AM Where are the name dependencies 1 Loop LD 2 ADDD 3 SD 4 LD 5 ADDD 6 SD 7 LD 8 ADDD 9 SD 10 LD 11 ADDD 12 SD 13 SUBUI 14 BNEZ 15 NOP F0 0 R1 F4 F0 F2 0 R1 F4 F0 8 R1 F4 F0 F2 8 R1 F4 F0 16 R1 F4 F0 F2 16 R1 F4 F0 24 R1 F4 F0 F2 24 R1 F4 R1 R1 32 R1 LOOP drop DSUBUI BNEZ drop DSUBUI BNEZ drop DSUBUI BNEZ alter to 4 8 How can remove them 15 01 19 UAH CPE631 8 CPE 631 AM Where are the name dependencies 1 Loop L D 2 ADD D 3 S D 4 L D 5 ADD D 6 S D 7 L D 8 ADD D 9 S D 10 L D 11 ADD D 12 S D 13 DSUBUI 14 BNEZ 15 NOP F0 0 R1 F4 F0 F2 0 R1 F4 F6 8 R1 F8 F6 F2 8 R1 F8 F10 16 R1 F12 F10 F2 16 R1 F12 F14 24 R1 F16 F14 F2 24 R1 F16 R1 R1 32 R1 LOOP drop DSUBUI BNEZ drop DSUBUI BNEZ drop DSUBUI BNEZ alter to 4 8 The Orginal register renaming 15 01 19 UAH CPE631 9 CPE Unrolled Loop that Minimise Stalls 631 AM Loop LD F0 0 R1 This loop will run 14 cycles LD F6 8 R1 no stalls per iteration or 14 4 3 5 for each element LD F10 16 R1 LD F14 24 R1 Assumptions that make this ADDD F4 F0 F2 possible ADDD F8 F6 F2 move LDs before SDs ADDD F12 F10 F2 move SD after SUBI and BNEZ ADDD F16 F14 F2 use different registers SD 0 R1 F4 SD 8 R1 F8 When is it safe for compiler SUBI R1 R1 32 to do such changes SD 16 R1 F12 BNEZ R1 Loop SD 8 R1 F4 15 01 19 UAH CPE631 10 CPE 631 AM Steps Compiler Performed to Unroll Determine that is OK to move the S D after SUBUI and BNEZ and find amount to adjust SD offset Determine that unrolling the loop would be useful by finding that the loop iterations were independent Rename registers to avoid name dependencies Eliminate extra test and branch instructions and adjust the loop termination and iteration code Determine loads and stores in unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent requires analyzing memory addresses and finding that they do not refer to the same address Schedule the code preserving any dependences needed to yield same result as the original code 15 01 19 UAH CPE631 11 CPE 631 AM Multiple Issue Pipeline CPI Ideal pipeline CPI Structural stalls RAW stalls WAR stalls WAW stalls Control stalls Decrease Ideal pipeline CPI Multiple issue Superscalar Statically scheduled compiler techniques Dynamically scheduled Tomasulo s alg VLIW Very Long Instruction Word parallelism is explicitly indicated by instruction EPIC explicitly parallel instruction computers 15 01 19 UAH CPE631 12 CPE 631 AM Superscalar MIPS Superscalar MIPS 2 instructions 1 FP 1 anything else Fetch 64 bits clock cycle Int on left FP on right Can only issue 2nd instruction if 1st instruction issues More ports for FP registers to do FP load FP op in a pair 10 5 Time clocks I IF ID Ex Mem WB FP IF ID Ex Mem WB I IF ID Ex Mem WB FP IF ID Ex Mem WB IF ID Ex Mem WB IF …


View Full Document

UAH CPE 631 - Basic Pipeline Scheduling and Loop Unrolling

Loading Unlocking...
Login

Join to view Basic Pipeline Scheduling and Loop Unrolling and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Basic Pipeline Scheduling and Loop Unrolling and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?