CS152 Computer Architecture and Engineering Lecture 16 Compiler Optimizations Cont Dynamic Scheduling with Scoreboards October 26 2001 John Kubiatowicz http cs berkeley edu kubitron lecture slides http www inst eecs berkeley edu cs152 10 26 01 UCB Fall 2001 CS152 Kubiatowicz Recall Achieving Precise Exceptions Time Bad Inst Inst TLB fault Overflow IFetch Dcd Program Flow Data TLB Exec IFetch Dcd Mem WB Exec Mem WB Exec Mem WB Exec Mem IFetch Dcd IFetch Dcd WB Use pipeline to sort this out Pass exception status along with instruction Keep track of PCs for every instruction in pipeline Don t act on exception until it reach WB stage Handle interrupts through faulting noop in IF stage When instruction reaches end of MEM stage Save PC EPC Interrupt vector addr PC Turn all instructions in earlier stages into noops 10 26 01 UCB Fall 2001 CS152 Kubiatowicz Recall Can we somehow make CPI closer to 1 Let s assume full pipelining If we have a 4 cycle instruction then we need 3 instructions between a producing instruction and its use multf F0 F2 F4 delay 1 delay 2 delay 3 addf F6 F10 F0 Earliest forwarding for 4 cycle instructions Earliest forwarding for 1 cycle instructions Fetch Decode Ex1 Ex2 Ex3 Ex4 WB addf delay3 delay2 delay1 multf 10 26 01 UCB Fall 2001 CS152 Kubiatowicz Recall FP Loop Showing Stalls 1 Loop LD F0 0 R1 2 stall 3 ADDD F4 F0 F2 4 stall 5 stall 6 SD 0 R1 F4 7 SUBI R1 R1 8 8 BNEZ R1 Loop 9 stall F0 vector element add scalar in F2 store result decrement pointer 8B DW branch R1 zero delayed branch slot Instruction Instruction Latency in producing result using result cycles FP ALU op Another FP ALU op 3 FP ALU op Store double 2 Load double FP ALU op 1 9 clocks CPI 9 5 1 8 Rewrite code to minimize stalls 10 26 01 UCB Fall 2001 clock CS152 Kubiatowicz Recall Revised FP Loop Minimizing Stalls 1 Loop LD F0 0 R1 2 stall 3 ADDD F4 F0 F2 4 SUBI R1 R1 8 5 BNEZ R1 Loop 6 SD 8 R1 F4 delayed branch altered when move past SUBI Swap BNEZ and SD by changing address of SD Instruction Instruction Latency in producing result using result cycles FP ALU op Another FP ALU op 3 FP ALU op Store double 2 Load double FP ALU op 1 clock 6 clocks CPI 6 5 1 2 Unroll loop 4 times code to make faster 10 26 01 UCB Fall 2001 CS152 Kubiatowicz Unroll Loop Four Times straightforward way 1 Loop LD 2 ADDD 3 SD 4 LD 5 ADDD 6 SD 7 LD 8 ADDD 9 SD 10 LD 11 ADDD 12 SD 13 SUBI 14 BNEZ 15 NOP F0 0 R1 F4 F0 F2 0 R1 F4 F6 8 R1 F8 F6 F2 8 R1 F8 F10 16 R1 F12 F10 F2 16 R1 F12 F14 24 R1 F16 F14 F2 24 R1 F16 R1 R1 32 R1 LOOP 1 cycle stall 2 cycles stall drop SUBI BNEZ Rewrite loop to minimize stalls drop SUBI BNEZ drop SUBI BNEZ alter to 4 8 15 4 x 1 2 27 clock cycles or 6 8 per iteration Assumes R1 is multiple of 4 CS152 Kubiatowicz 10 26 01 CPI 27 15 1 8 UCB Fall 2001 Unrolled Loop That Minimizes Stalls 1 Loop LD 2 LD 3 LD 4 LD 5 ADDD 6 ADDD 7 ADDD 8 ADDD 9 SD 10 SD 11 SD 12 SUBI 13 BNEZ 14 SD F0 0 R1 F6 8 R1 F10 16 R1 F14 24 R1 F4 F0 F2 F8 F6 F2 F12 F10 F2 F16 F14 F2 0 R1 F4 8 R1 F8 16 R1 F12 R1 R1 32 R1 LOOP 8 R1 F16 What assumptions made when moved code OK to move store past SUBI even though changes register OK to move loads before stores get right data When is it safe for compiler to do such changes 8 32 24 14 clock cycles or 3 5 per iteration CPI 14 14 1 When safe to move instructions 10 26 01 UCB Fall 2001 CS152 Kubiatowicz Getting CPI 1 Issuing Multiple Instructions Cycle Two main variations Superscalar and VLIW Superscalar varying no instructions cycle 1 to 6 Parallelism and dependencies determined resolved by HW IBM PowerPC 604 Sun UltraSparc DEC Alpha 21164 HP 7100 Very Long Instruction Words VLIW fixed number of instructions 16 parallelism determined by compiler Pipeline is exposed compiler must schedule delays to get right result Explicit Parallel Instruction Computer EPIC Intel 128 bit packets containing 3 instructions can execute sequentially Can link 128 bit packets together to allow more parallelism Compiler determines parallelism HW checks dependencies and fowards stalls 10 26 01 UCB Fall 2001 CS152 Kubiatowicz Getting CPI 1 Issuing Multiple Instructions Cycle Superscalar DLX 2 instructions 1 FP 1 anything else Fetch 64 bits clock cycle Int on left FP on right Can only issue 2nd instruction if 1st instruction issues More ports for FP registers to do FP load FP op in a pair Type Int instruction FP instruction Int instruction FP instruction Int instruction FP instruction PipeStages IF ID IF ID IF IF EX MEM WB EX MEM WB ID EX MEM WB ID EX MEM WB IF ID EX MEM WB IF ID EX MEM WB 1 cycle load delay expands to 3 instructions in SS instruction in right half can t use it nor instructions in next slot 10 26 01 UCB Fall 2001 CS152 Kubiatowicz Loop Unrolling in Superscalar Integer instruction Loop LD F0 0 R1 LD F6 8 R1 LD F10 16 R1 LD F14 24 R1 LD F18 32 R1 SD 0 R1 F4 SD 8 R1 F8 SD 16 R1 F12 SD 24 R1 F16 SUBI R1 R1 40 BNEZ R1 LOOP SD 32 R1 F20 FP instruction 1 2 ADDD F4 F0 F2 ADDD F8 F6 F2 ADDD F12 F10 F2 ADDD F16 F14 F2 ADDD F20 F18 F2 8 9 10 11 12 Clock cycle 3 4 5 6 7 Unrolled 5 times to avoid delays 1 due to SS 12 clocks or 2 4 clocks per iteration 10 26 01 UCB Fall 2001 CS152 Kubiatowicz Limits of Superscalar While Integer FP split is simple for the HW get CPI of 0 5 only for programs with Exactly 50 FP operations No hazards If more instructions issue at same time greater difficulty of decode and issue Even 2 scalar examine 2 opcodes 6 register specifiers decide if 1 or 2 instructions can issue VLIW tradeoff instruction space for simple decoding The long instruction word has room for many operations By definition all the operations the compiler puts in the long instruction word can execute in parallel E g 2 integer operations 2 FP ops 2 Memory refs 1 branch 16 to 24 bits per field 7 16 or 112 bits to 7 24 or 168 bits wide Need compiling technique that schedules across several branches 10 26 01 UCB Fall 2001 CS152 Kubiatowicz Loop Unrolling in VLIW Memory reference 1 Memory reference 2 FP operation …
View Full Document
Unlocking...