CS152 Computer Architecture and Engineering
Lecture 15: Static Pipeline Scheduling / Compiler Optimizations
March 31, 2003
John Kubiatowicz (http.cs.berkeley.edu/~kubitron)
lecture slides: http://www-inst.eecs.berkeley.edu/~cs152/
3/31/03  UCB Spring 2003  CS152 / Kubiatowicz


Recall: Achieving Precise Exceptions

[Figure: pipeline diagram of time vs. program flow; instructions pass through IFetch, Dcd, Exec, Mem, WB, with exceptions marked as Bad Inst, Inst TLB fault, Overflow, Data TLB]

- Use pipeline to sort this out!
  - Pass exception status along with instruction.
  - Keep track of PCs for every instruction in pipeline.
  - Don't act on exception until it reaches WB stage.
- Handle interrupts through "faulting noop" in IF stage.
- When instruction reaches end of MEM stage:
  - Save PC in EPC; load interrupt vector addr into PC.
  - Turn all instructions in earlier stages into noops!


Recall: Freeze above & Bubble Below

[Figure: pipelined datapath (IAU, npc, PC, I mem, Regs, alu, D mem) with freeze and bubble control signals]

- Flush accomplished by setting "invalid" bit in pipeline.


Examples of stalls/bubbles

- Exceptions: Flush everything above
  - Prevent instructions following exception from committing state
  - Freeze fetch until exception resolved
- Stalls: Introduce brief stalls into pipeline
  - Decode stage recognizes that current instruction cannot proceed
  - Freeze fetch stage
  - Introduce "bubble" into EX stage (instead of forwarding stalled inst)
  - Can stall until condition is resolved
- Examples:
  - mfhi, mflo: need to wait for multiply/divide unit to finish
  - "Break" instruction for Lab5: stall until release line received
  - Load delay slot handled this way as well


Recall: Case Study: MIPS R4000 (200 MHz)

8 Stage Pipeline:
- IF: first half of fetching of instruction; PC selection happens here as well as initiation of instruction cache access.
- IS: second half of access to instruction cache.
- RF: instruction decode and register fetch, hazard checking and also instruction cache hit detection.
- EX: execution, which includes effective address calculation, ALU operation, and branch target computation and condition evaluation.
- DF: data fetch, first half of access to data cache.
- DS: second half of access to data cache.
- TC: tag check, determine whether the data cache access hit.
- WB: write back for loads and register-register operations.

8 Stages: What is impact on Load delay? Branch delay? Why?


Recall: Case Study: MIPS R4000

[Figure: two staggered pipeline diagrams; each successive instruction starts one cycle later and passes through IF, IS, RF, EX, DF, DS, TC, WB]

- TWO Cycle Load Latency
- THREE Cycle Branch Latency (conditions evaluated during EX phase)
  - Delay slot plus two stalls
  - Branch likely cancels delay slot if not taken


Can we somehow make CPI closer to 1?

- Let's assume full pipelining:
  - If we have a 4-cycle instruction, then we need 3 instructions between a producing instruction and its use:

        multf  F0,F2,F4
        delay-1
        delay-2
        delay-3
        addf   F6,F10,F0

[Figure: pipeline stages Fetch, Decode, Ex1, Ex2, Ex3, Ex4, WB holding addf, delay3, delay2, delay1, multf; arrows mark the earliest forwarding for 4-cycle instructions and the earliest forwarding for 1-cycle instructions]


FP Loop: Where are the Hazards?

    Loop: LD    F0,0(R1)    ;F0=vector element
          ADDD  F4,F0,F2    ;add scalar from F2
          SD    0(R1),F4    ;store result
          SUBI  R1,R1,8     ;decrement pointer 8B (DW)
          BNEZ  R1,Loop     ;branch R1!=zero
          NOP               ;delayed branch slot

    Instruction producing result   Instruction using result   Latency in clock cycles
    FP ALU op                      Another FP ALU op          3
    FP ALU op                      Store double               2
    Load double                    FP ALU op                  1
    Load double                    Store double               0
    Integer op                     Integer op                 0

- Where are the stalls?


FP Loop Showing Stalls

    1 Loop: LD    F0,0(R1)    ;F0=vector element
    2       stall
    3       ADDD  F4,F0,F2    ;add scalar in F2
    4       stall
    5       stall
    6       SD    0(R1),F4    ;store result
    7       SUBI  R1,R1,8     ;decrement pointer 8B (DW)
    8       BNEZ  R1,Loop     ;branch R1!=zero
    9       stall             ;delayed branch slot

    Instruction producing result   Instruction using result   Latency in clock cycles
    FP ALU op                      Another FP ALU op          3
    FP ALU op                      Store double               2
    Load double                    FP ALU op                  1

- 9 clocks: Rewrite code to minimize stalls?


Revised FP Loop Minimizing Stalls

    1 Loop: LD    F0,0(R1)
    2       stall
    3       ADDD  F4,F0,F2
    4       SUBI  R1,R1,8
    5       BNEZ  R1,Loop     ;delayed branch
    6       SD    8(R1),F4    ;altered when moved past SUBI

Swap BNEZ and SD by changing address of SD.

    Instruction producing result   Instruction using result   Latency in clock cycles
    FP ALU op                      Another FP ALU op          3
    FP ALU op                      Store double               2
    Load double                    FP ALU op                  1

- 6 clocks: Unroll loop 4 times to make code faster?


Unroll Loop Four Times (straightforward way)

     1 Loop: LD    F0,0(R1)
     2       ADDD  F4,F0,F2
     3       SD    0(R1),F4      ;drop SUBI & BNEZ
     4       LD    F6,-8(R1)
     5       ADDD  F8,F6,F2
     6       SD    -8(R1),F8     ;drop SUBI & BNEZ
     7       LD    F10,-16(R1)
     8       ADDD  F12,F10,F2
     9       SD    -16(R1),F12   ;drop SUBI & BNEZ
    10       LD    F14,-24(R1)
    11       ADDD  F16,F14,F2
    12       SD    -24(R1),F16
    13       SUBI  R1,R1,#32     ;alter to 4*8
    14       BNEZ  R1,LOOP
    15       NOP

Stalls: 1 cycle after each LD, 2 cycles after each ADDD.

- 15 + 4 x (1+2) = 27 clock cycles, or 6.8 per iteration. Assumes R1 is multiple of 4.
- CPI = 27/15 = 1.8
- Rewrite loop to minimize stalls?


Unrolled Loop That Minimizes Stalls

     1 Loop: LD    F0,0(R1)
     2       LD    F6,-8(R1)
     3       LD    F10,-16(R1)
     4       LD    F14,-24(R1)
     5       ADDD  F4,F0,F2
     6       ADDD  F8,F6,F2
     7       ADDD  F12,F10,F2
     8       ADDD  F16,F14,F2
     9       SD    0(R1),F4
    10       SD    -8(R1),F8
    11       SD    -16(R1),F12
    12       SUBI  R1,R1,#32
    13       BNEZ  R1,LOOP
    14       SD    8(R1),F16     ; 8-32 = -24

- What assumptions were made when code was moved?
  - OK to move store past SUBI even though it changes the register
  - OK to move loads before stores: get right data?
  - When is it safe for compiler to do such changes?
- 14 clock cycles, or 3.5 per iteration
- CPI = 14/14 = 1
- When is it safe to move instructions?


Getting CPI < 1: Issuing Multiple Instructions/Cycle

- Two main variations: Superscalar and VLIW
- Superscalar: varying no. of instructions/cycle (1 to 6)
  - Parallelism and dependencies determined/resolved by HW
  - IBM PowerPC 604, Sun UltraSparc, DEC Alpha 21164, HP …
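
The clock counts quoted for the four versions of the FP loop (9, 6, 27 and 14) can be checked with a small model. The following is a minimal sketch, not part of the lecture: it assumes a single-issue, in-order pipeline in which an instruction issues one clock after its predecessor unless the latency table forces it to wait for a producer. The function name `clocks`, the tuple encoding of instructions, and the default of 0 stalls for producer/consumer pairs missing from the table are all illustrative assumptions. Memory offsets are left out because they do not affect timing.

```python
# Stall cycles between a producer and a dependent consumer, taken from the
# latency table on the slides. Pairs not listed are ASSUMED to need 0 stalls.
LATENCY = {
    ("fpalu", "fpalu"): 3,
    ("fpalu", "store"): 2,
    ("load", "fpalu"): 1,
    ("load", "store"): 0,
    ("int", "int"): 0,
}


def clocks(code):
    """Return the clock in which the last instruction issues (first issues in clock 1)."""
    last_writer = {}  # register -> (issue clock, kind) of its most recent producer
    clock = 0
    for kind, dest, srcs in code:
        issue = clock + 1  # in-order: at best one clock after the previous instruction
        for reg in srcs:
            if reg in last_writer:
                p_clock, p_kind = last_writer[reg]
                issue = max(issue, p_clock + 1 + LATENCY.get((p_kind, kind), 0))
        if dest is not None:
            last_writer[dest] = (issue, kind)
        clock = issue
    return clock


# Hypothetical helpers that encode an instruction as (kind, dest, sources).
def LD(fd):
    return ("load", fd, ("R1",))


def ADDD(fd, fs, ft):
    return ("fpalu", fd, (fs, ft))


def SD(fs):
    return ("store", None, (fs, "R1"))


SUBI = ("int", "R1", ("R1",))
BNEZ = ("int", None, ("R1",))  # branch modelled as an integer consumer of R1
NOP = ("nop", None, ())

# (loaded register, result register) for each of the four unrolled copies.
pairs = [("F0", "F4"), ("F6", "F8"), ("F10", "F12"), ("F14", "F16")]

original = [LD("F0"), ADDD("F4", "F0", "F2"), SD("F4"), SUBI, BNEZ, NOP]
scheduled = [LD("F0"), ADDD("F4", "F0", "F2"), SUBI, BNEZ, SD("F4")]

unrolled = []
for src, res in pairs:
    unrolled += [LD(src), ADDD(res, src, "F2"), SD(res)]
unrolled += [SUBI, BNEZ, NOP]

unrolled_scheduled = (
    [LD(src) for src, _ in pairs]
    + [ADDD(res, src, "F2") for src, res in pairs]
    + [SD(res) for _, res in pairs[:3]]
    + [SUBI, BNEZ, SD("F16")]
)

for name, code in [
    ("original", original),
    ("scheduled", scheduled),
    ("unrolled", unrolled),
    ("unrolled + scheduled", unrolled_scheduled),
]:
    print(f"{name}: {clocks(code)} clocks")
```

Under these assumptions the model reports 9, 6, 27 and 14 clocks, matching the slide figures. It only tracks true (read-after-write) dependences through registers, so it says nothing about whether a reordering is safe with respect to memory, which is the question the slides raise about moving loads before stores.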