Administrivia Homework problems for Unit 1 due today Homework problems for Unit 3 posted soon CMSC 411 Computer Systems y Architecture Lecture 5 Basic Pipelining cont cont Alan Sussman a s cs u d edu als cs umd edu 2 CMSC 411 5 from Patterson Data Hazard Even with Forwarding Forwarding to Avoid LW SW Data Hazard Figure A 9 Page A 20 Figure A 8 Page A 29 Time clock cycles Ifetch Reg DMem Ifetch Reg xor r10 r9 r11 CMSC 411 5 from Patterson Reg Reg Reg DMem 3 O r d e r sub r4 r4 r1 r6 r1 r6 and r6 r6 r1 r7 r1 r7 or R Reg DM DMem Ifetch Reg DMem Reg g Ifetch Ifetch r8 r8 r1 r9 r1 r9 CMSC 411 5 from Patterson R Reg Reg Reg DMem A ALU DMem If t h l r1 lw r1 0 r2 Ifetch A ALU Reg I n s t r Reg AL LU Ifetch Reg AL LU r8 r8 r6 r9 r6 r9 DMem ALU or Reg ALU sw r4 12 r1 12 1 Ifetch Time clock cycles DMem AL LU lw r4 0 r1 Reg ALU O r d e r add r1 r2 r3 Ifetch ALU I n s t r Reg Reg DMem 4 Data Hazard Even with Forwarding Software Scheduling Instead Similar to Figure A 10 Page A 21 Try producing fast code for a b c d e f assuming a b c d e and f in memory Reg Ifetch and r6 r1 r7 or r8 r1 r9 DMem Reg Reg Bubble Ifetch Bubble Reg Bubble Ifetch DMem Reg Reg DMem Reg ALU sub r4 r1 r6 Ifetch ALU O r d e r lw r1 0 r2 ALU I n s t r ALU Time clock cycles DMem Slow code LW LW ADD SW LW LW SUB SW Rb b Rc c Ra Rb Rc a Ra Re e Rf f Rd Re Rf d Rd Fast code LW LW LW ADD LW SW SUB SW Rb b Rc c Re e Ra Rb Rc Rf f a Ra Rd Re Rf d Rd Compiler optimizes for performance Hardware checks for safety CMSC 411 5 from Patterson 5 CMSC 411 5 from Patterson Control hazards Example Question When do we find out that the PC needs to be modified If branch in 30 of instructions then instead of executing 1 instruction per cycle have 70 of i t instructions ti executing ti iin 1 cycle l and d 30 off instructions executing in 2 cycles An average g of 7 6 1 3 cycles y per p instruction Answer In pipeline stage ID of a branch instruction So if a branch is not taken i e if the PC is not modified need a one cycle delay Question When is a taken branch s address known 6 Worse by 30 ALU used to compute so EX stage Need two or three cycle delay CMSC 411 5 from Patterson 7 CMSC 411 5 from Patterson 8 Branch Stall Impact Reg Reg DMem Ifetch Reg Ifetch 34 xor r10 r1 r11 Determine branch taken or not taken sooner AND Compute taken branch address earlier Reg Reg MIPS branch tests if register 0 or 0 MIPS Solution S l i Reg DMem ALU U 22 add r8 r1 r9 DMem If CPI 1 30 branch St ll 3 cycles Stall l new CPI 1 1 9 9 Two part solution Reg Ifetch r6 r1 r7 Reg ALU DMem ALU Ifetch 14 and r2 r3 r5 18 or Reg Ifetch ALU 10 beq r1 r3 34 ALU Control Hazard on Branches Three Stage Stall DMem Reg Move Zero test to ID RF stage Adder to calculate new PC in ID RF stage 1 clock cycle penalty for branch versus 3 What do you do with the 3 instructions in between How do you do it Where is the commit 9 CMSC 411 5 from Patterson CMSC 411 5 from Patterson Pipelined MIPS Datapath Four Branch Hazard Alternatives Figure g A 24 page p g A 38 Instruction Fetch Memory Access Write Back 1 Stall until branch direction is clear 2 Predict Branch Not Taken Adder Adder MUX Next SEQ PC Next PC RS1 Sign Extend RD RD RD Execute successor instructions in sequence Squash instructions in pipeline if branch actually taken Advantage of late pipeline state update 47 MIPS branches not taken on average PC 4 already calculated so use it to get next instruction 3 Predict Branch Taken MUX MEM W WB D Data Me emory EX ME EM ALU U M MUX ID EX Imm Reg File RS2 IF ID Memorry Addresss Zero WB Data 4 Execute Addr Calc Instr Decode Reg Fetch 10 53 MIPS branches taken on average But haven haven tt calculated branch target address in MIPS MIPS still incurs 1 cycle branch penalty Other machines branch target known before outcome Interplay of instruction set design and cycle time CMSC 411 5 from Patterson 11 CMSC 411 5 from Patterson 12 Scheduling Branch Delay Slots Fig A 14 Four Branch Hazard Alternatives A From before branch add R1 R2 R3 if R2 0 then 4 Delayed Branch delay slot Define branch to take place AFTER a following instruction branch instruction sequential successor1 sequential successor2 sequential successorn branch target if taken Branch delay y of length g n add R1 R2 R3 add R1 R2 R3 if R1 0 then d l slot delay l t becomes add R1 R2 R3 if R1 R1 0 0 then sub R4 R5 R6 C From fall through add R1 R2 R3 if R1 0 then delay slot sub R4 R5 R6 R4 R5 R6 becomes add R1 R2 R3 if R1 0 then sub R4 R5 R6 A is the best choice fills delay y slot reduces instruction count IC In B the sub instruction may need to be copied increasing IC In B and C must be okay to execute sub when branch fails 13 Scheduling branch delay slot CMSC 411 5 from Patterson 14 Delayed y Branch If taken from before branch Compiler effectiveness for single branch delay slot branch must not depend on rescheduled instruction always improves performance Fills about 60 of branch delay slots About 80 of instructions executed in branch delay slots useful in computation About 50 60 x 80 of slots usefully filled If taken from branch target must be OK to execute rescheduled instructions if branch not taken and may need to duplicate insts performance f improved i d when h branch b h taken t k Delayed Branch downside As processor go to deeper pipelines and multiple issue issue the branch delay grows and need more than one delay slot If taken from fall through Delayed branching has lost popularity compared to more expensive p but more flexible dynamic y approaches pp Growth in available transistors has made dynamic approaches relatively cheaper must be OK to execute insts insts if branch taken improves performance when branch not taken CMSC 411 5 from Patterson sub R4 R5 R6 if R2 0 then 1 slot delay allows proper decision and …
View Full Document