Slide 1
Review
Problems for Pipelining CPUs
Structural Hazard #1: Single Memory (1/2)
Structural Hazard #1: Single Memory (2/2)
Structural Hazard #2: Registers (1/2)
Structural Hazard #2: Registers (2/2)
Control Hazard: Branching (1/9)
Control Hazard: Branching (2/9)
Control Hazard: Branching (3/9)
Control Hazard: Branching (4/9)
Control Hazard: Branching (5/9)
Control Hazard: Branching (6/9)
Control Hazard: Branching (7/9)
Control Hazard: Branching (8/9)
Control Hazard: Branching (9/9)
Example: Nondelayed vs. Delayed Branch
Data Hazards (1/2)
Data Hazards (2/2)
Data Hazard Solution: Forwarding
Data Hazard: Loads (1/4)
Data Hazard: Loads (2/4)
Data Hazard: Loads (3/4)
Data Hazard: Loads (4/4)
Peer Instruction
“And in Conclusion..”
Bonus slides
Historical Trivia
Pipeline Hazard: Matching socks in later load
Out-of-Order Laundry: Don’t Wait
Superscalar Laundry: Parallel per stage
Superscalar Laundry: Mismatch Mix
Peer Instruction (1/2)
Peer Instruction (2/2)

inst.eecs.berkeley.edu/~cs61c
UCB CS61C : Machine Structures
Lecture 29 – CPU Design : Pipelining to Improve Performance II
2010-04-07

IS 3D BAD FOR YOU? MANY HAVE EYESTRAIN!
Cal researcher Marty Banks has put together a system to help with the eyestrain many viewers experience with 3D content on a small screen – the vergence / accommodation conflict.
Lecturer SOE Dan Garcia
www.technologyreview.com/computing/24976

CS61C L29 CPU Design : Pipelining to Improve Performance II, Garcia, Spring 2010 © UCB

Review
- Pipelining is a BIG idea
- Optimal Pipeline
  - Each stage is executing part of an instruction each clock cycle.
  - One instruction finishes during each clock cycle.
  - On average, execute far more quickly.
- What makes this work?
  - Similarities between instructions allow us to use same stages for all instructions (generally).
  - Each stage takes about the same amount of time as all others: little wasted time.

Problems for Pipelining CPUs
- Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle
  - Structural hazards: HW cannot support some combination of instructions (single person to fold and put clothes away)
  - Control hazards: Pipelining of branches causes later instruction fetches to wait for the result of the branch
  - Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock)
- These might result in pipeline stalls or “bubbles” in the pipeline.

Structural Hazard #1: Single Memory (1/2)
[Pipeline diagram: Load, Instr 1, Instr 2, Instr 3, Instr 4 in instruction order against time (clock cycles); each instruction passes through I$, Reg, ALU, D$, Reg. Annotation: Read same memory twice in same clock cycle]

Structural Hazard #1: Single Memory (2/2)
- Solution:
  - infeasible and inefficient to create second memory (We’ll learn about this more Friday/next week)
  - …so simulate this by having two Level 1 Caches (a temporary, smaller copy of memory, usually holding the most recently used contents)
  - have both an L1 Instruction Cache and an L1 Data Cache
  - need more complex hardware to control when both caches miss
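The conflict on the Single Memory slides can be checked mechanically. Below is a minimal sketch in Python, assuming the classic five-stage pipeline in which instruction i (counting from 0) is fetched in cycle i + 1 and reaches its memory stage three cycles later. The function name and the sample mnemonics are hypothetical, chosen only for illustration.

```python
# Sketch: find cycles in which a single, unified memory would be
# accessed twice (one instruction fetch plus one data access).
# Assumption: 5 stages (fetch, decode, ALU, memory, write-back),
# one instruction issued per cycle, no stalls.

FETCH_OFFSET = 0   # fetch happens in the instruction's first cycle
MEM_OFFSET = 3     # memory stage is three cycles after fetch

def memory_conflicts(program):
    """Return sorted cycles where a fetch and a data access collide.

    `program` is a list of mnemonics; only "lw" and "sw" touch
    data memory in this simplified model.
    """
    fetch_cycles = {i + 1 + FETCH_OFFSET for i in range(len(program))}
    data_cycles = {
        i + 1 + MEM_OFFSET
        for i, op in enumerate(program)
        if op in ("lw", "sw")
    }
    return sorted(fetch_cycles & data_cycles)

program = ["lw", "add", "sub", "or", "and"]
print(memory_conflicts(program))  # [4]
```

In this model the load's data access in cycle 4 collides with the fetch of Instr 3, which is the overlap the diagram highlights. With separate L1 instruction and data caches the two accesses go to different structures, so the collision only matters when both caches miss.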
Structural Hazard #2: Registers (1/2)
- Can we read and write to registers simultaneously?
[Pipeline diagram: sw, Instr 1, Instr 2, Instr 3, Instr 4 in instruction order against time (clock cycles); each instruction passes through I$, Reg, ALU, D$, Reg]

Structural Hazard #2: Registers (2/2)
- Two different solutions have been used:
  1) RegFile access is VERY fast: takes less than half the time of ALU stage
     - Write to Registers during first half of each clock cycle
     - Read from Registers during second half of each clock cycle
  2) Build RegFile with independent read and write ports
- Result: can perform Read and Write during same clock cycle

Control Hazard: Branching (1/9)
- Where do we do the compare for the branch?
[Pipeline diagram: beq, Instr 1, Instr 2, Instr 3, Instr 4 in instruction order against time (clock cycles); each instruction passes through I$, Reg, ALU, D$, Reg]

Control Hazard: Branching (2/9)
- We had put branch decision-making hardware in ALU stage
  - therefore two more instructions after the branch will always be fetched, whether or not the branch is taken
- Desired functionality of a branch
  - if we do not take the branch, don’t waste any time and continue executing normally
  - if we take the branch, don’t execute any instructions after the branch, just go to the desired label

Control Hazard: Branching (3/9)
- Initial Solution: Stall until decision is made
  - insert “no-op” instructions (those that accomplish nothing, just take time) or hold up the fetch of the next instruction (for 2 cycles).
  - Drawback: branches take 3 clock cycles each (assuming comparator is put in ALU stage)
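The cost of stalling on every branch can be estimated with a little arithmetic. Here is a rough sketch in Python, assuming every instruction costs one cycle once the pipeline is full and every branch additionally wastes its stall cycles. The function name and the instruction counts are made up for illustration.

```python
def total_cycles(num_instructions, num_branches, stall_cycles_per_branch):
    """Cycles to issue a program, ignoring pipeline fill/drain time.

    Each instruction takes 1 cycle; each branch additionally
    wastes `stall_cycles_per_branch` cycles waiting for its outcome.
    """
    if num_branches > num_instructions:
        raise ValueError("cannot have more branches than instructions")
    return num_instructions + num_branches * stall_cycles_per_branch

# Comparator in the ALU stage: 2 stall cycles, so 3 cycles per branch.
cycles = total_cycles(num_instructions=100, num_branches=20,
                      stall_cycles_per_branch=2)
print(cycles)          # 140
print(cycles / 100)    # 1.4 cycles per instruction on average
```

Under these assumptions, a program in which one instruction in five is a branch runs at 1.4 cycles per instruction instead of the ideal 1, which is why reducing the stall per branch is worth extra hardware.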
Control Hazard: Branching (4/9)
- Optimization #1:
  - insert special branch comparator in Stage 2
  - as soon as instruction is decoded (Opcode identifies it as a branch), immediately make a decision and set the new value of the PC
  - Benefit: since branch is complete in Stage 2, only one unnecessary instruction is fetched, so only one no-op is needed
  - Side Note: This means that branches are idle in Stages 3, 4 and 5.

Control Hazard: Branching (5/9)
- Branch comparator moved to Decode stage.
[Pipeline diagram: beq, Instr 1, Instr 2, Instr 3, Instr 4 in instruction order against time (clock cycles); each instruction passes through I$, Reg, ALU, D$, Reg]

Control Hazard: Branching (6/9)
- User inserting no-op instruction
[Pipeline diagram: add, beq, nop in instruction order against time (clock cycles); the nop appears as a bubble in the pipeline]
- Impact: 2 clock cycles per branch instruction
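The manual fix on the last slide, where the programmer places a no-op after each branch, can be sketched as a small source transformation. This Python sketch assumes the comparator sits in Stage 2, so one no-op per branch suffices. The function name, the set of branch mnemonics, and the sample program are illustrative assumptions, not part of the lecture.

```python
BRANCH_MNEMONICS = {"beq", "bne"}  # assumed subset of MIPS branches

def insert_branch_nops(program):
    """Return a copy of `program` with a nop after every branch.

    `program` is a list of assembly lines such as "beq $t0, $t1, L1".
    Blank lines and non-branch lines are passed through unchanged.
    """
    padded = []
    for line in program:
        padded.append(line)
        fields = line.split()
        if fields and fields[0] in BRANCH_MNEMONICS:
            padded.append("nop")
    return padded

program = [
    "add $t0, $t1, $t2",
    "beq $t3, $t4, done",
    "sub $t5, $t6, $t7",
]
for line in insert_branch_nops(program):
    print(line)
```

Each branch now occupies two issue slots (the branch itself plus its no-op), which corresponds to the 2 clock cycles per branch instruction noted on the slide.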