UCB CS61C Machine Structures Lecturer SOE Dan Garcia Lecture 29 CPU Design Pipelining to Improve Performance II 2010 04 07 IS 3D BAD FOR YOU MANY HAVE EYESTRAIN Cal researcher Marty Banks has put together a system to help with the eyestrain many viewers experience with 3D content on a small screen the vergence www technologyreview com computing 24976 accomodation conflict Review Pipelining is a BIG idea Optimal Pipeline Each stage is executing part of an instruction each clock cycle One instruction finishes during each clock cycle On average execute far more quickly What makes this work Similarities between instructions allow us to use same stages for all instructions generally Each stage takes about the same CS61C L29 CPU Design Pipelining to Improve Performance II 2 Garcia Spring 2010 UC Problems for Pipelining CPUs Limits to pipelining Hazards prevent next instruction from executing during its designated clock cycle Structural hazards HW cannot support some combination of instructions single person to fold and put clothes away Control hazards Pipelining of branches causes later instruction fetches to wait for the result of the branch Data hazards Instruction depends on result of prior instruction still in the pipeline missing sock These might result in pipeline stalls CS61C L29 CPU Design Pipelining to Improve Performance II 3 Garcia Spring 2010 UC Structural Hazard 1 Single Memory 1 2 Time clock cycles ALU I n I D Reg Reg s Load I D Reg Reg t Instr 1 r I D Reg Reg Instr 2 O I D Reg Reg Instr 3 r I D Reg Reg d Instr 4 e r Read same memory twice in same clock cycle ALU ALU ALU ALU CS61C L29 CPU Design Pipelining to Improve Performance II 4 Garcia Spring 2010 UC Structural Hazard 1 Single Memory 2 2 Solution infeasible and inefficient to create second memory We ll learn about this more friday next week so simulate this by having two Level 1 Caches a temporary smaller of usually most recently used copy of memory have both an L1 Instruction Cache and an L1 Data Cache need more complex hardware to CS61C L29 CPU Design Pipelining to Improve Performance II 5 Garcia Spring 2010 UC Structural Hazard 2 1 2 clock cycles I Registers Time Reg Reg D Reg I Reg D Reg I Reg ALU D Reg I Reg ALU I D ALU Reg ALU O Instr 2 r Instr 3 d e Instr 4 r I ALU n s t sw r Instr 1 D Reg Can we read and write to registers simultaneously CS61C L29 CPU Design Pipelining to Improve Performance II 6 Garcia Spring 2010 UC Structural Hazard 2 2 2 Registers Two different solutions have been used 1 RegFile access is VERY fast takes less than half the time of ALU stage Write to Registers during first half of each clock cycle Read from Registers during second half of each clock cycle 2 Build RegFile with independent read and write ports Result can perform Read and Write during same clock cycle CS61C L29 CPU Design Pipelining to Improve Performance II 7 Garcia Spring 2010 UC Control Hazard Branching 1 9 Time clock cycles ALU I n I D Reg Reg beq s I D Reg Reg t Instr 1 r I D Reg Reg Instr 2 O I D Reg Reg Instr 3 r I D Reg Reg d Instr 4 e r Where do we do the compare for the branch ALU ALU ALU ALU CS61C L29 CPU Design Pipelining to Improve Performance II 8 Garcia Spring 2010 UC Control Hazard Branching 2 9 We had put branch decision making hardware in ALU stage therefore two more instructions after the branch will always be fetched whether or not the branch is taken Desired functionality of a branch if we do not take the branch don t waste any time and continue executing normally if we take the branch don t execute any instructions after the branch just go to the desired label CS61C L29 CPU Design Pipelining to Improve Performance II 9 Garcia Spring 2010 UC Control Hazard Branching 3 9 Initial Solution Stall until decision is made insert no op instructions those that accomplish nothing just take time or hold up the fetch of the next instruction for 2 cycles Drawback branches take 3 clock cycles each assuming comparator is put in ALU stage CS61C L29 CPU Design Pipelining to Improve Performance II 10 Garcia Spring 2010 UC Control Hazard Branching 4 9 Optimization 1 insert special branch comparator in Stage 2 as soon as instruction is decoded Opcode identifies it as a branch immediately make a decision and set the new value of the PC Benefit since branch is complete in Stage 2 only one unnecessary instruction is fetched so only one noop is needed Side Note This means that branches are idle in Stages 3 4 and 5 CS61C L29 CPU Design Pipelining to Improve Performance II 11 Garcia Spring 2010 UC Control Hazard Branching 5 9 Time clock cycles ALU I n I D Reg Reg s beq I D Reg Reg t Instr 1 r I D Reg Reg Instr 2 O I D Reg Reg Instr 3 r I D Reg Reg d Instr 4 e r Branch comparator moved to Decode stage ALU ALU ALU ALU CS61C L29 CPU Design Pipelining to Improve Performance II 12 Garcia Spring 2010 UC Control Hazard Branching 6 9 ALU I User inserting no op instruction n Time clock cycles s I D Reg Reg t add r I D Reg Reg beq ALU ALU bub bub bub bub bub O nop ble ble ble ble ble r D Reg Reg I d lw e r Impact 2 clock cycles per branch instruction slow CS61C L29 CPU Design Pipelining to Improve Performance II 13 Garcia Spring 2010 UC Control Hazard Branching 7 9 ALU I Controller inserting a single n bubble Time clock cycles s I D Reg Reg t add r I D Reg Reg beq ALU ALU bub O lw D Reg Reg I ble r d e r Impact 2 clock cycles per branch instruction slow story about engineer physicist mathematician asked to build a fenceGarcia around Spring 2010 a UC CS61C L29 CPU Design Pipelining to Improve Performance II 14 Control Hazard Branching 8 9 Optimization 2 Redefine branches Old definition if we take the branch none of the instructions after the branch get executed by accident New definition whether or not we take the branch the single instruction immediately following the branch gets executed called the branch delay slot The term Delayed Branch means we always execute inst after branch CS61C L29 CPU Design Pipelining to Improve Performance II 15 Garcia Spring 2010 UC Control Hazard Branching 9 9 Notes on Branch Delay Slot Worst Case Scenario can always put a no op in the branch delay slot Better Case can find an instruction preceding the branch which can be placed in the branch delay slot without affecting flow of the program re ordering instructions is a common method of speeding up programs compiler must be very smart in order to find instructions to …
View Full Document
Unlocking...