DOC PREVIEW
UMD CMSC 411 - Basic Pipelining

This preview shows page 1-2 out of 7 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 7 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 7 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 7 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

CMSC 411Computer Systems ArchitectureyLecture 5Basic Pipelining (cont )Basic Pipelining (cont.)Alan [email protected]@csu deduAdministrivia• Homework problems for Unit 1 due today• Homework problems for Unit 3 posted soonCMSC 411 - 5 (from Patterson)2Forwarding to Avoid LW-SW Data HazardFigure A.8, Page A-29Time (clock cycles)IInstadd r1,r2,r3RegALUDMemIfetchRegr.Orlw r4, 0(r1)121RegLUDMemIfetchRegRegALUDMemIfetchRegrdersw r4,12(r1)or r8 r6 r9RegALDMemIfetchRegRegALUDMemIfetchRegor r8,r6,r9xor r10,r9,r11RegALUDMemIfetchRegCMSC 411 - 5 (from Patterson)3Data Hazard Even with ForwardingFigure A.9, Page A-20Time (clock cycles)Ilr1 0(r2)RLUDMIf t hRInstlwr1, 0(r2)sub r4r1r6RegALDMemIfetchRegRegLUDMemIfetchRegtr.Osub r4,r1,r6and r6r1r7RegALDMemIfetchRegRegALUDMemIfetchRegOrdeand r6,r1,r7or r8r1r9gARegALUDMemIfetchRegCMSC 411 - 5 (from Patterson)4ror r8,r1,r9AData Hazard Even with Forwarding(Similar to Figure A.10, Page A-21)Time (clock cycles)Time (clock cycles)Inslw r1, 0(r2)RegALUDMemIfetchRegstr.,()sub r4,r1,r6RegIfetchALUDMemRegBubbleOrde,,and r6,r1,r7IfetchALUDMemRegBubbleRegor r8,r1,r9erIfetchALUDMemBubbleRegCMSC 411 - 5 (from Patterson)5Software Scheduling InsteadTry producing fast code fora=b+c;a b c;d = e – f;assuming a, b, c, d ,e, and f in memory. Slow code:LW Rb,bLW Rc,cFast code:LW Rb,bLW Rc,cADD Ra,Rb,RcSW a,Ra LW Re,e LW Re,e ADD Ra,Rb,RcLW Rf,fLW Rf,fSUB Rd,Re,RfSW d,RdSW a,Ra SUB Rd,Re,RfSW d,RdCMSC 411 - 5 (from Patterson)6Compiler optimizes for performance. Hardware checks for safety.Control hazards• Question: When do we find out that the PC needs to be modified?–Answer: In pipeline stage ID of a branch instruction– So, if a branch is not-taken (i.e., if the PC is not modified), need a one-cycle delay• Question: When is a taken branch’s address known?– ALU used to compute, so EX stage– Need two (or three) cycle delayCMSC 411 - 5 (from Patterson)7Example• If branch in 30% of instructions, then instead of executing 1 instruction per cycle, have 70% of i t ti ti i 1 l d 30% finstructions executing in 1 cycle and 30% of instructions executing in 2 cycles• An average of .7 + .6 = 1.3 cycles per instructiongyp– Worse by 30%CMSC 411 - 5 (from Patterson)8Control Hazard on Branches Three Stage Stall10:beq r1,r3,34RegALUDMemIfetchReg14:and r2,r3,r5 RegALUDMemIfetchReg18:or r6,r1,r7RegALUDMemIfetchReg22:add r8,r1,r9RegALUDMemIfetchRegU34:xor r10,r1,r11RegALUDMemIfetchRegWhat do you do with the 3 instructions in between?CMSC 411 - 5 (from Patterson)9How do you do it?Where is the “commit”?Branch Stall Impact• If CPI = 1, 30% branch, St ll 3 l CPI 1 9!Stall 3 cycles => new CPI = 1.9!• Two part solution:–Determine branch taken or nottaken sooner, ANDDetermine branch taken or nottaken sooner, AND– Compute taken branch address earlier• MIPS branch tests if register = 0 or ≠ 0MIPS S l i•MIPS Solution:– Move Zero test to ID/RF stage– Adder to calculate new PC in ID/RF stage– 1 clock cycle penalty for branch versus 3CMSC 411 - 5 (from Patterson)10Pipelined MIPS DatapathFigure A.24, page A-38g,pgMemoryAccessWriteBackInstructionFetchInstr. DecodeReg. FetchExecuteAddr. CalcNext Next PCAdderZero?4AdderNext SEQ PCNext PCMUXIF/ALUMemorReg FMDMeMEM/WEX/ME4AddresRS1RS2ID/EIDUryileMUXDataemoryMUXSignExtendWBEMatassXExtendRD RD RDWB DImmCMSC 411 - 5 (from Patterson)11• Interplay of instruction set design and cycle time.Four Branch Hazard Alternatives#1: Stall until branch direction is clear#2: Predict Branch Not Taken– Execute successor instructions in sequence– “Squash” instructions in pipeline if branch actually taken– Advantage of late pipeline state update– 47% MIPS branches not taken on average– PC+4 already calculated, so use it to get next instruction#3: Predict Branch Taken– 53% MIPS branches taken on average–But haven’t calculated branch target address in MIPS–But haven t calculated branch target address in MIPS» MIPS still incurs 1 cycle branch penalty» Other machines: branch target known before outcomeCMSC 411 - 5 (from Patterson)12Four Branch Hazard Alternatives#4: Delayed Branch– Define branch to take place AFTER a following instructionbranch instructionsequential successor1sequential successor2........Branch delay of length n........sequential successornbranch target if takenyg–1 slot delay allows proper decision and branch target address in 5 stage pipeline– MIPS uses thisCMSC 411 - 5 (from Patterson)13Scheduling Branch Delay Slots (Fig A.14)add R1,R2,R3if R2=0 thenA. From before branch B. From branch target C. From fall throughadd R1,R2,R3if R1=0 thensub R4,R5,R6delay slotadd R1,R2,R3if R1=0 thendl ltdelay slotsubR4 R5 R6delay slotsubR4,R5,R6becomes becomes becomesadd R1,R2,R3if R2=0 thenadd R1,R2,R3add R1,R2,R3ifR1=0 thenif R1=0 thensub R4,R5,R6• A is the best choice, fills delay slot & reduces instruction count (IC)ifR10 thensub R4,R5,R6CMSC 411 - 5 (from Patterson)14,y ()• In B, the sub instruction may need to be copied, increasing IC• In B and C, must be okay to execute sub when branch failsScheduling branch delay slot• If taken from before branch– branch must not depend on rescheduled instruction– always improves performance•If taken from branch target•If taken from branch target– must be OK to execute rescheduled instructions if branch not taken, and may need to duplicate insts.fidhbhtk–performance improved when branch taken• If taken from fall through–must be OK to executeinstsif branch taken–must be OK to execute insts. if branch taken– improves performance when branch not takenCMSC 411 - 5 (from Patterson)15Delayed Branchy•Compiler effectiveness for single branch delay slot:Compiler effectiveness for single branch delay slot:– Fills about 60% of branch delay slots– About 80% of instructions executed in branch delay slots useful in computation– About 50% (60% x 80%) of slots usefully filled• Delayed Branch downside: As processor go to deeper pipelines and multiple issue the branchdeeper pipelines and multiple issue, the branch delay grows and need more than one delay slot– Delayed branching has lost popularity compared to more expensive but more flexible dynamic approachespypp– Growth in available transistors has made dynamic approaches relatively cheaperCMSC 411 - 5 (from Patterson)16Evaluating Branch AlternativesPi li dPipeline depthAssume: 4% unconditional branch, Pipeline speedup = Pipeline depth1 +Branch frequency × Branch


View Full Document

UMD CMSC 411 - Basic Pipelining

Documents in this Course
Load more
Download Basic Pipelining
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Basic Pipelining and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Basic Pipelining 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?