Pipelining IV

Topics
  - Implementing pipeline control
  - Pipelining and performance analysis

Systems I


Implementing Pipeline Control

  - Combinational logic generates pipeline control signals
  - Action occurs at start of following cycle

[Figure: pipeline registers F, D, E, M, W with the pipe control logic block. Inputs: D_icode, E_icode, M_icode, E_dstM, d_srcA, d_srcB, e_Bch. Outputs: F_stall, D_stall, D_bubble, E_bubble.]


Initial Version of Pipeline Control

    bool F_stall =
        # Conditions for a load/use hazard
        E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB } ||
        # Stalling at fetch while ret passes through pipeline
        IRET in { D_icode, E_icode, M_icode };

    bool D_stall =
        # Conditions for a load/use hazard
        E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB };

    bool D_bubble =
        # Mispredicted branch
        (E_icode == IJXX && !e_Bch) ||
        # Stalling at fetch while ret passes through pipeline
        IRET in { D_icode, E_icode, M_icode };

    bool E_bubble =
        # Mispredicted branch
        (E_icode == IJXX && !e_Bch) ||
        # Load/use hazard
        E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB };


Control Combinations

  - Special cases that can arise on the same clock cycle

Combination A
  - Not-taken branch
  - ret instruction at branch target

Combination B
  - Instruction that reads from memory to %esp
  - Followed by ret instruction

[Figure: pipeline states for each special condition. Load/use: load in E, use in D. Mispredict: JXX in E. ret 1: ret in D. ret 2: ret in E, bubble in D. ret 3: ret in M, bubbles in D and E. Combination A pairs Mispredict with ret 1; Combination B pairs Load/use with ret 1.]


Control Combination A

  - Should handle as mispredicted branch
  - Stalls F pipeline register
  - But PC selection logic will be using M_valA anyhow

    Condition              F        D        E        M        W
    Processing ret         stall    bubble   normal   normal   normal
    Mispredicted branch    normal   bubble   bubble   normal   normal
    Combination            stall    bubble   bubble   normal   normal

[Figure: PIPE hardware structure showing the Fetch, Decode, Execute, Memory, and Write back stages, with the Select PC logic fed by M_valA and W_valM.]


Control Combination B

  - Would attempt to bubble and stall pipeline register D
  - Signaled by processor as pipeline error

    Condition              F        D                E        M        W
    Processing ret         stall    bubble           normal   normal   normal
    Load/use hazard        stall    stall            bubble   normal   normal
    Combination            stall    bubble + stall   bubble   normal   normal


Handling Control Combination B

  - Load/use hazard should get priority
  - ret instruction should be held in decode stage for an additional cycle

    Condition              F        D        E        M        W
    Processing ret         stall    bubble   normal   normal   normal
    Load/use hazard        stall    stall    bubble   normal   normal
    Combination            stall    stall    bubble   normal   normal


Corrected Pipeline Control Logic

  - Load/use hazard should get priority
  - ret instruction should be held in decode stage for an additional cycle

    bool D_bubble =
        # Mispredicted branch
        (E_icode == IJXX && !e_Bch) ||
        # Stalling at fetch while ret passes through pipeline
        IRET in { D_icode, E_icode, M_icode }
        # but not condition for a load/use hazard
        && !(E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB });


Pipeline Summary

Data Hazards
  - Most handled by forwarding
      - No performance penalty
  - Load/use hazard requires one cycle stall

Control Hazards
  - Cancel instructions when a mispredicted branch is detected
      - Two clock cycles wasted
  - Stall fetch stage while ret passes through pipeline
      - Three clock cycles wasted

Control Combinations
  - Must analyze carefully
  - First version had a subtle bug
      - Only arises with an unusual instruction combination


Performance Analysis with Pipelining

Ideal pipelined machine: CPI = 1
  - One instruction completed per cycle
  - But much faster cycle time than unpipelined machine

However, hazards work against the ideal
  - Hazards resolved using forwarding are fine
  - Stalling degrades performance because it interrupts the instruction completion rate

CPI is a measure of the "architectural efficiency" of the design

\[ \text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} \]


Computing CPI

CPI
  - Function of useful instructions (C_i) and bubbles (C_b)
  - C_b / C_i represents the pipeline penalty due to stalls

\[ \text{CPI} = \frac{C_i + C_b}{C_i} = 1.0 + \frac{C_b}{C_i} \]

Can reformulate to account for
  - load penalties (lp)
  - branch misprediction penalties (mp)
  - return penalties (rp)

\[ \text{CPI} = 1.0 + lp + mp + rp \]


Computing CPI - II

So how do we determine the penalties?
  - Depends on how often each situation occurs on average
  - How often does a load occur, and how often does that load cause a stall?
  - How often does a branch occur, and how often is it mispredicted?
  - How often does a return occur?

We can measure these
  - simulator
  - hardware performance counters

We can estimate through historical averages
  - Then use them to make early design tradeoffs for the architecture


Computing CPI - III

Total penalty (lp + mp + rp): 0.31

CPI = 1 + 0.31 = 1.31, i.e. 31% worse than ideal

This gets worse when:
  - Accounting for non-ideal memory access latency
  - Deeper pipelines (where stalls per hazard increase)
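
As a rough illustration of the CPI formula above, the following Python sketch computes each penalty as (instruction frequency) x (fraction of those instructions that trigger the hazard) x (bubbles per hazard). The bubble counts of 1, 2, and 3 come from the Pipeline Summary slide; the function names and the frequencies in the usage note are assumptions chosen only for illustration, not the numbers behind the 0.31 total in these slides.

```python
# Bubbles injected per hazard, as listed in the Pipeline Summary slide.
LOAD_USE_BUBBLES = 1    # load/use hazard: one-cycle stall
MISPREDICT_BUBBLES = 2  # mispredicted branch: two cycles wasted
RETURN_BUBBLES = 3      # ret: three cycles wasted


def penalty(instr_freq, cond_freq, bubbles):
    """Average bubbles per instruction contributed by one hazard type."""
    return instr_freq * cond_freq * bubbles


def cpi(load_freq, load_use_rate, branch_freq, mispredict_rate, ret_freq):
    """CPI = 1.0 + lp + mp + rp (all arguments are fractions in [0, 1])."""
    lp = penalty(load_freq, load_use_rate, LOAD_USE_BUBBLES)
    mp = penalty(branch_freq, mispredict_rate, MISPREDICT_BUBBLES)
    rp = penalty(ret_freq, 1.0, RETURN_BUBBLES)  # every ret stalls fetch
    return 1.0 + lp + mp + rp


if __name__ == "__main__":
    # Hypothetical workload mix, made up for this example.
    print(cpi(load_freq=0.25, load_use_rate=0.20,
              branch_freq=0.20, mispredict_rate=0.40,
              ret_freq=0.02))
```

With the hypothetical mix shown, the penalties are 0.05, 0.16, and 0.06, so the result is roughly 1.27. Measured frequencies from a simulator or hardware performance counters would replace these placeholder values.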
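
Returning to the pipeline control logic: the corrected HCL conditions can be checked against the two control combinations with a small Python model. This is a minimal sketch, not the simulator's implementation; the string encodings for instruction codes and registers, and the function name pipeline_control, are assumptions made for the example.

```python
# Hypothetical encodings for instruction codes (any distinct values work).
IMRMOVL, IPOPL, IJXX, IRET, INOP = "mrmovl", "popl", "jXX", "ret", "nop"


def pipeline_control(D_icode, E_icode, M_icode, E_dstM, d_srcA, d_srcB, e_Bch):
    """Return the four control signals using the corrected D_bubble logic."""
    load_use = E_icode in (IMRMOVL, IPOPL) and E_dstM in (d_srcA, d_srcB)
    ret_in_flight = IRET in (D_icode, E_icode, M_icode)
    mispredict = E_icode == IJXX and not e_Bch
    return {
        "F_stall": load_use or ret_in_flight,
        "D_stall": load_use,
        # ret bubbles D only when no load/use hazard is present.
        "D_bubble": mispredict or (ret_in_flight and not load_use),
        "E_bubble": mispredict or load_use,
    }


# Combination A: not-taken branch in E, ret in D.
combo_a = pipeline_control(IRET, IJXX, INOP, None, "esp", "esp", False)

# Combination B: load into %esp in E, ret (which reads %esp) in D.
combo_b = pipeline_control(IRET, IMRMOVL, INOP, "esp", "esp", "esp", False)

if __name__ == "__main__":
    print("A:", combo_a)
    print("B:", combo_b)
```

For Combination B the model asserts D_stall but not D_bubble, matching the "stall" entry in the corrected table; for Combination A it stalls F and bubbles both D and E.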