Unformatted text preview:

Loop Example Cycle 7Superscalar ProcessorsReview: Unrolled Loop that Minimizes Stalls for ScalarLoop Unrolling in SuperscalarDynamic Branch PredictionOne-bit predictorSlide 7Analyze BHTN-bit saturating counterCan we do better?Correlating BranchesSlide 12Accuracy of Different Schemes (Figure 4.21, p. 272)Re-evaluating CorrelationNeed Address at Same Time as PredictionDynamic Branch Prediction SummaryPredicated instructionsPredicated InstructionsHW support for More ILPHW support for SpeculationFour Steps of Speculative Tomasulo AlgorithmRenaming RegistersDynamic Scheduling in PowerPC 604 and Pentium ProSlide 24Dynamic Scheduling in Pentium ProGetting CPI < 1: Issuing Multiple Instructions/CycleSlide 27Multiple Issue ChallengesLoop Unrolling in VLIWTrace SchedulingAdvantages of HW (Tomasulo) vs. SW (VLIW) SpeculationSuperscalar v. VLIWIntel/HP “Explicitly Parallel Instruction Computer (EPIC)”Dynamic Scheduling in SuperscalarSlide 35Performance of Dynamic SSSoftware PipeliningSoftware Pipelining ExampleLimits to Multi-Issue MachinesSlide 40Limits to ILPSlide 42Upper Limit to ILP: Ideal Machine (Figure 4.38, page 319)More Realistic HW: Branch Impact Figure 4.40, Page 323Selective History PredictorMore Realistic HW: Register Impact Figure 4.44, Page 328More Realistic HW: Alias Impact Figure 4.46, Page 330Realistic HW for ‘9X: Window Impact (Figure 4.48, Page 332)Braniac vs. Speed Demon(1993)3 1996 Era MachinesSPECint95base Performance (July 1996)SPECfp95base Performance (July 1996)3 1997 Era MachinesSPECint95base Performance (Oct. 1997)SPECfp95base Performance (Oct. 1997)SummaryFTC.W99 1Loop Example Cycle 7Instruction status ExecutionWriteInstructionj k iteration IssuecompleteResult Busy AddressLD F0 0 R1 1 1 Load1 Yes 80MULTDF4 F0 F2 1 2 Load2 Yes 72SD F4 0 R1 1 3 Load3 No QiLD F0 0 R1 2 6 Store1 Yes 80 Mult1MULTDF4 F0 F2 2 7 Store2 NoSD F4 0 R1 2 Store3 NoReservation Stations S1 S2RS for jRS for kTimeNameBusyOp Vj Vk Qj Qk Code:0 Add1 No LD F0 0 R10 Add2 NoMULTDF4 F0 F20 Add3 No SD F4 0 R10 Mult1 Yes MULTD R(F2) Load1 SUBI R1 R1 #80 Mult2 Yes MULTD R(F2) Load2 BNEZ R1 LoopRegister result statusClockR1F0 F2 F4 F6 F8F10F12...F307 72 Qi Load2 Mult2• Note: MULT2 has no registers names in RSFTC.W99 2Superscalar Processors•What support do we need to fetch & issue multiple instructions/cycle? (that we have already seen)–1.–2.FTC.W99 3Review: Unrolled Loop that Minimizes Stalls for Scalar1 Loop: LD F0,0(R1)2 LD F6,-8(R1)3 LD F10,-16(R1)4 LD F14,-24(R1)5 ADDD F4,F0,F26 ADDD F8,F6,F27 ADDD F12,F10,F28 ADDD F16,F14,F29 SD 0(R1),F410 SD -8(R1),F811 SD -16(R1),F1212 SUBI R1,R1,#3213 BNEZ R1,LOOP14 SD 8(R1),F16 ; 8-32 = -2414 clock cycles, or 3.5 per iterationLD to ADDD: 1 CycleADDD to SD: 2 CyclesFTC.W99 4Loop Unrolling in SuperscalarInteger instruction FP instruction Clock cycleLoop: LD F0,0(R1) 1LD F6,-8(R1) 2LD F10,-16(R1) ADDD F4,F0,F2 3LD F14,-24(R1) ADDD F8,F6,F2 4LD F18,-32(R1) ADDD F12,F10,F2 5SD 0(R1),F4 ADDD F16,F14,F2 6SD -8(R1),F8 ADDD F20,F18,F2 7SD -16(R1),F12 8SD -24(R1),F16 9SUBI R1,R1,#40 10BNEZ R1,LOOP 11SD -32(R1),F20 12•Unrolled 5 times to avoid delays (+1 due to SS)•12 clocks, or 2.4 clocks per iteration (1.5X)FTC.W99 5Dynamic Branch Prediction•With out-of-order or superscalars, branches affect performance more. Why?•How do we judge different prediction schemes?–Space–Performance - f(accuracy, cost of misprediction)FTC.W99 6One-bit predictor•Keep 1-bit for each branch–tells whether it was taken last time•Store lower bits of address in table–no address check - may be incorrect branch!•If it stores lower 8 bits, what is the table size?•What is the accuracy of a nine-iteration loop?•What is the penalty in DLX?FTC.W99 7Dynamic Branch Prediction•Solution: 2-bit scheme where change prediction only if get misprediction twice: (Figure 4.13, p. 264)•Red: stop, not taken•Green: go, takenTTTTNTNTNTNTPredict TakenPredict Not Taken Predict TakenPredict Not TakenFTC.W99 8Analyze BHT•How does it do with the 9-iteration loop?•Mispredict because either:–Wrong guess for that branch–Got branch history of wrong branch when index the table•4096 entry table programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%•4096 about as good as infinite table(in Alpha 211164)FTC.W99 9N-bit saturating counter•N bit counter can take from 0 to (2^n)-1•if count >= 2^(n-1), branch predict taken•if taken, increment; if untaken, decrement•How does 2-bit counter differ from 2-bit BHT?FTC.W99 10Can we do better?•If (d==0)•{ …….; d = 1;……}•if (d==1)•{ blah, blah, blah}FTC.W99 11Correlating Branches•Hypothesis: recent branches are correlated; that is, behavior of recently executed branches affects prediction of current branch•Idea: record m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table•In general, (m,n) predictor means record last m branches to select between 2m history tables each with n-bit counters–What, then was the 1-bit predictor? 2-bit predictor?FTC.W99 12Correlating Branches(2,2) predictor–Then behavior of recent branches selects between, say, four predictions of next branch, updating just that prediction –If b3-not taken, b4-taken, this shows entry for b5.–Where is entry for b5 if b3-taken, b4-not taken?–What’s relationship between this and 2-bit?Branch address2-bits per branch predictorsPredictionPrediction2-bit global branch history00 01 10 11FTC.W99 13Frequency of Mispredictions0%2%4%6%8%10%12%14%16%18%nasa7matrix300tomcatvdoducdspicefppppgccespressoeqntottli0%1%5%6%6%11%4%6%5%1%4,096 entries: 2-bits per entryUnlimited entries: 2-bits/entry1,024 entries (2,2)Accuracy of Different Schemes(Figure 4.21, p. 272)4096 Entries 2-bit BHTUnlimited Entries 2-bit BHT1024 Entries (2,2) BHT0%18%Frequency of MispredictionsFTC.W99 14Re-evaluating Correlation•Several of the SPEC benchmarks have less than a dozen branches responsible for 90% of taken branches:program branch % static# = 90%compress 14%236 13eqntott 25%494 5gcc 15%9531 2020mpeg 10%5598 532real gcc 13%17361 3214•Real programs + OS more like gcc•Small benefits beyond benchmarks for correlation? problems with branch aliases?•What does this say about benchmarks?FTC.W99 15Need Address at Same Time as Prediction•Branch Target Buffer (BTB): Address of branch index to get prediction AND branch address


View Full Document

UCD ECS 201A - Loop Example Cycle 7

Download Loop Example Cycle 7
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Loop Example Cycle 7 and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Loop Example Cycle 7 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?