• Superscalar Processing CS 740 September 25-27, 2000
• Intel x86 Processors
• Other Processors
• Architectural Performance
• x86 ISA Characteristics
• i486 Pipeline
• Pipeline Stage Details
• Stage Details (Cont.)
• Data Hazards
• Control Hazards
• Control Hazards (Cont.)
• Comparison with Our pAlpha Pipeline
• Comparison to 386
• Pentium Block Diagram
• Pentium Pipeline
• Superscalar Execution
• Branch Prediction
• Superscalar Terminology
• Superscalar Execution Example
• Adding Advanced Features
• Pentium Pro (P6)
• PentiumPro Block Diagram
• PentiumPro Operation
• Slide 24
• Limitations of x86 Instruction Set
• PPC 604
• 604 Block Diagram
• General Principles
• Processing Stages
• Fetching Instructions
• Slide 31
• Dispatch
• Dispatching Actions
• Hazard Handling with Renaming
• Read-after-Write (RAW) Dependences
• Write-after-Read (WAR) Dependences
• Write-after-Write (WAW) Dependences
• Moving Instructions Around
• Execution Resources
• Retiring Instructions
• 604 Chip
• Execution Example
• Execution Example Cycle 1
• Execution Example Cycle 2
• Cycle 3
• Execution Example Cycle 4
• Execution Example Cycle 5
• Execution Example Cycle 6
• Execution Example Cycle 7
• Living with Expensive Branches
• Branch Prediction Example
• Some Interesting Patterns
• Loop Performance (FP)
• Loop 1 Surprises
• P6 Branch Prediction
• Branch Prediction Comparisons
• Effect of Loop Unrolling
• MIPS R10000
• DEC Alpha 21264
• 21264 Block Diagram
• 21264 Pipeline
• 21264 Branch Prediction Logic
• Processor Comparisons
• Challenges Ahead
• New Era for Performance Optimization

Superscalar Processing
CS 740
September 25-27, 2000
Intel Processors
• 486, Pentium, Pentium Pro
Superscalar Processor Design
• Use PowerPC 604 as case study
• Speculative Execution, Register Renaming, Branch Prediction
More Superscalar Examples
• MIPS R10000
• DEC Alpha 21264

CS 740 F’00

Intel x86 Processors
Processor    Year   Transistors   MHz   Spec92 (Int/FP)   Spec95 (Int/FP)
8086         '78    29K           4     Basis of IBM PC & PC-XT
i286         '83    134K          8     Basis of IBM PC-AT
i386         '86    275K          16
             '88                  33    6 / 3
i486         '89    1.2M          20
                                  50    28 / 13
Pentium      '93    3.1M          66    78 / 64
                                  150   181 / 125         4.3 / 3.0
PentiumPro   '95    5.5M          150   245 / 220         6.1 / 4.8
                                  200   320 / 283         8.2 / 6.0
Pentium II   '97    7.5M          300                     11.6 / 6.8
Merced       '00?   14M?          ?                       ?

Other Processors
Processor      Year   Transistors   MHz   Spec92        Spec95
MIPS R3000     '88                  25    16.1 / 21.7                  (DecStation 5000/120)
MIPS R5000            3.6M          180                 4.1 / 4.4      (Wean Hall SGIs)
MIPS R10000    '95    5.9M          200   300 / 600     8.9 / 17.2     (Most Advanced MIPS)
Alpha 21164a   '96    9.3M          417   500 / 750     11 / 17
                                    500                 12.6 / 18.3    (Fastest Available)
Alpha 21264    '97    15M           500                 30 / 60        (Fastest Announced)

Architectural Performance
Metric
• SpecX92/MHz: Normalizes with respect to clock speed
• But … one measure of a good architecture is how fast the clock can run
Sampling
Processor      MHz   SpecInt92   IntAP   SpecFP92   FltAP
i386/387       33    6           0.2     3          0.1
i486DX         50    28          0.6     13         0.3
Pentium        150   181         1.2     125        0.8
PentiumPro     200   320         1.6     283        1.4
MIPS R3000A    25    16.1        0.6     21.7       0.9
MIPS R10000    200   300         1.5     600        3.0
Alpha 21164a   417   500         1.2     750        1.8

x86 ISA Characteristics
Multiple Data Sizes and Addressing Methods
• Recent generations optimized for 32-bit mode
Limited Number of Registers
• Stack-oriented procedure call and FP instructions
• Programs reference memory heavily (41%)
Variable Length Instructions
• First few bytes describe operation and operands
• Remaining ones give immediate data & address displacements
• Average is 2.5 bytes

i486 Pipeline
Fetch
• Load 16 bytes of instruction into prefetch buffer
Decode1
• Determine instruction length, instruction type
Decode2
• Compute memory address
• Generate immediate operands
Execute
• Register Read
• ALU operation
• Memory read/write
Write-Back
• Update register file

Pipeline Stage Details
Fetch
• Moves 16 bytes of instruction stream into code queue
• Not required every time
  – About 5 instructions fetched at once
  – Only useful if we don’t branch
• Avoids need for separate instruction cache
D1
• Determine total instruction length
  – Signals code queue aligner where next instruction begins
• May require two cycles
  – When multiple operands must be decoded
  – About 6% of “typical” DOS program

Stage Details (Cont.)
D2
• Extract memory displacements and immediate operands
• Compute memory addresses
  – Add base register, and possibly scaled index register
• May require two cycles
  – If index register involved, or both address & immediate operand
  – Approx. 5% of executed instructions
EX
• Read register operands
• Compute ALU function
• Read or write memory (data cache)
WB
• Update register result

Data Hazards
Data Hazards
Generated   Used           Handling
ALU         ALU            EX–EX Forwarding
Load        ALU            EX–EX Forwarding
ALU         Store          EX–EX Forwarding
ALU         Eff. Address   (Stall) + EX–ID2 Forwarding

Control Hazards
Jump Instruction Processing
• Continue pipeline assuming branch not taken
• Resolve branch condition in EX stage
• Also speculatively fetch at target during EX stage
Jump Instr.   ID1   ID2   EX
Jump +1             ID1   ID2
Jump +2                   ID1
Target                    Fetch

Control Hazards (Cont.)
Branch Taken
• Flush instructions in pipe
• Begin ID1 at target
• Total of 3 cycles for instruction
Jump Instr.   ID1   ID2   EX
Jump +1             ID1   ID2     (Flushed)
Jump +2                   ID1     (Flushed)
Target                    Fetch   ID1
Branch Not Taken
• Allow pipeline to continue
• Total of 1 cycle for instruction
Jump Instr.   ID1   ID2   EX
Jump +1             ID1   ID2     EX
Jump +2                   ID1     ID2
Target                    Fetch   (Flushed)
Jump +3                           ID1

Comparison with Our pAlpha Pipeline
Two Decoding Stages
• Harder to decode CISC instructions
• Effective address calculation in D2
Multicycle Decoding Stages
• For more difficult decodings
• Stalls incoming instructions
Combined Mem/EX Stage
• Avoids load stall without load delay slot
  – But introduces stall for address computation

Comparison to 386
Cycles Per Instruction
Instruction Type   386 Cycles   486 Cycles
Load               4            1
Store              2            1
ALU                2            1
Jump taken         9            3
Jump not taken     3            1
Call               9            3
Reasons for Improvement
• On chip cache
  – Faster loads & stores
• More pipelining

Pentium Block Diagram
(Microprocessor Report 10/28/92)
[Figure: Pentium block diagram; only the label "Memory Data Bus" survives]

Pentium Pipeline
Fetch & Align Instruction
Decode Instr. / Generate Control Word
U-Pipe                                      V-Pipe
Decode Control Word /                       Decode Control Word /
Generate Memory Address                     Generate Memory Address
Access data cache or calculate ALU result   Access data cache or calculate ALU result
Write register result                       Write register result

Superscalar Execution
Can Execute Instructions I1 & I2 in Parallel if:
• Both are “simple” instructions
  – Don’t require microcode sequencing
  – Some operations require U-pipe
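
The SpecX92/MHz metric from the Architectural Performance slide is simple enough to recompute by hand. The sketch below does so in Python for the processors in the Sampling table. The names `SAMPLES` and `arch_perf` are illustrative only and do not come from the slides; rounding to one decimal place is an assumption chosen to match the precision shown in the table.

```python
# Clock speed and Spec92 scores copied from the "Sampling" table:
# processor -> (MHz, SpecInt92, SpecFP92)
SAMPLES = {
    "i386/387": (33, 6, 3),
    "i486DX": (50, 28, 13),
    "Pentium": (150, 181, 125),
    "PentiumPro": (200, 320, 283),
    "MIPS R3000A": (25, 16.1, 21.7),
    "MIPS R10000": (200, 300, 600),
    "Alpha 21164a": (417, 500, 750),
}


def arch_perf(spec_score, mhz):
    """Spec92 score per MHz, rounded to one decimal (assumed table precision)."""
    return round(spec_score / mhz, 1)


# processor -> (IntAP, FltAP)
ARCH_PERF = {
    name: (arch_perf(spec_int, mhz), arch_perf(spec_fp, mhz))
    for name, (mhz, spec_int, spec_fp) in SAMPLES.items()
}

if __name__ == "__main__":
    for name, (int_ap, flt_ap) in ARCH_PERF.items():
        print(f"{name:14s} IntAP={int_ap:.1f} FltAP={flt_ap:.1f}")
```

Dividing by clock rate separates what the microarchitecture achieves per cycle from what the circuit design and process achieve in raw frequency, which is why the slide also cautions that achievable clock speed is itself a measure of a good architecture.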
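
The per-instruction cycle counts on the Comparison to 386 slide can be combined into an average cycles-per-instruction figure once an instruction mix is chosen. The following is a minimal sketch: the cycle counts are taken from the slide, but `EXAMPLE_MIX` is a made-up mix for illustration only and is not data from the lecture, and the function name `average_cpi` is likewise hypothetical.

```python
# Cycle counts from the "Comparison to 386" slide:
# instruction type -> (386 cycles, 486 cycles)
CYCLES = {
    "load": (4, 1),
    "store": (2, 1),
    "alu": (2, 1),
    "jump_taken": (9, 3),
    "jump_not_taken": (3, 1),
    "call": (9, 3),
}

I386, I486 = 0, 1  # column indexes into the tuples above


def average_cpi(mix, column):
    """Weighted average cycles per instruction for a given instruction mix.

    mix maps instruction type -> weight; weights need not sum to 1.
    """
    total_weight = sum(mix.values())
    weighted = sum(weight * CYCLES[kind][column] for kind, weight in mix.items())
    return weighted / total_weight


# HYPOTHETICAL mix, chosen only to exercise the function (not from the slides).
EXAMPLE_MIX = {
    "load": 0.25,
    "store": 0.10,
    "alu": 0.45,
    "jump_taken": 0.10,
    "jump_not_taken": 0.05,
    "call": 0.05,
}

if __name__ == "__main__":
    cpi_386 = average_cpi(EXAMPLE_MIX, I386)
    cpi_486 = average_cpi(EXAMPLE_MIX, I486)
    print(f"386 CPI: {cpi_386:.2f}, 486 CPI: {cpi_486:.2f}")
```

The same function shows the branch cost described on the Control Hazards slides: with jumps taken half the time, `average_cpi({"jump_taken": 1, "jump_not_taken": 1}, I486)` averages the 3-cycle and 1-cycle cases.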