Superscalar Processing Intel x86 Processors CS 740 September 25 27 2000 Processor 8086 Year TransistorsMHz Spec92 Int FP Spec95 Int FP 78 29K 4 Basis of IBM PC PC XT i286 83 134K 8 i386 275K i486 86 88 89 1 2M Pentium 93 3 1M PentiumPro 95 5 5M Pentium II 97 Merced 00 7 5M 14M 16 33 20 50 66 150 150 200 300 Basis of IBM PC AT Intel Processors 486 Pentium Pentium Pro Superscalar Processor Design Use PowerPC 604 as case study Speculative Execution Register Renaming Branch Prediction More Superscalar Examples MIPS R10000 DEC Alpha 21264 6 3 28 13 78 64 181 125 245 220 320 283 Processor Year Transistors MIPS R3000 88 DecStation 5000 120 MIPS R5000 3 6M Wean Hall SGIs MIPS R10000 95 5 9M Most Advanced MIPS Alpha 21164a 96 9 3M Fastest Available Alpha 21264 97 Fastest Announced 3 15M 3 0 4 8 6 0 6 8 CS 740 F 00 Architectural Performance MHz Spec92 25 16 1 21 7 Metric Spec95 180 4 1 4 4 200 300 600 8 9 17 2 417 500 500 750 500 2 Other Processors 4 3 6 1 8 2 11 6 SpecX92 Mhz Normalizes with respect to clock speed But one measure of good arch is how fast can run clock Sampling Processor i386 387 i486DX Pentium PentiumPro MIPS R3000A MIPS R10000 Alpha 21164a 11 17 12 6 18 3 30 60 CS 740 F 00 4 Page 1 MHz 33 50 150 200 25 200 417 SpecInt92 6 28 181 320 16 1 300 500 IntAP 0 2 0 6 1 2 1 6 0 6 1 5 1 2 SpecFP92 3 13 125 283 21 7 600 750 FltAP 0 1 0 3 0 8 1 4 0 9 3 0 1 8 CS 740 F 00 x86 ISA Characteristics i486 Pipeline Multiple Data Sizes and Addressing Methods Fetch Recent generations optimized for 32 bit mode Load 16 bytes of instruction into prefetch buffer Limited Number of Registers Decode1 Stack oriented procedure call and FP instructions Programs reference memory heavily 41 Determine instruction length instruction type Decode2 Variable Length Instructions Compute memory address Generate immediate operands First few bytes describe operation and operands Remaining ones give immediate data address displacements Average is 2 5 bytes Execute Register Read ALU operation Memory read write Write Back Update register file 5 CS 740 F 00 6 Pipeline Stage Details CS 740 F 00 Stage Details Cont Fetch D2 Moves 16 bytes of instruction stream into code queue Not required every time About 5 instructions fetched at once Only useful if don t branch Avoids need for separate instruction cache Extract memory displacements and immediate operands Compute memory addresses Add base register and possibly scaled index register May require two cycles If index register involved or both address immediate operand Approx 5 of executed instructions D1 EX Determine total instruction length Signals code queue aligner where next instruction begins May require two cycles When multiple operands must be decoded About 6 of typical DOS program Read register operands Compute ALU function Read or write memory data cache WB Update register result 7 CS 740 F 00 8 Page 2 CS 740 F 00 Data Hazards Control Hazards Data Hazards Generated ALU Load ALU ALU Used ALU ALU Store Eff Address Jump Instr Handling EX EX Forwarding EX EX Forwarding EX EX Forwarding Stall EX ID2 Forwarding Jump 1 ID1 ID2 EX ID1 ID2 Jump 2 Target ID1 Fetch Jump Instruction Processsing Continue pipeline assuming branch not taken Resolve branch condition in EX stage Also speculatively fetch at target during EX stage 9 CS 740 F 00 10 Control Hazards Cont Branch Not Taken Jump Instr Allow pipeline to continue Jump Total of 1 cycle for instruction ID1 1 ID1 ID2 EX ID1 ID2 Harder to decode CISC instructions Effective address calculation in D2 Multicycle Decoding Stages ID1 Target Fetch For more difficult decodings Stalls incoming instructions Flushed Combined Mem EX Stage Jump Instr Flush instructions in pipe Jump Begin ID1 at target Total of 3 cycles for instructionJump 1 2 Target 11 Two Decoding Stages EX Jump 3 Branch taken Comparison with Our pAlpha Pipeline ID2 Jump 2 CS 740 F 00 ID1 ID2 EX ID1 ID2 Flushed ID1 Flushed Fetch Avoids load stall without load delay slot But introduces stall for address computation ID1 CS 740 F 00 12 Page 3 CS 740 F 00 Comparison to 386 Pentium Block Diagram Cycles Per Instruction Instruction Type Load Store ALU Jump taken Jump not taken Call 386 Cycles 4 2 2 9 3 9 486 Cycles 1 1 1 3 1 3 Memory Data Bus Reasons for Improvement On chip cache Faster loads stores More pipelining Microcprocessor Report 10 28 92 13 CS 740 F 00 14 Pentium Pipeline CS 740 F 00 Superscalar Execution Can Execute Instructions I1 I2 in Parallel if Fetch Align Instruction Both are simple instructions Don t require microcode sequencing Some operations require U pipe resources 90 of SpecInt instructions I1 is not a jump Destination of I1 not source of I2 But can handle I1 setting CC and I2 being cond jump Destination of I1 not destination of I2 Decode Instr Generate Control Word Decode Control Word Generate Memory Address Decode Control Word Generate Memory Address Access data cache or calculate ALU result Access data cache or calculate ALU result Write register result Write register result U Pipe V Pipe 15 If Conditions Don t Hold Issue I1 to U Pipe I2 issued on next cycle Possibly paired with following instruction CS 740 F 00 16 Page 4 CS 740 F 00 Branch Prediction Superscalar Terminology Branch Target Buffer Basic Stores information about previously executed branches Indexed by instruction address Specifies branch destination whether or not taken 256 entries Superscalar Superpipelined Branch prediction Branch Processing Advanced Look for instruction in BTB If found start fetching at destination Branch condition resolved early in WB If prediction correct no branch penalty If prediction incorrect lose 3 cycles Which corresponds to 3 instructions Update BTB Out of order Speculation Able to issue instructions out of program order Execute instructions beyond branch points possibly nullifying later Register renaming Able to dynamically assign physical registers to instructions Retire unit Logic to keep track of instructions as they complete 17 CS 740 F 00 18 Superscalar Execution Example f2 Single FP adder takes 2 cycles Single FP multipler takes 5 cycles Can issue add multiply together Must issue in order Critical Path 9 cycles Single adder data dependence In order v f4 x Can start y as soon as adder available Must hold back z until f10 not busy adder available y f4 f8 w Out Of Order Issue f6 z z addt w mult f10 f6 f10 x addt f10 f8 f12 y addt f4 f6 z addt f4 f8 f10 19 f2 f4 f10 f4 v f12 w x inorder v addt w mult f10 f6 f10 x addt f10 f8 f12 y addt f4 f6 z addt f4 f8 f10
View Full Document
Unlocking...