Superscalar Processing CS 740 September 23 25 2002 Intel Processors 486 Pentium Pentium Pro Superscalar Processor Design Use PowerPC 604 as case study Speculative Execution Register Renaming Branch Prediction More Superscalar Examples MIPS R10000 DEC Alpha 21264 Superscalar Terminology Basic Superscalar Superpipelined Branch prediction Able to issue 1 instruction cycle Deep but not superscalar pipeline E g MIPS R5000 has 8 stages Logic to guess whether or not branch will be taken and possibly branch target Advanced Out of order Speculation Able to issue instructions out of program order Execute instructions beyond branch points possibly nullifying later Register renaming Able to dynamically assign physical registers to instructions Retire unit Logic to keep track of instructions as they complete 2 CS 740 F 02 Page 1 Intel x86 Processors Processor 8086 YearTransistorsMHz Spec92 Int FP Spec95 Int FP 78 29K 4 Basis of IBM PC PC XT i286 83 134K 8 Basis of IBM PC AT i386 i486 86 88 89 Pentium 93 275K 16 33 1 2M 20 50 3 1M 66 150 5 5M 150 200 7 5M 300 14M PentiumPro 95 Pentium II 97 IA 65 02 6 3 28 13 78 64 181 125 245 220 320 283 4 3 6 1 8 2 11 6 3 3 0 4 8 6 0 6 8 CS 740 F 02 Other Processors Processor Year Transistors MIPS R3000 88 DecStation 5000 120 MIPS R5000 3 6M Wean Hall SGIs MIPS R10000 95 5 9M Most Advanced MIPS Alpha 21164a 96 9 3M Fastest Available Alpha 21264 97 Fastest Announced 15M MHz Spec92 25 16 1 21 7 Spec95 180 4 1 4 4 200 300 600 8 9 17 2 417 500 500 750 11 17 12 6 18 3 500 4 30 60 CS 740 F 02 Page 2 Architectural Performance Metric SpecX92 Mhz Normalizes with respect to clock speed But one measure of good arch is how fast can run clock Sampling Processor i386 387 i486DX Pentium PentiumPro MIPS R3000A MIPS R10000 Alpha 21164a MHz 33 50 150 200 25 200 417 SpecInt92 6 28 181 320 16 1 300 500 IntAP 0 2 0 6 1 2 1 6 0 6 1 5 1 2 SpecFP92 3 13 125 283 21 7 600 750 5 FltAP 0 1 0 3 0 8 1 4 0 9 3 0 1 8 CS 740 F 02 x86 ISA Characteristics Multiple Data Sizes and Addressing Methods Recent generations optimized for 32 bit mode Limited Number of Registers Stack oriented procedure call and FP instructions Programs reference memory heavily 41 Variable Length Instructions First few bytes describe operation and operands Remaining ones give immediate data address displacements Average is 2 5 bytes 6 CS 740 F 02 Page 3 i486 Pipeline Fetch Load 16 bytes of instruction into prefetch buffer Decode1 Determine instruction length instruction type Decode2 Compute memory address Generate immediate operands Execute Register Read ALU operation Memory read write Write Back Update register file 7 CS 740 F 02 Pipeline Stage Details Fetch Moves 16 bytes of instruction stream into code queue Not required every time About 5 instructions fetched at once Only useful if don t branch Avoids need for separate instruction cache D1 Determine total instruction length Signals code queue aligner where next instruction begins May require two cycles When multiple operands must be decoded About 6 of typical DOS program 8 CS 740 F 02 Page 4 Stage Details Cont D2 Extract memory displacements and immediate operands Compute memory addresses Add base register and possibly scaled index register May require two cycles If index register involved or both address immediate operand Approx 5 of executed instructions EX Read register operands Compute ALU function Read or write memory data cache WB Update register result 9 CS 740 F 02 Data Hazards Data Hazards Generated ALU Load ALU ALU Used ALU ALU Store Eff Address Handling EX EX Forwarding EX EX Forwarding EX EX Forwarding Stall EX ID2 Forwarding 10 CS 740 F 02 Page 5 Control Hazards Jump Instr Jump 1 ID1 ID2 EX ID1 ID2 Jump 2 Target ID1 Fetch Jump Instruction Processsing Continue pipeline assuming branch not taken Resolve branch condition in EX stage Also speculatively fetch at target during EX stage 11 CS 740 F 02 Control Hazards Cont Branch Not Taken Jump Instr Allow pipeline to continue Jump Total of 1 cycle for instruction ID1 1 ID2 EX ID1 ID2 EX ID1 ID2 Jump 2 Jump 3 ID1 Target Branch taken Jump Instr Flush instructions in pipe Jump Begin ID1 at target Total of 3 cycles for instructionJump 1 2 Target 12 Fetch ID1 Flushed ID2 EX ID1 ID2 Flushed ID1 Flushed Fetch ID1 CS 740 F 02 Page 6 Comparison with Our pAlpha Pipeline Two Decoding Stages Harder to decode CISC instructions Effective address calculation in D2 Multicycle Decoding Stages For more difficult decodings Stalls incoming instructions Combined Mem EX Stage Avoids load stall without load delay slot But introduces stall for address computation 13 CS 740 F 02 Comparison to 386 Cycles Per Instruction Instruction Type Load Store ALU Jump taken Jump not taken Call 386 Cycles 4 2 2 9 3 9 486 Cycles 1 1 1 3 1 3 Reasons for Improvement On chip cache Faster loads stores More pipelining 14 CS 740 F 02 Page 7 Pentium Block Diagram Memory Data Bus Microcprocessor Report 10 28 92 15 CS 740 F 02 Pentium Pipeline Fetch Align Instruction Decode Instr Generate Control Word Decode Control Word Generate Memory Address Decode Control Word Generate Memory Address Access data cache or calculate ALU result Access data cache or calculate ALU result Write register result Write register result U Pipe V Pipe 16 CS 740 F 02 Page 8 Superscalar Execution Can Execute Instructions I1 I2 in Parallel if Both are simple instructions Don t require microcode sequencing Some operations require U pipe resources 90 of SpecInt instructions I1 is not a jump Destination of I1 not source of I2 But can handle I1 setting CC and I2 being cond jump Destination of I1 not destination of I2 If Conditions Don t Hold Issue I1 to U Pipe I2 issued on next cycle Possibly paired with following instruction 17 CS 740 F 02 Branch Prediction Branch Target Buffer Stores information about previously executed branches Indexed by instruction address Specifies branch destination whether or not taken 256 entries Branch Processing Look for instruction in BTB If found start fetching at destination Branch condition resolved early in WB If prediction correct no branch penalty If prediction incorrect lose 3 cycles Which corresponds to 3 instructions Update BTB 18 CS 740 F 02 Page 9 Superscalar Execution Example Data Flow Assumptions f2 Single FP adder takes 2 cycles Single FP multipler takes 5 cycles Can issue add multiply together Must issue in order Critical Path 9 cycles Single adder data dependence In order v f4 f6 f4 f8 w x y z z f10 v addt w mult f10 f6 f10 x addt f10 f8 f12
View Full Document
Unlocking...