CMU CS 15740 - Lecture

School name Carnegie Mellon University

Course CS 15740 - Computer Architecture

Superscalar Processing
CS 740
September 25 and 27, 2000

  - Intel Processors
      - 486, Pentium, Pentium Pro
  - Superscalar Processor Design
      - Use PowerPC 604 as case study
      - Speculative Execution, Register Renaming, Branch Prediction
  - More Superscalar Examples
      - MIPS R10000
      - DEC Alpha 21264

Intel x86 Processors

Processor    Year  Transistors  MHz  Spec92 (Int / FP)  Spec95 (Int / FP)  Notes
8086         1978  29K          4                                          Basis of IBM PC and PC-XT
i286         1983  134K         8                                          Basis of IBM PC-AT
i386         1986  275K         16
             1988               33   6 / 3
i486         1989  1.2M         20
                                50   28 / 13
Pentium      1993  3.1M         66   78 / 64
                                150  181 / 125          4.3 / 3.0
PentiumPro   1995  5.5M         200  320 / 283          8.2 / 6.0
Pentium II   1997  7.5M         300                     11.6 / 6.8
Merced       2000  14M 2 150245 2206 1 4 8

CS 740 F'00

Other Processors

Processor     Year  Transistors  MHz  Spec92 (Int / FP)  Spec95 (Int / FP)  Notes
MIPS R3000    1988               25   16.1 / 21.7                           DECstation 5000/120
MIPS R5000          3.6M         180                     4.1 / 4.4          Wean Hall SGIs
MIPS R10000   1995  5.9M         200  300 / 600          8.9 / 17.2         Most Advanced MIPS
Alpha 21164a  1996  9.3M         417  500 / 750          11 / 17
                                 500                     12.6 / 18.3        Fastest Available
Alpha 21264   1997  15M          500                     30 / 60            Fastest Announced

Architectural Performance

Metric
  - SpecX92 / MHz
  - Normalizes with respect to clock speed
  - But one measure of a good architecture is how fast its clock can run

Sampling

Processor     MHz  SpecInt92  IntAP  SpecFP92  FltAP
i386/387      33   6          0.2    3         0.1
i486DX        50   28         0.6    13        0.3
Pentium       150  181        1.2    125       0.8
PentiumPro    200  320        1.6    283       1.4
MIPS R3000A   25   16.1       0.6    21.7      0.9
MIPS R10000   200  300        1.5    600       3.0
Alpha 21164a  417  500        1.2    750       1.8

x86 ISA Characteristics

Multiple Data Sizes and Addressing Methods
  - Recent generations optimized for 32-bit mode
Limited Number of Registers
  - Stack-oriented procedure call and FP instructions
  - Programs reference memory heavily (41%)
Variable Length Instructions
  - First few bytes describe operation and operands
  - Remaining ones give immediate data and address displacements
  - Average is 2.5 bytes

i486 Pipeline

Fetch
  - Load 16 bytes of instruction into prefetch buffer
Decode1
  - Determine instruction length, instruction type
Decode2
  - Compute memory address
  - Generate immediate operands
Execute
  - Register read
  - ALU operation
  - Memory read/write
Write-Back
  - Update register file

Pipeline Stage Details

Fetch
  - Moves 16 bytes of instruction stream into code queue
  - Not required every time
      - About 5 instructions fetched at once
      - Only useful if don't branch
  - Avoids need for separate instruction cache
D1
  - Determine total instruction length
      - Signals code queue aligner where next instruction begins
  - May require two cycles
      - When multiple operands must be decoded
      - About 6% of "typical" DOS program

Stage Details (Cont.)

D2
  - Extract memory displacements and immediate operands
  - Compute memory addresses
      - Add base register, and possibly scaled index register
  - May require two cycles
      - If index register involved, or both address and immediate operand
      - Approx. 5% of executed instructions
EX
  - Read register operands
  - Compute ALU function
  - Read or write memory (data cache)
WB
  - Update register result

Data Hazards

Generated  Used          Handling
ALU        ALU           EX-EX forwarding
Load       ALU           EX-EX forwarding
ALU        Store         EX-EX forwarding
ALU        Eff. Address  (Stall) + EX-ID2 forwarding

Control Hazards

[Pipeline diagram: rows Jump Instr, Jump +1, Jump +2, Target; stages Fetch, ID1, ID2, EX]

Jump Instruction Processing
  - Continue pipeline assuming branch not taken
  - Resolve branch condition in EX stage
  - Also speculatively fetch at target during EX stage

Control Hazards (Cont.)

Branch Not Taken
  - Allow pipeline to continue
  - Total of 1 cycle for instruction

[Pipeline diagram: rows Jump Instr, Jump +1, Jump +2, Jump +3, Target; stages ID1, ID2, EX]

Branch Taken
  - Flush instructions in pipe
  - Begin ID1 at target
  - Total of 3 cycles for instruction

[Pipeline diagram: rows Jump Instr, Jump +1, Jump +2, Target; stages Fetch, ID1, ID2, EX, with the instructions after the jump marked Flushed]

Comparison with Our pAlpha Pipeline

Two Decoding Stages
  - Harder to decode CISC instructions
  - Effective address calculation in D2
Multicycle Decoding Stages
  - For more difficult decodings
  - Stalls incoming instructions
Combined Mem/EX Stage
  - Avoids load stall without load delay slot
  - But introduces stall for address computation
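
The jump costs and multicycle-decode frequencies quoted in the i486 slides above lend themselves to some rough cycle accounting. The following is a minimal sketch, not part of the lecture: it reads the "6" and "5" figures as percentages, assumes each two-cycle decode costs exactly one extra cycle, and takes the fraction of taken jumps as a hypothetical input. The function and constant names are made up for illustration.

```python
# Sketch: rough cycle accounting for the i486 pipeline figures quoted above.
# Cycle counts and percentages are read from the slides; the taken-jump
# fraction passed in by the caller is a HYPOTHETICAL input for illustration.

JUMP_TAKEN_CYCLES = 3          # slide: "Total of 3 cycles" when branch taken
JUMP_NOT_TAKEN_CYCLES = 1      # slide: "Total of 1 cycle" when not taken
D1_TWO_CYCLE_FRACTION = 0.06   # slide: about 6% of a "typical" DOS program
D2_TWO_CYCLE_FRACTION = 0.05   # slide: approx. 5% of executed instructions


def average_jump_cycles(taken_fraction: float) -> float:
    """Expected cycles per jump for a given fraction of taken jumps."""
    if not 0.0 <= taken_fraction <= 1.0:
        raise ValueError("taken_fraction must be in [0, 1]")
    return (taken_fraction * JUMP_TAKEN_CYCLES
            + (1.0 - taken_fraction) * JUMP_NOT_TAKEN_CYCLES)


def decode_stall_cycles_per_instruction() -> float:
    """Extra cycles per instruction from two-cycle D1 and D2 decodes.

    Assumes one extra cycle per affected instruction and that the D1 and
    D2 cases simply add up (a simplification, not a claim from the slides).
    """
    return D1_TWO_CYCLE_FRACTION + D2_TWO_CYCLE_FRACTION


if __name__ == "__main__":
    for taken in (0.0, 0.5, 1.0):
        print(f"taken={taken:.1f}: {average_jump_cycles(taken):.2f} cycles/jump")
    print(f"decode stalls: {decode_stall_cycles_per_instruction():.2f} cycles/instr")
```

Under these assumptions the decode stages add roughly a tenth of a cycle per instruction, while the cost of a jump moves linearly between 1 and 3 cycles with the taken fraction.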
Comparison to 386

Cycles Per Instruction

Instruction Type  386 Cycles  486 Cycles
Load              4           1
Store             2           1
ALU               2           1
Jump taken        9           3
Jump not taken    3           1
Call              9           3

Reasons for Improvement
  - On-chip cache
      - Faster loads and stores
  - More pipelining

Pentium Block Diagram

[Figure: block diagram; visible label: Memory Data Bus. Source: Microprocessor Report, 10/28/92]

Pentium Pipeline

Fetch and Align Instruction
Decode Instr., Generate Control Word

U-Pipe                        V-Pipe
Decode Control Word,          Decode Control Word,
  Generate Memory Address       Generate Memory Address
Access data cache or          Access data cache or
  calculate ALU result          calculate ALU result
Write register result         Write register result

Superscalar Execution

Can Execute Instructions I1 and I2 in Parallel if:
  - Both are "simple" instructions
      - Don't require microcode sequencing
      - Some operations require U-pipe resources
      - 90% of SpecInt instructions
  - I1 is not a jump
  - Destination of I1 not source of I2
      - But can handle I1 setting CC and I2 being cond. jump
  - Destination of I1 not destination of I2
If Conditions Don't Hold
  - Issue I1 to U-Pipe
  - I2 issued on next cycle
      - Possibly paired with following instruction

Branch Prediction

Branch Target Buffer
  - Stores information about previously executed branches
      - Indexed by instruction address
      - Specifies branch destination and whether or not taken
  - 256 entries
Branch Processing
  - Look for instruction in BTB
  - If found, start fetching at destination
  - Branch condition resolved early in WB
      - If prediction correct, no branch penalty
      - If prediction incorrect, lose 3 cycles
          - Which corresponds to 3 instructions
  - Update BTB

Superscalar Terminology

Basic
  - Superscalar: able to issue > 1 instruction / cycle
  - Superpipelined: deep, but not superscalar, pipeline
      - E.g., MIPS R5000 has 8 stages
  - Branch prediction: logic to guess whether or not branch will be taken, and possibly branch target
Advanced
  - Out-of-order: able to issue instructions out of program order
  - Speculation: execute instructions beyond branch points, possibly nullifying later
  - Register renaming: able to
      dynamically assign physical registers to instructions
  - Retire unit: logic to keep track of instructions as they complete

Superscalar Execution Example

Assumptions
  - Single FP adder takes 2 cycles
  - Single FP multiplier takes 5 cycles
  - Can issue add and multiply together
  - Must issue in-order

v: addt f2, f4, f10
w: mult f10, f6, f10
x: addt f10, f8, f12
y: addt f4, f6, f4
z: addt f4, f8, f10

Critical Path = 9 cycles

[Figure: data flow graph over registers f2, f4, f6, f8, f10, f12 with nodes v, w, x, y, z, and an in-order issue timeline; annotations: "Single adder, data dependence", "In order"]
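
The 9-cycle critical path stated in the example above can be checked by following true (read-after-write) dependences through the instruction sequence. The sketch below is illustrative only: the operand listing is an assumed reading of the slide (Alpha-style `op src1, src2, dest`), the latencies come from the stated assumptions, and the function names are hypothetical. It ignores issue-order and functional-unit limits, so it gives the data-flow lower bound rather than an in-order schedule.

```python
# Sketch: critical path of the five-instruction example via true (RAW)
# dependences. The operand listing is an ASSUMED reading of the slide
# (Alpha-style "op src1, src2, dest"); the latencies come from the slide.

LATENCY = {"addt": 2, "mult": 5}  # FP adder: 2 cycles, FP multiplier: 5 cycles

# (label, opcode, source registers, destination register)
PROGRAM = [
    ("v", "addt", ("f2", "f4"), "f10"),
    ("w", "mult", ("f10", "f6"), "f10"),
    ("x", "addt", ("f10", "f8"), "f12"),
    ("y", "addt", ("f4", "f6"), "f4"),
    ("z", "addt", ("f4", "f8"), "f10"),
]


def earliest_finish_times(program):
    """Earliest finish cycle of each instruction with unlimited hardware.

    Only read-after-write dependences are honored: an instruction starts
    once the most recent earlier writer of each source has finished.
    """
    ready = {}   # register -> cycle at which its latest value is available
    finish = {}  # instruction label -> finish cycle
    for label, op, sources, dest in program:
        start = max((ready.get(reg, 0) for reg in sources), default=0)
        finish[label] = start + LATENCY[op]
        ready[dest] = finish[label]
    return finish


def critical_path_length(program):
    """Length in cycles of the longest dependence chain."""
    return max(earliest_finish_times(program).values())


if __name__ == "__main__":
    print(earliest_finish_times(PROGRAM))
    print("critical path:", critical_path_length(PROGRAM), "cycles")
```

With this reading of the operands, the chain v, w, x (2 + 5 + 2 cycles) sets the critical path, while y and z form an independent 4-cycle chain.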
