Berkeley COMPSCI 252 - Lecture 18 - D2473125

School name University of California, Berkeley

Course Compsci 252- Graduate Computer Architecture

Pages 29

CS252 Graduate Computer Architecture
Lecture 18: ILP and Dynamic Execution #3: Examples (Pentium III, Pentium 4, IBM AS/400)
April 4, 2001
Prof. David A. Patterson
Computer Science 252, Spring 2001
CS252/Patterson  Lec 18.1  4/4/01

• Review: Dynamic Branch Prediction
• Review: Limits of ILP
• Dynamic Scheduling in P6 (Pentium Pro, II, III)
• Dynamic Scheduling in P6
• P6 Pipeline
• P6 Block Diagram
• Pentium III Die Photo
• P6 Performance: Stalls at decode stage (I$ misses or lack of RS/Reorder buf. entry)
• P6 Performance: uops/x86 instr (200 MHz, 8KI$/8KD$/256KL2$, 66 MHz bus)
• P6 Performance: Branch Mispredict Rate
• P6 Performance: Speculation rate (% instructions issued that do not commit)
• P6 Performance: Cache Misses/1k instr
• P6 Performance: uops commit/clock
• P6 Dynamic Benefit? Sum of parts CPI vs. Actual CPI
• Administratrivia
• AMD Athlon
• Pentium 4
• Pentium 4 features
• Pentium, Pentium Pro, Pentium 4 Pipeline
• Block Diagram of Pentium 4 Microarchitecture
• Pentium 4 Die Photo
• Benchmarks: Pentium 4 v. PIII v. Athlon
• Why?
• Another Approach: Multithreaded Execution for Servers
• Multithreaded Example: IBM AS/400
• Slide 27
• Simultaneous Multithreading (SMT)
• SMT is coming

Review: Dynamic Branch Prediction
• Prediction becoming important part of scalar execution
• Branch History Table: 2 bits for loop accuracy
• Correlation: Recently executed branches correlated with next branch
  – Either different branches
  – Or different executions of same branches
• Tournament Predictor: more resources to competitive solutions and pick between them
• Branch Target Buffer: include branch address & prediction
• Predicated Execution can reduce number of branches, number of mispredicted branches
• Return address stack for prediction of indirect jump

Review: Limits of ILP
• 1985-2000: 1000X performance
  – Moore’s Law transistors/chip => Moore’s Law for Performance/MPU
• Hennessy: industry has been following a roadmap of ideas known in 1985 to exploit Instruction Level Parallelism to get 1.55X/year
  – Caches, Pipelining, Superscalar, Branch Prediction, Out-of-order execution, …
• ILP limits: To make performance progress in future, need to have explicit parallelism from programmer vs. implicit parallelism of ILP exploited by compiler, HW?
  – Otherwise drop to old rate of 1.3X per year?
  – Less because of processor-memory performance gap?
• Impact on you: if you care about performance, better think about explicitly parallel algorithms vs. rely on ILP?

Dynamic Scheduling in P6 (Pentium Pro, II, III)
• Q: How to pipeline 1- to 17-byte 80x86 instructions?
• P6 doesn’t pipeline 80x86 instructions
• P6 decode unit translates the Intel instructions into 72-bit micro-operations (~ MIPS)
• Sends micro-operations to reorder buffer & reservation stations
• Many instructions translate to 1 to 4 micro-operations
• Complex 80x86 instructions are executed by a conventional microprogram (8K x 72 bits) that issues long sequences of micro-operations
• 14 clocks in total pipeline (~ 3 state machines)

Dynamic Scheduling in P6

Parameter                              80x86    microops
Max. instructions issued/clock         3        6
Max. instr. complete exec./clock                5
Max. instr. committed/clock                     3
Window (Instrs in reorder buffer)               40
Number of reservation stations                  20
Number of rename registers                      40
No. integer functional units (FUs)              2
No. floating point FUs                          1
No. SIMD Fl. Pt. FUs                            1
No. memory FUs                                  1 load + 1 store

P6 Pipeline
• 14 clocks in total (~3 state machines)
• 8 stages are used for in-order instruction fetch, decode, and issue
  – Takes 1 clock cycle to determine length of 80x86 instructions + 2 more to create the micro-operations (uops)
• 3 stages are used for out-of-order execution in one of 5 separate functional units
• 3 stages are used for instruction commit
[Figure: pipeline diagram. Boxes: Instr Fetch (16B/clk), Instr Decode (3 Instr/clk), Renaming (3 uops/clk), Execution units (5), Graduation (3 uops/clk), Reserv. Station, Reorder Buffer. Link labels: 16B, 6 uops.]

P6 Block Diagram
• IP = PC
[Figure: P6 block diagram. From: http://www.digit-life.com/articles/pentium4/]

Pentium III Die Photo
• EBL/BBL - Bus logic, Front, Back
• MOB - Memory Order Buffer
• Packed FPU - MMX Fl. Pt. (SSE)
• IEU - Integer Execution Unit
• FAU - Fl. Pt. Arithmetic Unit
• MIU - Memory Interface Unit
• DCU - Data Cache Unit
• PMH - Page Miss Handler
• DTLB - Data TLB
• BAC - Branch Address Calculator
• RAT - Register Alias Table
• SIMD - Packed Fl. Pt.
• RS - Reservation Station
• BTB - Branch Target Buffer
• IFU - Instruction Fetch Unit (+I$)
• ID - Instruction Decode
• ROB - Reorder Buffer
• MS - Micro-instruction Sequencer
1st Pentium III, Katmai: 9.5 M transistors, 12.3 * 10.4 mm in 0.25-micron with 5 layers of aluminum

P6 Performance: Stalls at decode stage
I$ misses or lack of RS/Reorder buf. entry
[Figure: bar chart of stall cycles per instruction (axis 0 to 3) for wave5, fpppp, apsi, turb3d, applu, mgrid, hydro2d, su2cor, swim, tomcatv, vortex, perl, ijpeg, li, compress, gcc, m88ksim, go. Series: Instruction stream, Resource capacity stalls.]
0.5 to 2.5 stall cycles per instruction: 0.98 avg. (0.36 integer)

P6 Performance: uops/x86 instr
200 MHz, 8KI$/8KD$/256KL2$, 66 MHz bus
[Figure: bar chart of uops per x86 instruction (axis 1 to 1.7) for the same benchmarks.]
1.2 to 1.6 uops per IA-32 instruction: 1.36 avg. (1.37 integer)

P6 Performance: Branch Mispredict Rate
[Figure: bar chart (axis 0% to 45%) for the same benchmarks. Series: BTB miss frequency, Mispredict frequency.]
10% to 40% Miss/Mispredict ratio: 20% avg. (29% integer)

P6 Performance: Speculation rate
(% instructions issued that do not commit)
[Figure: bar chart (axis 0% to 60%) for the same benchmarks.]
1% to 60% instructions do not commit: 20% avg. (30% integer)

P6 Performance: Cache Misses/1k instr
[Figure: bar chart (axis 0 to 160) for the same benchmarks. Series: L1 Instruction, L1 Data, L2.]
10 to 160 Misses per Thousand Instructions: 49 avg. (30 integer)

P6 Performance: uops commit/clock
[Figure: bar chart (axis 0% to 80%) of the share of clock cycles in which 0, 1, 2, or 3 uops commit.]
Average: 0: 55%, 1: 13%, 2: 8%, 3: 23%
Integer: 0: 40%, 1: 21%, 2: 12%, 3: 27%
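
As a rough cross-check of the uops commit/clock percentages above, the two distributions can be collapsed into a mean number of uops committed per clock. The sketch below is in Python. The names AVERAGE, INTEGER, and mean_uops_per_clock are illustrative and do not come from the lecture, and dividing by the sum of the percentages (the Average row adds up to 99%, presumably because of rounding on the slide) is an assumption.

```python
# Share of clock cycles (in percent) in which 0, 1, 2, or 3 uops commit,
# copied from the "uops commit/clock" slide.
AVERAGE = {0: 55, 1: 13, 2: 8, 3: 23}   # adds up to 99, not 100
INTEGER = {0: 40, 1: 21, 2: 12, 3: 27}  # adds up to 100


def mean_uops_per_clock(dist):
    """Weighted mean of uops committed per clock.

    Assumption: the percentages are renormalized by their own sum,
    so a row that adds up to 99 is still treated as a full distribution.
    """
    total = sum(dist.values())
    weighted = sum(uops * pct for uops, pct in dist.items())
    return weighted / total


if __name__ == "__main__":
    print(f"average: {mean_uops_per_clock(AVERAGE):.2f} uops/clock")
    print(f"integer: {mean_uops_per_clock(INTEGER):.2f} uops/clock")
```

Under these assumptions the mean works out to roughly 0.99 uops per clock over all benchmarks and 1.26 for the integer programs, well below the limit of 3 uops committed per clock listed for the P6.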
