Berkeley COMPSCI 252 - Lecture Notes

CS252 Graduate Computer Architecture
Lecture 18: ILP and Dynamic Execution #3: Examples (Pentium III, Pentium 4, IBM AS/400)
April 4, 2001
Prof. David A. Patterson
Computer Science 252
Spring 2001
CS252/Culler Lec 19.1 4/3/01

Slide titles:
Review: Dynamic Branch Prediction
Review: Limits of ILP
Pentium III Die Photo
Dynamic Scheduling in P6 (Pentium Pro, II, III)
Dynamic Scheduling in P6
P6 Pipeline
P6 Block Diagram
Why does a P6 Stall?
PPro Performance: Stalls at decode stage (I$ misses or lack of RS/Reorder buf. entry)
PPro Performance: uops/x86 instr (200 MHz, 8KI$/8KD$/256KL2$, 66 MHz bus)
Why do few u-ops per inst?
P6 Performance: Branch Mispredict Rate
P6 Performance: Speculation rate (% instructions issued that do not commit)
PPro Performance: Cache Misses/1k instr
PPro Performance: uops commit/clock
PPro Dynamic Benefit? Sum of parts CPI vs. Actual CPI
Administratrivia
AMD Athlon
Pentium 4
Pentium 4 features
Pentium, Pentium Pro, Pentium 4 Pipeline
Block Diagram of Pentium 4 Microarchitecture
Pentium 4 Die Photo
Benchmarks: Pentium 4 v. PIII v. Athlon
Why?
Another Approach: Multithreaded Execution for Servers
Multithreaded Example: IBM AS/400
Slide 29
Simultaneous Multithreading (SMT)
SMT is coming
Hyperthreading

Review: Dynamic Branch Prediction
• Prediction is becoming an important part of scalar execution
• Branch History Table: 2 bits for loop accuracy
• Correlation: Recently executed branches are correlated with the next branch
  – Either different branches
  – Or different executions of the same branch
• Tournament Predictor: give more resources to competing predictors and pick between them
• Branch Target Buffer: includes branch address & prediction
• Predicated Execution can reduce the number of branches and the number of mispredicted branches
• Return address stack for prediction of indirect jumps

Review: Limits of ILP
• 1985-2000: 1000X performance
  – Moore's Law transistors/chip => Moore's Law for Performance/MPU
• Hennessy: industry has been following a roadmap of ideas known in 1985 to exploit Instruction Level Parallelism to get 1.55X/year
  – Caches, Pipelining, Superscalar, Branch Prediction, Out-of-order execution, …
• ILP limits: To make performance progress in the future, do we need explicit parallelism from the programmer vs. implicit parallelism of ILP exploited by compiler, HW?
  – Otherwise drop to old rate of 1.3X per year?
  – Less because of processor-memory performance gap?
• Impact on you: if you care about performance, better to think about explicitly parallel algorithms vs. relying on ILP?

Pentium III Die Photo
• EBL/BBL - Bus logic, Front, Back
• MOB - Memory Order Buffer
• Packed FPU - MMX Fl. Pt. (SSE)
• IEU - Integer Execution Unit
• FAU - Fl. Pt. Arithmetic Unit
• MIU - Memory Interface Unit
• DCU - Data Cache Unit
• PMH - Page Miss Handler
• DTLB - Data TLB
• BAC - Branch Address Calculator
• RAT - Register Alias Table
• SIMD - Packed Fl. Pt.
• RS - Reservation Station
• BTB - Branch Target Buffer
• IFU - Instruction Fetch Unit (+I$)
• ID - Instruction Decode
• ROB - Reorder Buffer
• MS - Micro-instruction Sequencer
1st Pentium III, Katmai: 9.5 M transistors, 12.3 * 10.4 mm in 0.25-micron with 5 layers of aluminum

Dynamic Scheduling in P6 (Pentium Pro, II, III)
• Q: How to pipeline 1- to 17-byte 80x86 instructions?
• P6 doesn't pipeline 80x86 instructions
• P6 decode unit translates the Intel instructions into 72-bit micro-operations (~ MIPS)
• Sends micro-operations to reorder buffer & reservation stations
• Many instructions translate to 1 to 4 micro-operations
• Complex 80x86 instructions are executed by a conventional microprogram (8K x 72 bits) that issues long sequences of micro-operations
• 14 clocks in total pipeline (~ 3 state machines)

Dynamic Scheduling in P6
Parameter                              Value
Max. instructions issued/clock         3 (80x86), 6 (microops)
Max. instr. complete exec./clock       5
Max. instr. committed/clock            3
Window (instrs in reorder buffer)      40
Number of reservation stations         20
Number of rename registers             40
No. integer functional units (FUs)     2
No. floating point FUs                 1
No. SIMD Fl. Pt. FUs                   1
No. memory FUs                         1 load + 1 store

P6 Pipeline
• 14 clocks in total (~3 state machines)
• 8 stages are used for in-order instruction fetch, decode, and issue
  – Takes 1 clock cycle to determine length of 80x86 instructions + 2 more to create the micro-operations (uops)
• 3 stages are used for out-of-order execution in one of 5 separate functional units
• 3 stages are used for instruction commit
[Figure: pipeline diagram. Stages: Instr Fetch (16B/clk), Instr Decode (3 instr/clk), Renaming (3 uops/clk), Execution units (5), Graduation (3 uops/clk). Other labels: 16B, 6 uops, Reserv. Station, Reorder Buffer]

P6 Block Diagram
• IP = PC
From: http://www.digit-life.com/articles/pentium4/

Why does a P6 Stall?

PPro Performance: Stalls at decode stage
I$ misses or lack of RS/Reorder buf. entry
[Chart: stall cycles per instruction (axis 0 to 3) for wave5, fpppp, apsi, turb3d, applu, mgrid, hydro2d, su2cor, swim, tomcatv, vortex, perl, ijpeg, li, compress, gcc, m88ksim, go. Series: Instruction stream, Resource capacity stalls]
0.5 to 2.5 Stall cycles per instruction: 0.98 avg. (0.36 integer)

PPro Performance: uops/x86 instr
200 MHz, 8KI$/8KD$/256KL2$, 66 MHz bus
[Chart: uops per x86 instruction (axis 1 to 1.7) for the same benchmarks]
1.2 to 1.6 uops per IA-32 instruction: 1.36 avg. (1.37 integer)

Why do few u-ops per inst?

P6 Performance: Branch Mispredict Rate
[Chart: rate (axis 0% to 45%) for the same benchmarks. Series: BTB miss frequency, Mispredict frequency]
10% to 40% Miss/Mispredict ratio: 20% avg. (29% integer)
Can you estimate the speculation rate?
512 entry BTB

P6 Performance: Speculation rate (% instructions issued that do not commit)
[Chart: rate (axis 0% to 60%) for the same benchmarks]
1% to 60% instructions do not commit: 20% avg (30% integer)

PPro Performance: Cache Misses/1k instr
[Chart: misses per thousand instructions (axis 0 to 160) for the same benchmarks. Series: L1 Instruction, L1 Data, L2]
10 to 160 Misses per Thousand Instructions: 49 avg (30 integer)
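
The review slide's "Branch History Table: 2 bits for loop accuracy" can be illustrated with a small simulation. This is a minimal sketch, not the P6 predictor: the class name `SaturatingCounterPredictor`, the helper `count_mispredicts`, the 1024-entry table size, the `pc >> 2` indexing, and the 9-taken-then-1-not-taken loop pattern are all assumptions chosen for illustration.

```python
class SaturatingCounterPredictor:
    """Branch history table of n-bit saturating counters (hypothetical sketch)."""

    def __init__(self, bits=2, entries=1024):
        self.max_value = (1 << bits) - 1      # 3 for a 2-bit counter
        self.threshold = 1 << (bits - 1)      # predict taken when counter >= threshold
        self.entries = entries                # table size is an arbitrary assumption
        # Start every counter just below the threshold ("weakly not taken").
        self.table = [self.threshold - 1] * entries

    def _index(self, pc):
        # Assumed indexing: drop the low 2 bits of the branch address, then wrap.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        return self.table[self._index(pc)] >= self.threshold

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(self.max_value, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


def count_mispredicts(predictor, pc, outcomes):
    """Run one branch through the predictor and count wrong predictions."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


# A loop branch: taken 9 times, then falls through once; the loop runs 10 times.
loop_branch = ([True] * 9 + [False]) * 10

one_bit = count_mispredicts(SaturatingCounterPredictor(bits=1), 0x400, loop_branch)
two_bit = count_mispredicts(SaturatingCounterPredictor(bits=2), 0x400, loop_branch)
print("1-bit mispredicts:", one_bit)
print("2-bit mispredicts:", two_bit)
```

Under these assumptions the 1-bit scheme mispredicts 20 of the 100 branches (at both the exit and the re-entry of every loop), while the 2-bit scheme mispredicts 11: once at each loop exit, plus once while the counter warms up.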

