Berkeley COMPSCI 252 - Lecture Notes

CS252 Graduate Computer Architecture
Lecture 18: ILP and Dynamic Execution #3: Examples (Pentium III, Pentium 4, IBM AS/400)
April 4, 2001. Prof. David A. Patterson, Computer Science 252, Spring 2001.

Review: Dynamic Branch Prediction
- Prediction is becoming an important part of scalar execution.
- Branch History Table: 2 bits for loop accuracy (a minimal 2-bit counter sketch follows the P6 pipeline notes below).
- Correlation: recently executed branches are correlated with the next branch, either different branches or different executions of the same branch.
- Tournament predictor: give more resources to competing predictors and pick between them.
- Branch Target Buffer: include the branch address and the prediction.
- Predicated execution can reduce the number of branches and the number of mispredicted branches.
- Return address stack for prediction of indirect jumps.

Review: Limits of ILP
- 1985-2000: 1000X performance; Moore's Law for transistors/chip became Moore's Law for performance/MPU.
- Hennessy: industry has been following a roadmap of ideas known in 1985 to exploit instruction-level parallelism and get 1.55X/year: caches, pipelining, superscalar, branch prediction, out-of-order execution.
- ILP limits: to keep making performance progress, future designs need explicit parallelism from the programmer, vs. the implicit parallelism of ILP exploited by the compiler and hardware.
- Otherwise performance drops back to the old rate of 1.3X per year, or less because of the processor-memory performance gap.
- Impact on you: if you care about performance, better to think about explicitly parallel algorithms than to rely on ILP.

Pentium III Die Photo
- 1st Pentium III ("Katmai"): 9.5M transistors, 12.3 x 10.4 mm in a 0.25-micron process with 5 layers of aluminum.
- Die-photo labels: EBL/BBL = bus logic (front, back); MOB = Memory Order Buffer; Packed FPU = MMX Fl. Pt. (SSE); IEU = Integer Execution Unit; FAU = Fl. Pt. Arithmetic Unit; MIU = Memory Interface Unit; DCU = Data Cache Unit; PMH = Page Miss Handler; DTLB = Data TLB; BAC = Branch Address Calculator; RAT = Register Alias Table; SIMD = Packed Fl. Pt.; RS = Reservation Station; BTB = Branch Target Buffer; IFU = Instruction Fetch Unit (with I-cache); ID = Instruction Decode; ROB = Reorder Buffer; MS = Micro-instruction Sequencer.

Dynamic Scheduling in P6 (Pentium Pro, II, III)
- Q: How do you pipeline 1- to 17-byte 80x86 instructions?
- The P6 does not pipeline 80x86 instructions directly. The decode unit translates the Intel instructions into 72-bit micro-operations (similar to MIPS instructions) and sends them to the reorder buffer and reservation stations.
- Many instructions translate to 1 to 4 micro-operations.
- Complex 80x86 instructions are executed by a conventional microprogram (8K x 72 bits) that issues long sequences of micro-operations.
- 14 clocks in the total pipeline (roughly 3 state machines).

Dynamic Scheduling in P6: Parameters

  Parameter                                  80x86   microops
  Max. instructions issued/clock             3       6
  Max. instructions completing exec./clock           5
  Max. instructions committed/clock                  3
  Window (instructions in reorder buffer)            40
  Number of reservation stations                     20
  Number of rename registers                         40
  No. of integer functional units (FUs)              2
  No. of floating-point FUs                          1
  No. of SIMD Fl. Pt. FUs                            1
  No. of memory FUs                                  1 load + 1 store

P6 Block Diagram
- IP = PC (Intel's "instruction pointer" is the program counter).
- [Block diagram figure; from http://www.digitlife.com/articles/pentium4]

P6 Pipeline
- 14 clocks in total (roughly 3 state machines).
- 8 stages are used for in-order instruction fetch, decode, and issue: it takes 1 clock cycle to determine the length of an 80x86 instruction, plus 2 more to create the micro-operations (uops).
- 3 stages are used for out-of-order execution in one of 5 separate functional units.
- 3 stages are used for instruction commit.
- Pipeline flow: Instr Fetch (16B/clock) -> Instr Decode (3 instructions/clock, up to 6 uops) -> Renaming (3 uops/clock) -> Reservation Stations -> Execution units (5) -> Reorder Buffer -> Graduation (3 uops/clock).
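The "2 bits for loop accuracy" point refers to 2-bit saturating counters: a loop-closing branch is mispredicted only once at loop exit instead of twice. Below is a minimal sketch of such a branch history table in C; the table size, PC indexing, and state encoding are illustrative assumptions, not the organization of any particular processor's predictor.

```c
/* Minimal sketch of a 2-bit saturating-counter Branch History Table.
 * Table size, indexing, and state encoding are illustrative assumptions. */
#include <stdint.h>
#include <stdbool.h>

#define BHT_ENTRIES 1024            /* assumed table size (power of two) */

/* 2-bit states: 0 = strong not-taken, 1 = weak not-taken,
 *               2 = weak taken,       3 = strong taken */
static uint8_t bht[BHT_ENTRIES];

static unsigned bht_index(uint32_t pc) {
    return (pc >> 2) & (BHT_ENTRIES - 1);   /* drop byte offset, mask into table */
}

bool predict_taken(uint32_t pc) {
    return bht[bht_index(pc)] >= 2;         /* predict from the counter's high bit */
}

void train(uint32_t pc, bool taken) {
    uint8_t *ctr = &bht[bht_index(pc)];
    if (taken)  { if (*ctr < 3) (*ctr)++; } /* saturate at strong taken */
    else        { if (*ctr > 0) (*ctr)--; } /* saturate at strong not-taken */
}
```

A tournament predictor in the same spirit would keep two such predictors plus a per-entry chooser counter that is trained toward whichever predictor was correct on each branch.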
PPro Performance: Stalls at the Decode Stage ("Why does a P6 stall?")
- Stalls at the decode stage come from I-cache misses or from lack of a reservation station or reorder buffer entry.
- Chart over the SPEC benchmarks (go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5), split into instruction-stream stalls and resource capacity stalls.
- 0.5 to 2.5 stall cycles per instruction: 0.98 avg. (0.36 integer).
- Measured system: 200 MHz, 8K I-cache / 8K D-cache / 256K L2, 66 MHz bus.

PPro Performance: uops per x86 Instruction ("Why so few uops per instruction?")
- Chart over the same benchmarks.
- 1.2 to 1.6 uops per IA-32 instruction: 1.36 avg. (1.37 integer).

P6 Performance: Branch Mispredict Rate
- 512-entry BTB; chart shows BTB miss frequency vs. mispredict frequency per benchmark.
- 10% to 40% miss/mispredict ratio: 20% avg. (29% integer).
- Can you estimate the speculation rate?

P6 Performance: Speculation Rate (% of instructions issued that do not commit)
- 1% to 60% of instructions do not commit: 20% avg. (30% integer).

PPro Performance: Cache Misses per 1K Instructions
- Chart shows L1 instruction, L1 data, and L2 misses per benchmark.
- 10 to 160 misses per thousand instructions: 49 avg. (30 integer).

PPro Performance: uops Committed per Clock
- Fraction of cycles in which 0, 1, 2, or 3 uops commit.
- Average: 0 uops 55%, 1 uop 13%, 2 uops 8%, 3 uops 23%.
- Integer: 0 uops 40%, 1 uop 21%, 2 uops 12%, 3 uops 27%.

PPro Dynamic Benefit? Sum-of-Parts CPI vs. Actual CPI
- CPI components per benchmark: uops, instruction cache stalls, resource capacity stalls, branch mispredict penalty, data cache stalls, compared against the actual CPI.
- 0.8 to 3.8 clock cycles per instruction: 1.68 avg. (1.16 integer).
- Ratio of the sum of the parts to the actual CPI: 1.38X avg. (1.29X integer). (A back-of-the-envelope calculation with these averages appears after the notes below.)

Administrivia
- 6.9 should have been in today's reading.
- Term project presentation format.

Pentium 4 / AMD Athlon
- AMD Athlon: similar to the P6 microarchitecture (Pentium III), but with more resources. Transistors: PIII 24M vs. Athlon 37M; die size: 106 mm2 vs. 117 mm2 ...
- Pentium 4: still translates from 80x86 to micro-ops; the P4 has a better branch predictor and more FUs ...
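As a back-of-the-envelope check of the dynamic-benefit numbers, the short C program below recomputes two quantities directly from the averages quoted above: the implied sum-of-parts CPI (actual CPI times the 1.38X ratio) and the average number of uops committed per clock from the commit-width distribution. The constants are the slides' reported averages; this is illustrative arithmetic, not a measurement.

```c
/* Illustrative arithmetic using the averages quoted in the PPro performance
 * slides above; nothing here is measured. */
#include <stdio.h>

int main(void) {
    const double actual_cpi  = 1.68;  /* avg. clock cycles per x86 instruction */
    const double parts_ratio = 1.38;  /* avg. (sum-of-parts CPI) / (actual CPI) */

    /* Sum-of-parts CPI implied by the ratio, and the cycles per instruction
     * that out-of-order execution hides by overlapping stalls with work. */
    double sum_of_parts = actual_cpi * parts_ratio;   /* about 2.3 CPI */
    double hidden       = sum_of_parts - actual_cpi;  /* about 0.6 CPI overlapped */

    /* Average uops committed per clock, from the commit-width distribution:
     * 0 uops on 55% of cycles, 1 on 13%, 2 on 8%, 3 on 23%. */
    double uops_per_clock = 0*0.55 + 1*0.13 + 2*0.08 + 3*0.23;  /* about 0.98 */

    printf("sum-of-parts CPI     ~ %.2f\n", sum_of_parts);
    printf("CPI hidden by OOO    ~ %.2f\n", hidden);
    printf("uops committed/clock ~ %.2f\n", uops_per_clock);
    return 0;
}
```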

