CS 152 Computer Architecture and Engineering
Lecture 22: Advanced Processors III
2004-11-18

Dave Patterson (www.cs.berkeley.edu/~patterson)
John Lazzaro (www.cs.berkeley.edu/~lazzaro)
www-inst.eecs.berkeley.edu/~cs152/

CS 152 L22: Advanced Processors III. UC Regents Fall 2004, UCB.


Last Time: Dynamic Scheduling

Each line of the reorder buffer holds the physical src1, src2, and dest registers for an instruction, and controls when it executes.

[Figure: reorder buffer with columns Inst, src1, src1 val, src2, src2 val, dest, dest val. A Load Unit (from memory), ALU 1, ALU 2, and a Store Unit (to memory) are connected by a Common Data Bus carrying (dest, dest val).]

The execution engine works on the physical registers, not the architecture registers.


Recall: Throughput and multiple threads

Goal: Use multiple instruction streams to improve
  (1) throughput of machines that run many programs,
  (2) execution time of multithreaded programs.

Example: Sun Niagara (32 instruction streams on a chip).

Difficulties: Gaining full advantage requires rewriting applications, OS, and libraries.

Ultimate limiter: Amdahl's law (application dependent) and memory system performance.


This Time: Throughput Computing

Multithreading: Interleave instructions from separate threads on the same hardware. Seen by the OS as several CPUs.

Multi-core: Integrating several processors that (partially) share a memory system on the same chip.

Also: A town meeting discussion on lessons learned from Lab 4.


Multi-Threading


Power 4 (predates Power 5 shown Tuesday)

Single-threaded predecessor to Power 5. 8 execution units in the out-of-order engine; each may issue an instruction each cycle.

[Figure 4: POWER4 instruction execution pipeline. Regions: instruction fetch; instruction crack and group formation; out-of-order processing. Arrows: branch redirects; interrupts and flushes. Stages shown: IF, IC, BP, D0, D1, D2, D3, Xfer, GD, MP, ISS, RF, then per-unit stages for BR (EX), LD/ST (EA, DC, Fmt), FX (EX), and FP (F6), followed by WB, Xfer, and CP.]


For most apps, most execution units lie idle

For an 8-way superscalar.

[Figure 2: Sources of all unused issue cycles in an 8-issue superscalar processor. Processor busy represents the utilized issue slots; all others represent wasted issue slots. Y axis: Percent of Total Issue Cycles (0 to 100). X axis: Applications (alvinn, doduc, eqntott, espresso, fpppp, hydro2d, li, mdljdp2, mdljsp2, nasa7, ora, su2cor, swm, tomcatv, composite). Legend: memory conflict, long fp, short fp, long integer, short integer, load delays, control hazards, branch misprediction, dcache miss, icache miss, dtlb miss, itlb miss, processor busy.]

From: Tullsen, Eggers, and Levy, "Simultaneous Multithreading: Maximizing On-chip Parallelism", ISCA 1995.

Observation: Most hardware in an out-of-order CPU concerns physical registers. Could several instruction threads share this hardware?


Simultaneous Multi-threading

[Figure: two issue-slot grids, "One thread, 8 units" and "Two threads, 8 units". Rows: cycles 1 to 9. Columns: M, M, FX, FX, FP, FP, BR, CC.]

M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes.


Administrivia: Big Game (Go Cal)

Thursday 11/18: Preliminary design document due, by 9 PM.
Friday 11/19: Review design document with TAs in lab section.
Sunday 11/21: Revised design document due in email, by 11:59 PM.
Friday 12/3: Demo deep pipeline in lab section.


Administrivia: Mid-term and Field Trip

Mid-Term II Review Session: Sunday 11/21, 7-9 PM, 306 Soda (no lecture Tuesday).
Mid-Term II: Tuesday 11/23, 5:30 to 8:30 PM, 101 Morgan. LaVal's at 9 PM.
Xilinx field trip: Tuesday 11/30, bus leaves at 8:30 AM from 4th floor Soda. Send Doug RSVP by 5 PM today.
Thursday 12/2: Advice on Presentations. Prepare you for your final project talk.


Multi-Threading (continued)


Power 4 and Power 5 pipelines

[Figure 4: POWER4 instruction execution pipeline, as on the earlier Power 4 slide.]

[Figure 3: Power5 instruction pipeline. Regions: instruction fetch; group formation and instruction decode; out-of-order processing. Arrows: branch redirects; interrupts and flushes. Rows: branch pipeline, load/store pipeline, fixed-point pipeline, floating-point pipeline. Slide annotation on Power 5: 2 commits (architected register sets). Legend: IF = instruction fetch, IC = instruction cache, BP = branch predict, D0 = decode stage 0, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register read, EX = execute, EA = (legend cut off in the source).]

Text excerpt shown alongside the POWER4 figure:

Load hit store: A younger load that executes before an older store to the same memory location has written its data to the caches must retrieve the data from the SDQ. As loads execute, they check the SRQ to see whether there is any older store to the same memory location with data in the SDQ. If one is found, the data is forwarded from the SDQ rather than from the cache. If the data cannot be forwarded (as is the case if the load and store instructions operate on overlapping memory locations and the load data is not the same as or contained within the store data), the group containing the load instruction is flushed; that is, it and all younger groups are discarded and refetched from the instruction cache.

Load hit load: Two loads to the same memory location must observe the memory reference order and prevent a store to the memory location from another processor between the intervening loads. If the younger load obtains old data, the older load must not obtain new data. This requirement is called sequential load consistency. To guard against this, LRQ entries for all loads include a bit which, if set, indicates that a snoop has occurred to the line containing the loaded data for that entry. When a load instruction executes, it compares its load address against all addresses in the LRQ. A match against a younger entry which has been snooped indicates that
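The load-hit-store rule in the POWER4 text (forward from the SDQ when the load data is contained within an older store's data, flush when the two only partially overlap) can be illustrated with a small model. This is a sketch under stated assumptions, not the POWER4 hardware: the `Store` record, the `check_load` function, byte-granular address ranges, and the policy of checking the youngest older store first are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Store:
    """Hypothetical SRQ/SDQ entry: a store whose data has not reached the cache."""
    age: int      # position in program order; smaller means older
    addr: int     # address of the first byte written
    data: bytes   # bytes waiting in the store data queue (SDQ)


def check_load(load_age: int, addr: int, size: int,
               srq: List[Store]) -> Tuple[str, Optional[bytes]]:
    """Return ("cache", None), ("forward", data), or ("flush", None)."""
    lo, hi = addr, addr + size
    # Only stores older than the load matter. Scanning youngest-first is an
    # assumption of this sketch, so the most recent older store wins.
    older = sorted((s for s in srq if s.age < load_age),
                   key=lambda s: s.age, reverse=True)
    for s in older:
        s_lo, s_hi = s.addr, s.addr + len(s.data)
        if hi <= s_lo or s_hi <= lo:
            continue  # no overlap with this store
        if s_lo <= lo and hi <= s_hi:
            # Load data is contained within the store data: forward from SDQ.
            return "forward", s.data[lo - s_lo:hi - s_lo]
        # Partial overlap: the data cannot be forwarded, so the group
        # containing the load would be flushed and refetched.
        return "flush", None
    return "cache", None


# Made-up queue contents for illustration.
srq = [Store(age=1, addr=0x100, data=b"\x11\x22\x33\x44"),
       Store(age=5, addr=0x200, data=b"\xaa\xbb")]

print(check_load(3, 0x102, 2, srq))  # contained in the age-1 store
print(check_load(3, 0x102, 4, srq))  # partially overlaps the age-1 store
print(check_load(3, 0x200, 2, srq))  # the age-5 store is younger than the load
```

The model treats the three outcomes as a return value; in the text, the flush case discards the load's group and all younger groups, which this sketch does not simulate.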
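The "One thread, 8 units" versus "Two threads, 8 units" comparison on the Simultaneous Multi-threading slide can be made concrete with a toy issue-slot model. This is a minimal sketch, not a description of Power 5: the unit mix is taken from the slide's column headings (M, M, FX, FX, FP, FP, BR, CC), while the function names, the fixed thread-priority policy, and the per-cycle instruction traces are invented for illustration.

```python
from collections import Counter
from typing import Dict, List

# Execution units from the slide: 2 load/store, 2 fixed point,
# 2 floating point, 1 branch, 1 condition code.
UNITS = Counter({"M": 2, "FX": 2, "FP": 2, "BR": 1, "CC": 1})


def issue_cycle(ready_per_thread: List[Dict[str, int]]) -> int:
    """Issue as many ready instructions as the free units allow in one cycle.

    ready_per_thread[i][unit] is how many instructions thread i has ready
    for that unit type. Threads are served in list order, which is an
    assumed and deliberately simple priority policy.
    """
    free = Counter(UNITS)
    issued = 0
    for ready in ready_per_thread:
        for unit, count in ready.items():
            take = min(count, free[unit])
            free[unit] -= take
            issued += take
    return issued


def utilization(trace: List[List[Dict[str, int]]]) -> float:
    """Fraction of all issue slots used over a trace of cycles."""
    total_slots = sum(UNITS.values()) * len(trace)
    used = sum(issue_cycle(cycle) for cycle in trace)
    return used / total_slots


# Made-up per-cycle ready instructions for two hypothetical threads.
thread_a = [{"M": 1, "FX": 2}, {"FP": 1}, {"BR": 1}, {}]
thread_b = [{"FP": 2, "M": 1}, {"FX": 1, "CC": 1}, {"M": 2}, {"FX": 3}]

one_thread = [[a] for a in thread_a]
two_threads = [[a, b] for a, b in zip(thread_a, thread_b)]

print(utilization(one_thread))   # 0.15625 (5 of 32 slots)
print(utilization(two_threads))  # 0.4375 (14 of 32 slots)
```

With these invented traces, the second thread fills slots the first leaves idle, which is the effect the slide's two grids illustrate. The last cycle also shows the limit: thread B has three fixed-point instructions ready but only two FX units exist.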


Berkeley COMPSCI 152 - Lecture 22 – Advanced Processors III
