CS 152 Computer Architecture and Engineering
Lecture 19: Advanced Processors III
2005-11-3
John Lazzaro (www.cs.berkeley.edu/~lazzaro)
TAs: David Marquardt and Udam Saini
www-inst.eecs.berkeley.edu/~cs152/
CS 152 L19: Advanced Processors III. UC Regents, Fall 2005, UCB.


Last Time: Dynamic Scheduling

Reorder buffer: each line holds physical <src1, src2, dest> registers for an instruction, and controls when it executes.

[Figure: reorder buffer with columns Inst, src1, src1 val, src2, src2 val, dest, dest val; Load Unit (from memory), ALU 1, ALU 2, Store Unit (to memory); common data bus carrying <reg, reg val>.]

Execution engine works on the physical registers, not the architecture registers.


Today: Throughput and multiple threads

Goal: Use multiple instruction streams to improve (1) throughput of machines that run many programs, (2) execution time of multithreaded programs.

Example: Sun Niagara (32 instruction streams on a chip).

Difficulties: Gaining full advantage requires rewriting applications, OS, libraries.

Ultimate limiter: Amdahl's law (application dependent), memory system performance.


Throughput Computing

Multithreading: Interleave instructions from separate threads on the same hardware. Seen by OS as several CPUs.

Multi-core: Integrating several processors that (partially) share a memory system on the same chip.


Multi-Threading (static pipelines)

Recall: Bypass network prevents stalls.

Instead of bypass: interleave threads on the pipeline to prevent stalls.

[Figure: 5-stage pipeline datapath (ID (Decode), EX, MEM, WB) with register file, ALU, data memory, sign extender, and the bypass mux fed from WB.]

One way is to interleave execution of instructions from different program threads on the same pipeline. Introduced in 1964 by Seymour Cray.

Interleave 4 threads, T1-T4, on non-bypassed 5-stage pipe:

                        t0 t1 t2 t3 t4 t5 t6 t7 t8 t9
T1: LW   r1, 0(r2)      F  D  X  M  W
T2: ADD  r7, r1, r4        F  D  X  M  W
T3: XORI r5, r4, 12           F  D  X  M  W
T4: SW   0(r7), r5               F  D  X  M  W
T1: LW   r5, 12(r1)                 F  D  X  M  W

4 CPUs, each run at 1/4 clock.

Last instruction in a thread always completes writeback before next instruction in same thread reads regfile.

Simple Multithreaded Pipeline (Krste, November 10, 2004, 6.823 L18-5)

[Figure: pipeline with four PCs and four GPR files, one per thread.]

Many variants ...


Multi-Threading (dynamic scheduling)

Power 4 (predates Power 5 shown Tuesday)

Single-threaded predecessor to Power 5. 8 execution units in out-of-order engine, each may issue an instruction each cycle.

[Figure 4: POWER4 instruction execution pipeline. Instruction fetch (IF, IC, BP); instruction crack and group formation (D0, D1, D2, D3, Xfer, GD); out-of-order processing (MP, ISS, RF, then per-pipe stages BR, LD/ST, FX, FP with EX, EA, DC, Fmt, F6, WB, Xfer); CP. Also shown: branch redirects, interrupts and flushes.]


For most apps, most execution units lie idle

For an 8-way superscalar.

[Figure: percent of total issue cycles (0 to 100) per application: alvinn, doduc, eqntott, espresso, fpppp, hydro2d, li, mdljdp2, mdljsp2, nasa7, ora, su2cor, swm, tomcatv, composite. Each bar is broken down into: processor busy, itlb miss, dtlb miss, icache miss, dcache miss, branch misprediction, control hazards, load delays, short integer, long integer, short fp, long fp, memory conflict.]

From: Tullsen, Eggers, and Levy, "Simultaneous Multithreading: Maximizing On-chip Parallelism", ISCA 1995.

Observation: Most hardware in an out-of-order CPU concerns physical registers. Could several instruction threads share this hardware?


Simultaneous Multi-threading ...

[Figure: issue-slot occupancy over cycles 1 to 9 for 8 units (M, M, FX, FX, FP, FP, BR, CC), shown once for "One thread, 8 units" and once for "Two threads, 8 units".]

M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes


Power 4 and Power 5 pipelines

[Figure: the POWER4 instruction execution pipeline (Figure 4) next to the Power 5 pipeline. Power 5 annotations: 2 fetch (PC), 2 initial decodes; 2 commits (architected register sets). Power 5 pipes: branch pipeline, load/store pipeline, fixed-point pipeline, floating-point pipeline. Legend: EA = compute address, DC = data caches, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, CP = group commit.]

Text captured with the pipeline figure:

Load hit store: A younger load that executes before an older store to the same memory location has written its data to the caches must retrieve the data from the SDQ. As loads execute, they check the SRQ to see whether there is any older store to the same memory location with data in the SDQ. If one is found, the data is forwarded from the SDQ rather than from the cache. If the data cannot be forwarded (as is the case if the load and store instructions operate on overlapping memory locations and the load data is not the same as or ...

Load hit load: Two loads to the same memory location must observe the memory reference order and prevent a store to the memory location from another processor between the intervening loads. If the younger load obtains old data, the older load must not obtain new data. This requirement is called sequential load consistency. To guard against this, LRQ entries for all loads include a bit which, if set, indicates that a snoop has occurred to the line containing the loaded data for that entry. When a load instruction executes, it ...


Power 5 data flow ...

[Figure: Power 5 data flow. Front end: program counter, alternate, branch prediction (branch history tables, return stack, target cache), instruction cache, instruction translation, instruction buffer 0, instruction buffer 1, thread priority. Middle: group formation, instruction decode, dispatch, dynamic instruction selection, shared issue queues. Shared execution units: LSU0, LSU1, FXU0, FXU1, FPU0, FPU1, BXU. Back end: data cache, data translation, store queue, group completion.]
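
The static-pipeline slide claims that with 4 threads interleaved on a non-bypassed 5-stage pipe, an instruction always completes writeback before the next instruction of the same thread reads the register file. The sketch below checks that claim by counting cycles. It is an illustration, not a model from the lecture: the function names are hypothetical, issue is assumed to be strict round-robin with one instruction per cycle, the register file is assumed to be read in D and written in W, and a value written in W is assumed not to be readable in that same cycle.

```python
# Minimal timing sketch of fixed round-robin thread interleaving on a
# 5-stage pipeline with no bypass network. All names are illustrative.

STAGES = ["F", "D", "X", "M", "W"]  # fetch, decode (regfile read), execute, memory, writeback


def schedule(n_threads, instrs_per_thread):
    """Return (thread, instr_index, fetch_cycle) tuples for strict
    round-robin issue: one instruction enters F every cycle."""
    out = []
    cycle = 0
    for i in range(instrs_per_thread):
        for t in range(n_threads):
            out.append((t, i, cycle))
            cycle += 1
    return out


def stage_cycle(fetch_cycle, stage):
    """Cycle in which an instruction fetched at fetch_cycle occupies stage."""
    return fetch_cycle + STAGES.index(stage)


def hazard_free(n_threads, instrs_per_thread=3):
    """True if, within every thread, each instruction's W stage is in a
    strictly earlier cycle than the next instruction's D stage.
    Assumption: a same-cycle write then read does NOT forward the value."""
    last_writeback = {}
    for thread, _, fetch in schedule(n_threads, instrs_per_thread):
        read_cycle = stage_cycle(fetch, "D")
        if thread in last_writeback and last_writeback[thread] >= read_cycle:
            return False
        last_writeback[thread] = stage_cycle(fetch, "W")
    return True


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 5):
        print(f"{n} thread(s): hazard free = {hazard_free(n)}")
```

Under these assumptions, T1's first instruction writes back in t4 and T1's second instruction decodes in t5, matching the slide's diagram. With only 3 threads the write and the read land in the same cycle, so the check fails unless the register file is assumed to write before it reads within a cycle.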
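
The "Today" slide names Amdahl's law as the ultimate, application-dependent limiter on what multiple instruction streams can deliver. As a rough illustration, the sketch below evaluates the standard Amdahl formula, speedup = 1 / ((1 - p) + p / n), for the 32 instruction streams the slide quotes for Sun Niagara. The function name and the sample parallel fractions are assumptions chosen for the example; they are not measurements from the lecture.

```python
# Amdahl's law: upper bound on speedup when only a fraction of the work
# can be spread across n instruction streams. Names are illustrative.

def amdahl_speedup(parallel_fraction: float, n_streams: int) -> float:
    """Speedup bound for a program whose parallelizable share of the
    single-stream run time is parallel_fraction, run on n_streams."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if n_streams < 1:
        raise ValueError("n_streams must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_streams)


if __name__ == "__main__":
    # Hypothetical parallel fractions, chosen only to show the trend.
    for p in (0.5, 0.9, 0.99):
        print(f"parallel fraction {p:.2f}: "
              f"bound on 32 streams = {amdahl_speedup(p, 32):.2f}x")
```

The bound ignores memory system performance, which the slide lists as a separate limiter, so real multithreaded speedups would be expected to fall below it.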


Berkeley COMPSCI 152 - Lecture 19 – Advanced Processors III
