CS 152 Computer Architecture and Engineering
Lecture 19: Advanced Processors III
2006-11-2
John Lazzaro (www.cs.berkeley.edu/~lazzaro)
TAs: Udam Saini and Jue Sun
www-inst.eecs.berkeley.edu/~cs152/

CS 152 L19: Advanced Processors III, UC Regents Fall 2006, UCB


Last Time: Dynamic Scheduling

[Figure 3: Power5 instruction pipeline. IF = instruction fetch, IC = instruction cache, BP = branch predict, D0 = decode stage 0, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data caches, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, CP = group commit. Labeled sections: instruction fetch, branch redirects, group formation and instruction decode, out-of-order processing, branch pipeline, load/store pipeline, fixed-point pipeline, floating-point pipeline, interrupts and flushes.]

Annotations on the figure:
  Fetch up to 8 instructions per cycle
  Dispatch up to 5 instructions per cycle
  Execute up to 8 instructions per cycle
  A thread may commit up to 5 instructions per cycle
  Up to 200 instructions in flight
  240 physical registers (120 int + 120 FP)


Today: Throughput and multiple threads

Goal: Use multiple instruction streams to improve
  (1) throughput of machines that run many programs
  (2) multi-threaded program execution time
Example: Sun Niagara (32 instruction streams on a chip)
Difficulties: Gaining full advantage requires rewriting applications, OS, libraries.
Ultimate limiter: Amdahl's law (application dependent). Memory system performance.


Throughput Computing

Multithreading: Interleave instructions from separate threads on the same hardware. Seen by OS as several CPUs.
Multi-core: Integrating several processors that (partially) share a memory system on the same chip.


Multi-Threading (Static Pipelines)

Recall: Bypass network prevents stalls.
Instead of bypass: Interleave threads on the pipeline to prevent stalls.

[Figure: 5-stage pipelined datapath (ID (Decode), EX, MEM, WB) with register file, ALU, data memory, and bypass muxes fed "From WB".]


Introduced in 1964 by Seymour Cray

One way is to interleave execution of instructions from different program threads on the same pipeline.
Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe:

                       t0 t1 t2 t3 t4 t5 t6 t7 t8 t9
  T1: LW   r1, 0(r2)   F  D  X  M  W
  T2: ADD  r7, r1, r4     F  D  X  M  W
  T3: XORI r5, r4, #12       F  D  X  M  W
  T4: SW   0(r7), r5            F  D  X  M  W
  T1: LW   r5, 12(r1)              F  D  X  M  W

The last instruction in a thread always completes writeback before the next instruction in the same thread reads the regfile.
Seen as 4 CPUs, each run at 1/4 clock.

[Figure: Simple Multithreaded Pipeline, with per-thread PCs and GPR files chosen by a thread select signal. Credit: Krste, November 10, 2004, 6.823 L18-5.]

Many variants ...


Multi-Threading (Dynamic Scheduling)

Power 4 (predates Power 5 shown Tuesday)

Single-threaded predecessor to Power 5.
8 execution units in the out-of-order engine; each may issue an instruction each cycle.

[Figure 4: POWER4 instruction execution pipeline. Labeled sections: instruction fetch, branch redirects, instruction crack and group formation, out-of-order processing (BR, LD/ST, FX, FP pipelines), interrupts and flushes.]


For most apps, most execution units lie idle

For an 8-way superscalar.

[Figure 2: Sources of all unused issue cycles in an 8-issue superscalar processor. Processor busy represents the utilized …
 Y axis: Percent of Total Issue Cycles (0 to 100). X axis: Applications (alvinn, doduc, eqntott, espresso, fpppp, hydro2d, li, mdljdp2, mdljsp2, nasa7, ora, su2cor, swm, tomcatv, composite).
 Legend: memory conflict, long fp, short fp, long integer, short integer, load delays, control hazards, branch misprediction, dcache miss, icache miss, dtlb miss, itlb miss, processor busy.]

From: Tullsen, Eggers, and Levy, "Simultaneous Multithreading: Maximizing On-chip Parallelism", ISCA 1995.

Observation: Most hardware in an out-of-order CPU concerns physical registers. Could several instruction threads share this hardware?


Simultaneous Multi-threading

[Figure: issue-slot occupancy per cycle (cycles 1 to 9) across 8 units (M, M, FX, FX, FP, FP, BR, CC), shown for two cases: "One thread, 8 units" and "Two threads, 8 units".]

M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes


Power 4 and Power 5

[Figure: POWER4 instruction execution pipeline (Figure 4) shown above the Power 5 pipeline. Power 5 annotations: 2 fetch (PC), 2 initial decodes; 2 commits, architected register sets.]

Text visible around the figures:

Load hit store: A younger load that executes before an older store to the same memory location has written its data to the caches must retrieve the data from the SDQ. As loads execute, they check the SRQ to see whether there is any older store to the same memory location with data in the SDQ. If one is found, the data is forwarded from the SDQ rather than from the cache. If the data cannot be forwarded (as is the case if the load and store instructions operate on overlapping memory locations and the load data is not the same as or contained within the store data), the group containing …

Load hit load: Two loads to the same memory location must observe the memory reference order and prevent a store to the memory location from another processor between the intervening loads. If the younger load obtains old data, the older load must not obtain new data. This requirement is called sequential load consistency. To guard against this, LRQ entries for all …
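
The load-hit-store check described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the POWER4 hardware: the `Store` and `Load` records, the `check_load` function, the byte-granular address ranges, and the "youngest older store first" search order are all hypothetical simplifications introduced for illustration. The sketch only classifies the three cases the text names (no older matching store, forwardable, not forwardable) and does not model what the processor does after a failed forward.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Store:
    age: int      # program order; a smaller number means an older instruction
    addr: int     # first byte address written
    data: bytes   # store data waiting in the (modeled) store data queue


@dataclass
class Load:
    age: int
    addr: int     # first byte address read
    size: int     # number of bytes read


def check_load(load: Load, store_queue: List[Store]) -> Tuple[str, Optional[bytes]]:
    """Classify a load against older, not-yet-written stores.

    Returns ("cache", None) if no older store overlaps the load,
    ("forward", data) if the load is the same as or contained within
    the store data, and ("cannot_forward", None) on a partial overlap.
    """
    # Assumption: only older stores matter, and the youngest of them wins.
    older = sorted((s for s in store_queue if s.age < load.age),
                   key=lambda s: s.age, reverse=True)
    load_end = load.addr + load.size
    for store in older:
        store_end = store.addr + len(store.data)
        if load_end <= store.addr or store_end <= load.addr:
            continue  # disjoint byte ranges: this store is irrelevant
        if store.addr <= load.addr and load_end <= store_end:
            offset = load.addr - store.addr
            return "forward", store.data[offset:offset + load.size]
        return "cannot_forward", None  # overlapping but not contained
    return "cache", None


if __name__ == "__main__":
    queue = [Store(age=1, addr=0x100, data=bytes(range(1, 9)))]
    print(check_load(Load(age=2, addr=0x104, size=4), queue))
    print(check_load(Load(age=2, addr=0x106, size=4), queue))
    print(check_load(Load(age=2, addr=0x200, size=4), queue))
```

The second call in the usage lines is the interesting case: the load starts inside the store's bytes but runs past its end, so the data is "overlapping" yet not "contained within the store data".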
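
Returning to the static interleaving slide (4 threads on a non-bypassed 5-stage pipe), the claim that a thread's writeback always finishes before its next instruction reads the regfile can be checked with a little arithmetic. The sketch below is a toy model, not the lecture's pipeline: the function names `schedule`, `hazard_free`, and `render` are hypothetical, fetch is assumed to be strict round-robin with one instruction per cycle, and a register write is assumed to be visible only to reads in a strictly later cycle.

```python
STAGES = ["F", "D", "X", "M", "W"]


def schedule(num_threads, instrs_per_thread):
    """Round-robin fetch: returns (thread, instruction index, fetch cycle)."""
    rows = []
    cycle = 0
    for i in range(instrs_per_thread):
        for t in range(num_threads):
            rows.append((t, i, cycle))
            cycle += 1
    return rows


def hazard_free(num_threads):
    """True if same-thread instructions never need a bypass.

    An instruction fetched at cycle c writes back at c + 4 (stage W).
    The next instruction of that thread is fetched at c + num_threads
    and reads the register file one cycle later (stage D).
    Assumption: the write must land in a strictly earlier cycle.
    """
    writeback_cycle = STAGES.index("W")
    next_read_cycle = num_threads + STAGES.index("D")
    return writeback_cycle < next_read_cycle


def render(rows):
    """Draw a staircase pipeline diagram, three characters per cycle."""
    lines = []
    for t, i, start in rows:
        lines.append(f"T{t + 1}.{i}: " + "   " * start + "  ".join(STAGES))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(schedule(4, 2)))
    for n in (1, 2, 3, 4):
        print(n, "threads, hazard free:", hazard_free(n))
```

Under these assumptions four threads is the smallest count that removes the need for bypassing on a 5-stage pipe, which matches the "4 CPUs, each run at 1/4 clock" picture on the slide. A register file that writes in the first half of a cycle and reads in the second half would relax the strict inequality.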