Berkeley COMPSCI 152 - Lecture 19 – Advanced Processors III - D1651131


School: University of California, Berkeley

Course: COMPSCI 152 - Computer Architecture and Engineering

Pages 29


CS 152 Computer Architecture and Engineering
Lecture 19: Advanced Processors III
2005-11-3
John Lazzaro (www.cs.berkeley.edu/~lazzaro)
TAs: David Marquardt and Udam Saini
www-inst.eecs.berkeley.edu/~cs152/
CS 152 L19: Advanced Processors III. UC Regents Fall 2005, UCB

Last Time: Dynamic Scheduling

Reorder Buffer: each line holds the physical <src1, src2, dest> registers for an instruction and controls when it executes.

[Figure: reorder buffer with columns Inst #, src1 #, src1 val, src2 #, src2 val, dest #, dest val. It feeds a Load Unit (from memory), ALU #1, ALU #2, and a Store Unit (to memory). Results return on the Common Data Bus as <reg #, reg val>.]

The execution engine works on the physical registers, not the architecture registers.

Today: Throughput and multiple threads

Goal: use multiple instruction streams to improve
  (1) throughput of machines that run many programs,
  (2) execution time of multithreaded programs.

Example: Sun Niagara (32 instruction streams on a chip).

Difficulties: gaining full advantage requires rewriting applications, OS, and libraries.

Ultimate limiter: Amdahl's law (application dependent) and memory system performance.

Throughput Computing

Multithreading: interleave instructions from separate threads on the same hardware. Seen by the OS as several CPUs.

Multi-core: integrating several processors that (partially) share a memory system on the same chip.

Multi-Threading (static pipelines)

Recall: the bypass network prevents stalls. Instead of bypass, interleave threads on the pipeline to prevent stalls.

[Figure: 5-stage pipeline datapath (ID/Decode, EX, MEM, WB) with register file, ALU, data memory, and the "From WB" bypass mux.]

One way is to interleave execution of instructions from different program threads on the same pipeline. Introduced in 1964 by Seymour Cray.

Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe:

                        t0 t1 t2 t3 t4 t5 t6 t7 t8 t9
T1: LW   r1, 0(r2)      F  D  X  M  W
T2: ADD  r7, r1, r4        F  D  X  M  W
T3: XORI r5, r4, 12           F  D  X  M  W
T4: SW   0(r7), r5               F  D  X  M  W
T1: LW   r5, 12(r1)                 F  D  X  M  W

The OS sees 4 CPUs, each running at 1/4 clock. The last instruction in a thread always completes writeback before the next instruction in the same thread reads the regfile.

[Figure: Simple Multithreaded Pipeline, with one PC and one GPR file per thread and a thread select (Krste, November 10, 2004, 6.823 L18-5).]

Many variants.

Multi-Threading (dynamic scheduling)

Power 4 (predates Power 5, shown Tuesday)

Single-threaded predecessor to Power 5. 8 execution units in the out-of-order engine; each may issue an instruction each cycle.

[Figure 4: POWER4 instruction execution pipeline. Instruction fetch (IF, IC, BP); instruction crack and group formation (D0, D1, D2, D3, Xfer, GD); out-of-order processing with BR (MP, ISS, RF, EX), LD/ST (MP, ISS, RF, EA, DC, Fmt, WB, Xfer), FX (MP, ISS, RF, EX, WB, Xfer), and FP (MP, ISS, RF, F6, WB, Xfer) pipelines; CP; branch redirects; interrupts and flushes.]

For most apps, most execution units lie idle

[Figure: for an 8-way superscalar, stacked bars of Percent of Total Issue Cycles (0 to 100) per application. Categories: memory conflict, long fp, short fp, long integer, short integer, load delays, control hazards, branch misprediction, dcache miss, icache miss, dtlb miss, itlb miss, processor busy. Applications: alvinn, doduc, eqntott, espresso, fpppp, hydro2d, li, mdljdp2, mdljsp2, nasa7, ora, su2cor, swm, tomcatv, composite.]

From: Tullsen, Eggers, and Levy, "Simultaneous Multithreading: Maximizing On-chip Parallelism," ISCA 1995.

Observation: most hardware in an out-of-order CPU concerns physical registers. Could several instruction threads share this hardware?

Simultaneous Multi-threading

[Figure: two issue-slot grids, "One thread, 8 units" and "Two threads, 8 units". Rows are cycles 1 to 9; columns are the units M, M, FX, FX, FP, FP, BR, CC.]

M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes

Power 4 and Power 5 pipelines

[Figure: the POWER4 instruction execution pipeline (Figure 4) shown with the Power 5 pipeline. Both show instruction fetch, group formation and decode, out-of-order processing (branch, load/store, fixed-point, and floating-point pipelines), branch redirects, and interrupts and flushes.]

Power 5 annotations: 2 fetch (PC), 2 initial decodes; 2 commits (architected register sets).

Stage legend (beginning cut off in the source): ... compute address, DC = data caches, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, and CP = group commit.

Text fragments captured with the figure (both are cut off in the source):

Load hit store: A younger load that executes before an older store to the same memory location has written its data to the caches must retrieve the data from the SDQ. As loads execute, they check the SRQ to see whether there is any older store to the same memory location with data in the SDQ. If one is found, the data is forwarded from the SDQ rather than from the cache. If the data cannot be forwarded, as is the case if the load and store instructions operate on overlapping memory locations and the load data is not the same as or ...

Load hit load: Two loads to the same memory location must observe the memory reference order and prevent a store to the memory location from another processor between the intervening loads. If the younger load obtains old data, the older load must not obtain new data. This requirement is called sequential load consistency. To guard against this, LRQ entries for all loads include a bit which, if set, indicates that a snoop has occurred to the line containing the loaded data for that entry. When a load instruction executes, it ...

Power 5 data flow

[Figure labels: Branch prediction (branch history tables, return stack, target cache); program counter; alternate; instruction cache; instruction translation; thread priority; instruction buffer 0; instruction buffer 1; group formation, instruction decode, dispatch; dynamic instruction selection; shared issue queues; shared execution units (LSU0, FXU0, LSU1, FXU1, FPU0, FPU1, BXU); data cache; data translation; group completion; store queue.]
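
The static-pipeline slide earlier in the lecture states that with 4 threads interleaved on a non-bypassed 5-stage pipe, an instruction always completes writeback before the next instruction in the same thread reads the register file. The Python sketch below checks that timing claim. It is an illustration only, not code from the lecture: the function names are hypothetical, and it assumes strict round-robin issue of one instruction per cycle, register reads in D, register writes in W, and that a write and a read in the same cycle count as a hazard.

```python
# Stages of the 5-stage pipe from the slide: Fetch, Decode, eXecute, Memory, Writeback.
STAGES = ["F", "D", "X", "M", "W"]


def interleave_schedule(num_threads, instrs_per_thread):
    """Round-robin issue, one instruction per cycle (assumed policy).

    Returns a list of (thread, index_within_thread, {stage: cycle}).
    """
    schedule = []
    for slot in range(num_threads * instrs_per_thread):
        thread = slot % num_threads
        index = slot // num_threads
        # The instruction issued in cycle `slot` occupies stage k in cycle slot + k.
        cycles = {stage: slot + offset for offset, stage in enumerate(STAGES)}
        schedule.append((thread, index, cycles))
    return schedule


def hazard_free(schedule):
    """True if each instruction's W cycle is strictly earlier than the D cycle
    of the next instruction from the same thread (strictness is an assumption)."""
    last_writeback = {}
    for thread, _, cycles in schedule:
        if thread in last_writeback and last_writeback[thread] >= cycles["D"]:
            return False
        last_writeback[thread] = cycles["W"]
    return True


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        print(n, "threads:", hazard_free(interleave_schedule(n, 3)))
```

Under these assumptions the check passes for 4 threads (writeback in cycle t+4, the same thread's next decode in cycle t+5) and fails for fewer. A register file that writes in the first half of a cycle and reads in the second half would relax the strict comparison, which is one reason the slide notes that many variants exist.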
