
CS 152 Computer Architecture and Engineering
Lecture 14 - Multithreading

Krste Asanovic
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~krste
http://inst.cs.berkeley.edu/~cs152

March 16, 2011, CS152 Spring 2011


Last Time, Lecture 13: VLIW

- In a classic VLIW, the compiler is responsible for avoiding all hazards: simple hardware, complex compiler. Later VLIWs added more dynamic hardware interlocks.
- Use loop unrolling and software pipelining for loops, trace scheduling for more irregular code.
- Static scheduling is difficult in the presence of unpredictable branches and variable-latency memory.


Multithreading

- Difficult to continue to extract instruction-level parallelism (ILP) from a single sequential thread of control.
- Many workloads can make use of thread-level parallelism (TLP):
    - TLP from multiprogramming (run independent sequential jobs)
    - TLP from multithreaded applications (run one job faster using parallel threads)
- Multithreading uses TLP to improve utilization of a single processor.


Pipeline Hazards

                    t0  t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14
LW   r1, 0(r2)      F   D   X   M   W
LW   r5, 12(r1)         F   D   D   D   D   X   M   W
ADDI r5, r5, #12            F   F   F   F   D   D   D   D   X   M   W
SW   12(r1), r5                             F   F   F   F   D   D   D   D

- Each instruction may depend on the one before it.
- What is usually done to cope with this?
    - interlocks (slow)
    - or bypassing (needs hardware, doesn't help all hazards)


Multithreading

How can we guarantee no dependencies between instructions in a pipeline? One way is to interleave execution of instructions from different program threads on the same pipeline.

Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe:

                        t0  t1  t2  t3  t4  t5  t6  t7  t8  t9
T1: LW   r1, 0(r2)      F   D   X   M   W
T2: ADD  r7, r1, r4         F   D   X   M   W
T3: XORI r5, r4, #12            F   D   X   M   W
T4: SW   0(r7), r5                  F   D   X   M   W
T1: LW   r5, 12(r1)                     F   D   X   M   W

The prior instruction in a thread always completes writeback before the next instruction in the same thread reads the register file.


CDC 6600 Peripheral Processors (Cray, 1964)

- First multithreaded hardware
- 10 "virtual" I/O processors
- Fixed interleave on simple pipeline
- Pipeline has 100 ns cycle time
- Each virtual processor executes one instruction every 1000 ns
- Accumulator-based instruction set to reduce processor state


Simple Multithreaded Pipeline

[Figure: pipeline datapath with four PCs and four general-purpose register files (GPR1), one per thread, selected by a thread select signal that is carried down the pipeline.]

- Have to carry thread select down the pipeline to ensure the correct state bits are read/written at each pipe stage.
- Appears to software (including the OS) as multiple, albeit slower, CPUs.


Multithreading Costs

- Each thread requires its own user state:
    - PC
    - GPRs
- Also needs its own system state:
    - virtual memory page table base register
    - exception handling registers
- Other overheads:
    - Additional cache/TLB conflicts from competing threads (or add larger cache/TLB capacity)
    - More OS overhead to schedule more threads (where do all these threads come from?)


Thread Scheduling Policies

- Fixed interleave (CDC 6600 PPUs, 1964)
    - Each of N threads executes one instruction every N cycles
    - If a thread is not ready to go in its slot, insert a pipeline bubble
- Software-controlled interleave (TI ASC PPUs, 1971)
    - OS allocates S pipeline slots amongst N threads
    - Hardware performs fixed interleave over S slots, executing whichever thread is in that slot
- Hardware-controlled thread scheduling (HEP, 1982)
    - Hardware keeps track of which threads are ready to go
    - Picks next thread to execute based on hardware priority scheme


Denelcor HEP (Burton Smith, 1982)

First commercial machine to use hardware threading in the main CPU

- 120 threads per processor
- 10 MHz clock rate
- Up to 8 processors
- Precursor to Tera MTA (Multithreaded Architecture)


Tera MTA (1990)

- Up to 256 processors
- Up to 128 active threads per processor
- Processors and memory modules populate a sparse 3D torus interconnection fabric
- Flat, shared main memory
    - No data cache
    - Sustains one main memory access per cycle per processor
- GaAs logic in prototype, 1 kW/processor at 260 MHz
    - Second version CMOS, MTA-2, 50 W/processor
    - New version, XMT, fits into AMD Opteron socket, runs at 500 MHz


MTA Pipeline

[Figure: MTA pipeline showing Inst Fetch, Issue Pool, the M, A and C units each followed by a W stage, Write Pool, Memory Pool, Retry Pool, Interconnection Network, and Memory pipeline.]

- Every cycle, one VLIW instruction from one active thread is launched into the pipeline
- Instruction pipeline is 21 cycles long
- Memory operations incur about 150 cycles of latency

Assuming a single thread issues one instruction every 21 cycles, and the clock rate is 260 MHz, what is single-thread performance?

Effective single-thread issue rate is 260/21 ≈ 12.4 MIPS.


Coarse-Grain Multithreading

- Tera MTA designed for supercomputing applications with large data sets and low locality
    - No data cache
    - Many parallel threads needed to hide large memory latency
- Other applications are more cache friendly
    - Few pipeline bubbles if cache mostly has hits
    - Just add a few threads to hide occasional cache miss latencies
    - Swap threads on cache misses


MIT Alewife (1990)

- Modified SPARC chips
    - register windows hold different thread contexts
- Up to four threads per node
- Thread switch on local cache miss


IBM PowerPC RS64-IV (2000)

- Commercial coarse-grain multithreading CPU
- Based on PowerPC with quad-issue, in-order, five-stage pipeline
- Each physical CPU supports two virtual CPUs
- On L2 cache miss, pipeline is flushed and execution switches to second thread
    - short pipeline minimizes flush penalty (4 cycles), small compared to memory access latency
    - flush pipeline to simplify exception handling


Oracle/Sun Niagara processors

- Target is datacenters running web servers and databases, with many concurrent requests
- Provide multiple simple cores, each with multiple hardware threads, for reduced energy per operation, though much lower single-thread performance
- Niagara-1 (2004): 8 cores, 4 threads/core
- Niagara-2 (2007): 8 cores, 8 threads/core
- Niagara-3 (2009): 16 cores, 8 threads/core


Oracle/Sun Niagara-3, "Rainbow Falls" (2009)


CS152 Administrivia

- Quiz 3, Monday March 28 (first class back after spring break)
    - All material on complex pipelining and branch prediction, L10-L12, plus review at start of L13
    - PS3, Lab 3
- Pick up hard copy draft of Hennessy & Patterson 5th edition chapter on Vector Processing/GPUs
    - Read before
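
Worked example: fixed interleave

The fixed-interleave policy described in the Multithreading and Thread Scheduling Policies slides can be checked with a small model. The following is a minimal sketch, not code from the lecture. It assumes the five stages F, D, X, M, W shown in the slides, one instruction fetched per cycle, no stalls, and a register file that is read in D and written in W with no bypassing. All function names are hypothetical and chosen for illustration.

```python
# Minimal model of fixed-interleave multithreading on a non-bypassed pipeline.
# Assumption: stages F, D, X, M, W; one fetch per cycle; every thread is ready.
STAGES = ["F", "D", "X", "M", "W"]


def fixed_interleave_schedule(num_threads, num_instructions):
    """Return a list of (thread_id, {stage: cycle}) in issue order.

    Instruction i is fetched in cycle i and belongs to thread i % num_threads.
    """
    schedule = []
    for i in range(num_instructions):
        thread = i % num_threads
        timing = {stage: i + k for k, stage in enumerate(STAGES)}
        schedule.append((thread, timing))
    return schedule


def hazard_free(schedule):
    """True if each thread's writeback finishes strictly before that
    thread's next instruction reaches decode (register read)."""
    last_writeback = {}
    for thread, timing in schedule:
        if thread in last_writeback and last_writeback[thread] >= timing["D"]:
            return False
        last_writeback[thread] = timing["W"]
    return True


def instruction_period_ns(cycle_ns, num_threads):
    """Time between successive instructions of one thread."""
    return cycle_ns * num_threads


def single_thread_mips(clock_mhz, issue_interval_cycles):
    """Issue rate of one thread that issues once every issue_interval_cycles."""
    return clock_mhz / issue_interval_cycles


if __name__ == "__main__":
    for n in (1, 3, 4):
        ok = hazard_free(fixed_interleave_schedule(n, 12))
        print(f"{n} thread(s): hazard free = {ok}")
    print("CDC 6600 PPU period (ns):", instruction_period_ns(100, 10))
    print("MTA single-thread MIPS:", round(single_thread_mips(260, 21), 1))
```

Under these assumptions, four threads are the minimum for this pipe. With four threads, a thread's first instruction writes back in cycle 4 and its next instruction decodes in cycle 5. With three threads, the next decode would fall in cycle 4, the same cycle as the writeback, so the check fails. The same one-instruction-every-N-cycles arithmetic gives the 1000 ns period of the CDC 6600 peripheral processors and the 260/21 ≈ 12.4 MIPS figure for a single MTA thread.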

