CS 152 Computer Architecture and Engineering
Lecture 18: Multithreading

Krste Asanovic
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~krste
http://inst.cs.berkeley.edu/~cs152

4/14/2009, CS152 Spring '09

Last Time: Vector Computers
- Vectors provide efficient execution of data-parallel loop codes
- Vector ISA provides compact encoding of machine parallelism
- ISAs scale to more lanes without changing binary code
- Vector registers provide fast temporary storage to reduce memory bandwidth demands, and simplify dependence checking between vector instructions
- Scatter/gather, masking, and compress/expand operations increase set of vectorizable loops
- Requires extensive compiler analysis (or programmer annotation) to be certain that loops can be vectorized
- Full "long" vector support still only in supercomputers (NEC SX9, Cray X1E); microprocessors have limited "short" vector operations
  - Intel x86 MMX/SSE/AVX
  - IBM/Motorola PowerPC VMX/Altivec

Multithreading
- Difficult to continue to extract instruction-level parallelism (ILP) or data-level parallelism (DLP) from a single sequential thread of control
- Many workloads can make use of thread-level parallelism (TLP)
  - TLP from multiprogramming (run independent sequential jobs)
  - TLP from multithreaded applications (run one job faster using parallel threads)
- Multithreading uses TLP to improve utilization of a single processor

Pipeline Hazards

                    t0  t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14
LW   r1, 0(r2)      F   D   X   M   W
LW   r5, 12(r1)         F   D   D   D   D   X   M   W
ADDI r5, r5, #12            F   F   F   F   D   D   D   D   X   M   W
SW   12(r1), r5                             F   F   F   F   D   D   D   D

- Each instruction may depend on the next
- What is usually done to cope with this?

Multithreading
- How can we guarantee no dependencies between instructions in a pipeline?
- One way is to interleave execution of instructions from different program threads on same pipeline

Interleave 4 threads, T1-T4, on non-bypassed 5-stage pipe

                        t0  t1  t2  t3  t4  t5  t6  t7  t8  t9
T1: LW   r1, 0(r2)      F   D   X   M   W
T2: ADD  r7, r1, r4         F   D   X   M   W
T3: XORI r5, r4, #12            F   D   X   M   W
T4: SW   0(r7), r5                  F   D   X   M   W
T1: LW   r5, 12(r1)                     F   D   X   M   W

- Prior instruction in a thread always completes writeback before next instruction in same thread reads register file

CDC 6600 Peripheral Processors (Cray, 1964)
- First multithreaded hardware
- 10 "virtual" I/O processors
- Fixed interleave on simple pipeline
- Pipeline has 100ns cycle time
- Each virtual processor executes one instruction every 1000ns
- Accumulator-based instruction set to reduce processor state

Simple Multithreaded Pipeline
[Figure: pipeline datapath with four replicated PCs and four replicated GPR files, selected by a 2-bit thread select signal]
- Have to carry thread select down pipeline to ensure correct state bits read/written at each pipe stage
- Appears to software (including OS) as multiple, albeit slower, CPUs

Multithreading Costs
- Each thread requires its own user state: PC, GPRs
- Also needs its own system state: virtual memory page table base register, exception handling registers
- Other overheads:
  - Additional cache/TLB conflicts from competing threads (or add larger cache/TLB capacity)
  - More OS overhead to schedule more threads (where do all these threads come from?)

Thread Scheduling Policies
- Fixed interleave (CDC 6600 PPUs, 1964)
  - Each of N threads executes one instruction every N cycles
  - If thread not ready to go in its slot, insert pipeline bubble
- Software-controlled interleave (TI ASC PPUs, 1971)
  - OS allocates S pipeline slots amongst N threads
  - Hardware performs fixed interleave over S slots, executing whichever thread is in that slot
- Hardware-controlled thread scheduling (HEP, 1982)
  - Hardware keeps track of which threads are ready to go
  - Picks next thread to execute based on hardware priority scheme

Denelcor HEP (Burton Smith, 1982)
- First commercial machine to use hardware threading in main CPU
  - 120 threads per processor
  - 10 MHz clock rate
  - Up to 8 processors
  - Precursor to Tera MTA (Multithreaded Architecture)

Tera MTA (1990-97)
- Up to 256 processors
- Up to 128 active threads per processor
- Processors and memory modules populate a sparse 3D torus interconnection fabric
- Flat, shared main memory
  - No data cache
  - Sustains one main memory access per cycle per processor
- GaAs logic in prototype, 1KW/processor at 260MHz
  - CMOS version, MTA-2, 50W/processor

MTA Pipeline
[Figure: MTA pipeline showing instruction fetch, issue pool, M/A/C operation units, write pool, memory pool, retry pool, interconnection network, and memory pipeline]
- Every cycle, one VLIW instruction from one active thread is launched into pipeline
- Instruction pipeline is 21 cycles long
- Memory operations incur 150 cycles of latency
- Assuming a single thread issues one instruction every 21 cycles, and clock rate is 260 MHz: what is single-thread performance?

Coarse-Grain Multithreading
- Tera MTA designed for supercomputing applications with large data sets and low locality
  - No data cache
  - Many parallel threads needed to hide large memory latency
- Other applications are more cache friendly
  - Few pipeline bubbles if cache mostly has hits
  - Just add a few threads to hide occasional cache miss latencies
  - Swap threads on cache misses

MIT Alewife (1990)
- Modified SPARC chips: register windows hold different thread contexts
- Up to four threads per node
- Thread switch on local cache miss

IBM PowerPC RS64-IV (2000)
- Commercial coarse-grain multithreading CPU
- Based on PowerPC with quad-issue in-order five-stage pipeline
- Each physical CPU supports two virtual CPUs
- On L2 cache miss, pipeline is flushed and execution switches to second thread
  - Short pipeline minimizes flush penalty (4 cycles), small compared to memory access latency
  - Flush pipeline to simplify exception handling

CS152 Administrivia

Simultaneous Multithreading (SMT) for OoO Superscalars
- Techniques presented so far have all been "vertical" multithreading, where each pipeline stage works on one thread at a time
- SMT uses fine-grain control already present inside an OoO superscalar to allow instructions from multiple threads to enter execution on same clock cycle. Gives better utilization of machine resources.

For most apps, most execution units lie idle in an OoO superscalar
[Figure: execution unit utilization for an 8-way superscalar]
From: Tullsen, Eggers, and Levy, "Simultaneous Multithreading: Maximizing On-chip Parallelism", ISCA 1995.

Superscalar
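The single-thread performance question on the MTA Pipeline slide, and the CDC 6600 figure of one instruction every 1000ns per virtual processor, both come down to dividing the clock rate by the issue interval. The sketch below works that arithmetic; the function name single_thread_mips is a hypothetical helper, not something from the lecture.

```python
def single_thread_mips(clock_hz: float, issue_interval_cycles: int) -> float:
    """Millions of instructions per second for a thread that issues
    one instruction every `issue_interval_cycles` cycles."""
    return clock_hz / issue_interval_cycles / 1e6


# Tera MTA: 260 MHz clock, one instruction per thread every 21 cycles.
mta_mips = single_thread_mips(260e6, 21)

# CDC 6600 PPUs: 100 ns cycle (10 MHz), 10 virtual processors in fixed
# interleave, so each one issues every 10 cycles (every 1000 ns).
cdc_mips = single_thread_mips(10e6, 10)

print(f"MTA single thread: {mta_mips:.2f} MIPS")   # about 12.38 MIPS
print(f"CDC 6600 PPU:      {cdc_mips:.2f} MIPS")   # 1.00 MIPS
```

A single MTA thread therefore runs at roughly 12 MIPS, which is why the machine depends on many active threads per processor for throughput.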
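The two timing diagrams (Pipeline Hazards and the 4-thread interleave) can be reproduced with a small timing model. This is a minimal sketch under stated assumptions: a single-issue, in-order, non-bypassed F-D-X-M-W pipe where registers are read in D and written in W, and an instruction must stay in D until the cycle after its producer's W. The names Instr and writeback_cycles are hypothetical, and memory dependences are ignored.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    thread: int               # hardware thread that owns the instruction
    dest: Optional[str]       # register written in W, or None (e.g. a store)
    srcs: Tuple[str, ...]     # registers read in D


def writeback_cycles(program: List[Instr]) -> List[int]:
    """Return the cycle in which each instruction reaches W."""
    last_write = {}           # (thread, register) -> W cycle of latest producer
    wb = []
    prev_d_exit = 0           # makes the first instruction decode in cycle 1
    for ins in program:
        d_enter = prev_d_exit + 1          # D frees up when the previous one leaves
        ready = max((last_write[(ins.thread, r)] + 1
                     for r in ins.srcs if (ins.thread, r) in last_write),
                    default=0)             # earliest cycle all sources are written
        d_exit = max(d_enter, ready)       # stall in D until sources are ready
        w = d_exit + 3                     # X, M, W follow without further stalls
        if ins.dest is not None:
            last_write[(ins.thread, ins.dest)] = w
        wb.append(w)
        prev_d_exit = d_exit
    return wb


# Single thread, as on the Pipeline Hazards slide.
SINGLE_THREAD = [
    Instr(1, "r1", ("r2",)),          # LW   r1, 0(r2)
    Instr(1, "r5", ("r1",)),          # LW   r5, 12(r1)
    Instr(1, "r5", ("r5",)),          # ADDI r5, r5, #12
    Instr(1, None, ("r1", "r5")),     # SW   12(r1), r5
]

# Four threads interleaved, as on the Multithreading slide.
INTERLEAVED = [
    Instr(1, "r1", ("r2",)),          # T1: LW   r1, 0(r2)
    Instr(2, "r7", ("r1", "r4")),     # T2: ADD  r7, r1, r4
    Instr(3, "r5", ("r4",)),          # T3: XORI r5, r4, #12
    Instr(4, None, ("r7", "r5")),     # T4: SW   0(r7), r5
    Instr(1, "r5", ("r1",)),          # T1: LW   r5, 12(r1)
]

print(writeback_cycles(SINGLE_THREAD))   # [4, 8, 12, 16]
print(writeback_cycles(INTERLEAVED))     # [4, 5, 6, 7, 8]
```

In the single-thread case each dependent instruction waits three extra cycles in D. With four threads in rotation, the second T1 instruction decodes in t5, one cycle after the first T1 instruction's writeback in t4, so no stall is needed.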
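The three policies on the Thread Scheduling Policies slide differ only in how the thread for a given cycle is chosen. The sketch below is illustrative: all function names are hypothetical, the bubble-on-not-ready behaviour for software-controlled interleave is an assumption (the slide states it only for fixed interleave), and the rotating priority used for hardware-controlled scheduling is a stand-in, since the slide does not describe the HEP's actual priority scheme.

```python
from typing import Optional, Sequence, Set


def fixed_interleave(cycle: int, n_threads: int, ready: Set[int]) -> Optional[int]:
    """Thread i owns every n-th cycle; None means a pipeline bubble."""
    t = cycle % n_threads
    return t if t in ready else None


def software_interleave(cycle: int, slot_table: Sequence[int],
                        ready: Set[int]) -> Optional[int]:
    """OS fills a table of S slots; hardware walks the table in order.
    Assumption: a slot whose thread is not ready becomes a bubble."""
    t = slot_table[cycle % len(slot_table)]
    return t if t in ready else None


def hardware_scheduled(cycle: int, n_threads: int, ready: Set[int]) -> Optional[int]:
    """Hardware picks any ready thread; rotating priority is assumed here."""
    for offset in range(n_threads):
        t = (cycle + offset) % n_threads
        if t in ready:
            return t
    return None


ready_threads = {0, 2}   # threads 1 and 3 are waiting, e.g. on memory
fixed = [fixed_interleave(c, 4, ready_threads) for c in range(4)]
soft = [software_interleave(c, [0, 0, 2, 1], ready_threads) for c in range(4)]
hard = [hardware_scheduled(c, 4, ready_threads) for c in range(4)]

print(fixed)   # [0, None, 2, None]
print(soft)    # [0, 0, 2, None]
print(hard)    # [0, 2, 2, 0]
```

With two of four threads stalled, fixed interleave wastes half the issue slots, the OS-built slot table wastes one, and the hardware scheduler fills every slot.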


Berkeley COMPSCI 152 - Lecture 18: Multithreading
