Berkeley COMPSCI 152 - Lecture Notes

March 16, 2011 CS152, Spring 2011

CS 152 Computer Architecture and Engineering
Lecture 14: Multithreading

Krste Asanovic
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~krste
http://inst.cs.berkeley.edu/~cs152

Last Time Lecture 13: VLIW
• In a classic VLIW, the compiler is responsible for avoiding all hazards -> simple hardware, complex compiler. Later VLIWs added more dynamic hardware interlocks
• Use loop unrolling and software pipelining for loops, trace scheduling for more irregular code
• Static scheduling is difficult in the presence of unpredictable branches and variable-latency memory

Multithreading
• Difficult to continue to extract instruction-level parallelism (ILP) from a single sequential thread of control
• Many workloads can make use of thread-level parallelism (TLP)
  – TLP from multiprogramming (run independent sequential jobs)
  – TLP from multithreaded applications (run one job faster using parallel threads)
• Multithreading uses TLP to improve utilization of a single processor

Pipeline Hazards
• Each instruction may depend on the one before it

  LW   r1, 0(r2)
  LW   r5, 12(r1)
  ADDI r5, r5, #12
  SW   12(r1), r5

  [Pipeline timing diagram, cycles t0 to t14: each instruction passes through F D X M W, with dependent instructions held in D]

What is usually done to cope with this?
  – interlocks (slow)
  – or bypassing (needs hardware, doesn't help all hazards)

Multithreading
How can we guarantee no dependencies between instructions in a pipeline?
  – One way is to interleave execution of instructions from different program threads on the same pipeline

                        t0 t1 t2 t3 t4 t5 t6 t7 t8
  T1: LW   r1, 0(r2)    F  D  X  M  W
  T2: ADD  r7, r1, r4      F  D  X  M  W
  T3: XORI r5, r4, #12        F  D  X  M  W
  T4: SW   0(r7), r5             F  D  X  M  W
  T1: LW   r5, 12(r1)               F  D  X  M  W

Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe
Prior instruction in a thread always completes write-back before the next instruction in the same thread reads the register file

CDC 6600 Peripheral Processors (Cray, 1964)
• First multithreaded hardware
• 10 "virtual" I/O processors
• Fixed interleave on simple pipeline
• Pipeline has 100 ns cycle time
• Each virtual processor executes one instruction every 1000 ns
• Accumulator-based instruction set to reduce processor state

Simple Multithreaded Pipeline
• Have to carry thread select down the pipeline to ensure the correct state bits are read/written at each pipe stage
• Appears to software (including OS) as multiple, albeit slower, CPUs

  [Datapath diagram: thread select chooses among four PCs and four GPR files around a shared I$, IR, and D$]

Multithreading Costs
• Each thread requires its own user state
  – PC
  – GPRs
• Also, needs its own system state
  – virtual memory page table base register
  – exception handling registers
• Other overheads:
  – Additional cache/TLB conflicts from competing threads
  – (or add larger cache/TLB capacity)
  – More OS overhead to schedule more threads (where do all these threads come from?)

Thread Scheduling Policies
• Fixed interleave (CDC 6600 PPUs, 1964)
  – Each of N threads executes one instruction every N cycles
  – If thread not ready to go in its slot, insert pipeline bubble
• Software-controlled interleave (TI ASC PPUs, 1971)
  – OS allocates S pipeline slots amongst N threads
  – Hardware performs fixed interleave over S slots, executing whichever thread is in that slot
• Hardware-controlled thread scheduling (HEP, 1982)
  – Hardware keeps track of which threads are ready to go
  – Picks next thread to execute based on hardware priority scheme

Denelcor HEP (Burton Smith, 1982)
First commercial machine to use hardware threading in main CPU
  – 120 threads per processor
  – 10 MHz clock rate
  – Up to 8 processors
  – precursor to Tera MTA (Multithreaded Architecture)

Tera MTA (1990-)
• Up to 256 processors
• Up to 128 active threads per processor
• Processors and memory modules populate a sparse 3D torus interconnection fabric
• Flat, shared main memory
  – No data cache
  – Sustains one main memory access per cycle per processor
• GaAs logic in prototype, 1 kW/processor @ 260 MHz
  – Second version CMOS, MTA-2, 50 W/processor
  – New version, XMT, fits into AMD Opteron socket, runs at 500 MHz

MTA Pipeline
  [Pipeline diagram: Inst Fetch, Issue Pool, M/A/C units with write-back (W), Memory Pool, Retry Pool, Write Pool, Interconnection Network, memory pipeline]
• Every cycle, one VLIW instruction from one active thread is launched into the pipeline
• Instruction pipeline is 21 cycles long
• Memory operations incur ~150 cycles of latency

Assuming a single thread issues one instruction every 21 cycles, and clock rate is 260 MHz…
What is single-thread performance?
Effective single-thread issue rate is 260/21 = 12.4 MIPS

Coarse-Grain Multithreading
Tera MTA designed for supercomputing applications with large data sets and low locality
  – No data cache
  – Many parallel threads needed to hide large memory latency
Other applications are more cache friendly
  – Few pipeline bubbles if cache mostly has hits
  – Just add a few threads to hide occasional cache miss latencies
  – Swap threads on cache misses

MIT Alewife (1990)
• Modified SPARC chips
  – register windows hold different thread contexts
• Up to four threads per node
• Thread switch on local cache miss

IBM PowerPC RS64-IV (2000)
• Commercial coarse-grain multithreading CPU
• Based on PowerPC with quad-issue in-order five-stage pipeline
• Each physical CPU supports two virtual CPUs
• On L2 cache miss, pipeline is flushed and execution switches to second thread
  – short pipeline minimizes flush penalty (4 cycles), small compared to memory access latency
  – flush pipeline to simplify exception handling

Oracle/Sun Niagara
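The fixed-interleave figures quoted earlier in this lecture (CDC 6600 peripheral processors, the 4-thread interleave on a 5-stage pipe, and the Tera MTA single-thread rate) can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original slides. The function names are hypothetical, and the hazard check assumes that registers are read in stage D, written in stage W, and that a write must finish in an earlier cycle than the dependent read.

```python
# Sketch only: names and the stage-timing model are illustrative assumptions.

STAGES = ("F", "D", "X", "M", "W")  # classic 5-stage pipe, one stage per cycle


def per_thread_mips(clock_mhz: float, interleave_cycles: int) -> float:
    """Peak rate of ONE thread (millions of instructions per second) when it
    issues exactly one instruction every `interleave_cycles` cycles."""
    if interleave_cycles <= 0:
        raise ValueError("interleave_cycles must be positive")
    return clock_mhz / interleave_cycles


def ns_per_instruction(cycle_ns: float, interleave_cycles: int) -> float:
    """Time between successive instructions of one thread, in nanoseconds."""
    return cycle_ns * interleave_cycles


def writeback_precedes_read(num_threads: int) -> bool:
    """True if, under fixed interleave over `num_threads` threads, an
    instruction's write-back cycle is strictly earlier than the cycle in
    which the next instruction of the SAME thread reads the register file.

    Assumption: reads happen in D, writes in W, no same-cycle forwarding.
    """
    read_stage = STAGES.index("D")   # 1
    write_stage = STAGES.index("W")  # 4
    prev_write_cycle = 0 + write_stage           # earlier instruction fetched at cycle 0
    next_read_cycle = num_threads + read_stage   # next one fetched num_threads cycles later
    return prev_write_cycle < next_read_cycle


if __name__ == "__main__":
    # Tera MTA: 260 MHz clock, one instruction per thread every 21 cycles.
    print(f"MTA single thread: {per_thread_mips(260.0, 21):.1f} MIPS")

    # CDC 6600 PPUs: 100 ns cycle, 10 virtual processors in fixed interleave.
    print(f"CDC 6600 PPU: one instruction every {ns_per_instruction(100.0, 10):.0f} ns")

    # Non-bypassed 5-stage pipe: how many interleaved threads avoid interlocks?
    for n in (3, 4):
        print(f"{n} threads hazard-free: {writeback_precedes_read(n)}")
```

Under these assumptions, four threads is the smallest interleave that needs no interlocks on the 5-stage pipe, which matches the T1-T4 example; a register file that writes in the first half of a cycle and reads in the second half would change that bound.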

