Berkeley COMPSCI 152 - Lecture 18: Multithreading - D2399608

School name University of California, Berkeley

Course Compsci 152- Computer Architecture and Engineering

Pages 33

This preview shows page 1-2-15-16-17-32-33 out of 33 pages.


Slide titles:
• CS 152 Computer Architecture and Engineering Lecture 18: Multithreading
• Last Time: Vector Computers
• Multithreading
• Pipeline Hazards
• Slide 5
• CDC 6600 Peripheral Processors (Cray, 1964)
• Simple Multithreaded Pipeline
• Multithreading Costs
• Thread Scheduling Policies
• Denelcor HEP (Burton Smith, 1982)
• Tera MTA (1990-97)
• MTA Pipeline
• Coarse-Grain Multithreading
• MIT Alewife (1990)
• IBM PowerPC RS64-IV (2000)
• CS152 Administrivia
• Simultaneous Multithreading (SMT) for OoO Superscalars
• For most apps, most execution units lie idle in an OoO superscalar
• Superscalar Machine Efficiency
• Vertical Multithreading
• Chip Multiprocessing (CMP)
• Ideal Superscalar Multithreading [Tullsen, Eggers, Levy, UW, 1995]
• O-o-O Simultaneous Multithreading [Tullsen, Eggers, Emer, Levy, Stamm, Lo, DEC/UW, 1996]
• IBM Power 4
• PowerPoint Presentation
• Power 5 data flow ...
• Changes in Power 5 to support SMT
• Pentium-4 Hyperthreading (2002)
• Initial Performance of SMT
• SMT adaptation to parallelism type
• Icount Choosing Policy
• Summary: Multithreaded Categories
• Acknowledgements

CS 152 Computer Architecture and Engineering
Lecture 18: Multithreading

Krste Asanovic
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~krste
http://inst.cs.berkeley.edu/~cs152

4/14/2009, CS152-Spring’09

Last Time: Vector Computers
• Vectors provide efficient execution of data-parallel loop codes
• Vector ISA provides compact encoding of machine parallelism
• ISAs scale to more lanes without changing binary code
• Vector registers provide fast temporary storage to reduce memory bandwidth demands, & simplify dependence checking between vector instructions
• Scatter/gather, masking, compress/expand operations increase set of vectorizable loops
• Requires extensive compiler analysis (or programmer annotation) to be certain that loops can be vectorized
• Full “long” vector support still only in supercomputers (NEC SX9, Cray X1E); microprocessors have limited “short” vector operations
  – Intel x86 MMX/SSE/AVX
  – IBM/Motorola PowerPC VMX/Altivec

Multithreading
• Difficult to continue to extract instruction-level parallelism (ILP) or data-level parallelism (DLP) from a single sequential thread of control
• Many workloads can make use of thread-level parallelism (TLP)
  – TLP from multiprogramming (run independent sequential jobs)
  – TLP from multithreaded applications (run one job faster using parallel threads)
• Multithreading uses TLP to improve utilization of a single processor

Pipeline Hazards
• Each instruction may depend on the next

    LW   r1, 0(r2)
    LW   r5, 12(r1)
    ADDI r5, r5, #12
    SW   12(r1), r5

[Pipeline diagram: stages F D X M W over cycles t0 to t14; dependent instructions repeat their D and F stages while they stall]

What is usually done to cope with this?

Multithreading
How can we guarantee no dependencies between instructions in a pipeline?
• One way is to interleave execution of instructions from different program threads on same pipeline

    T1: LW   r1, 0(r2)
    T2: ADD  r7, r1, r4
    T3: XORI r5, r4, #12
    T4: SW   0(r7), r5
    T1: LW   r5, 12(r1)

[Pipeline diagram: stages F D X M W over cycles t0 to t9; one instruction enters the pipeline each cycle, with no stalls]

Interleave 4 threads, T1-T4, on non-bypassed 5-stage pipe
Prior instruction in a thread always completes write-back before next instruction in same thread reads register file

CDC 6600 Peripheral Processors (Cray, 1964)
• First multithreaded hardware
• 10 “virtual” I/O processors
• Fixed interleave on simple pipeline
• Pipeline has 100ns cycle time
• Each virtual processor executes one instruction every 1000ns
• Accumulator-based instruction set to reduce processor state

Simple Multithreaded Pipeline
• Have to carry thread select down pipeline to ensure correct state bits read/written at each pipe stage
• Appears to software (including OS) as multiple, albeit slower, CPUs

[Datapath diagram: a thread-select signal chooses among four PCs and four GPR files; also labeled I$, IR, X, Y, D$]

Multithreading Costs
• Each thread requires its own user state
  – PC
  – GPRs
• Also, needs its own system state
  – virtual memory page table base register
  – exception handling registers
• Other overheads:
  – Additional cache/TLB conflicts from competing threads
  – (or add larger cache/TLB capacity)
  – More OS overhead to schedule more threads (where do all these threads come from?)

Thread Scheduling Policies
• Fixed interleave (CDC 6600 PPUs, 1964)
  – Each of N threads executes one instruction every N cycles
  – If thread not ready to go in its slot, insert pipeline bubble
• Software-controlled interleave (TI ASC PPUs, 1971)
  – OS allocates S pipeline slots amongst N threads
  – Hardware performs fixed interleave over S slots, executing whichever thread is in that slot
• Hardware-controlled thread scheduling (HEP, 1982)
  – Hardware keeps track of which threads are ready to go
  – Picks next thread to execute based on hardware priority scheme

Denelcor HEP (Burton Smith, 1982)
First commercial machine to use hardware threading in main CPU
  – 120 threads per processor
  – 10 MHz clock rate
  – Up to 8 processors
  – precursor to Tera MTA (Multithreaded Architecture)

Tera MTA (1990-97)
• Up to 256 processors
• Up to 128 active threads per processor
• Processors and memory modules populate a sparse 3D torus interconnection fabric
• Flat, shared main memory
  – No data cache
  – Sustains one main memory access per cycle per processor
• GaAs logic in prototype, 1KW/processor @ 260MHz
  – CMOS version, MTA-2, 50W/processor

MTA Pipeline
[Pipeline diagram labeled: Issue Pool, Inst Fetch, Memory Pool, Retry Pool, Write Pool, Memory pipeline, Interconnection Network; stage letters A, C, M, W]
• Every cycle, one VLIW instruction from one active thread is launched into pipeline
• Instruction pipeline is 21 cycles long
• Memory operations incur ~150 cycles of latency

Assuming a single thread issues one instruction every 21 cycles, and clock rate is 260 MHz…
What is single-thread performance?

Coarse-Grain Multithreading
Tera MTA designed for supercomputing applications with large data sets and low locality
  – No data cache
  – Many parallel threads needed to hide large memory latency
Other applications are more cache friendly
  – Few pipeline bubbles if cache mostly has hits
  – Just add a few threads to hide occasional cache miss latencies
  – Swap threads on cache misses

MIT Alewife (1990)
• Modified SPARC chips
  – register windows hold different
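As a hedged aside on the fixed-interleave numbers quoted earlier for the CDC 6600 peripheral processors and the MTA pipeline, the arithmetic can be sketched in a few lines of Python. The function names (`per_thread_period_ns`, `single_thread_mips`, `fixed_interleave`) and the `ready` callback are illustrative assumptions, not part of the lecture. The only inputs taken from the slides are the 100 ns cycle with 10 virtual processors, and the 260 MHz clock with one issue per thread every 21 cycles.

```python
# Illustrative sketch only: names and structure are assumptions, not from the slides.

def per_thread_period_ns(cycle_ns, n_threads):
    """Fixed interleave: each of N threads gets one slot every N cycles."""
    return cycle_ns * n_threads


def single_thread_mips(clock_mhz, cycles_between_issues):
    """Millions of instructions per second for one thread that issues
    one instruction every `cycles_between_issues` cycles."""
    return clock_mhz / cycles_between_issues


def fixed_interleave(n_threads, n_cycles, ready=lambda thread, cycle: True):
    """Return the thread ID issued in each cycle under fixed interleave.
    None marks a pipeline bubble (the slot's thread was not ready)."""
    slots = []
    for cycle in range(n_cycles):
        thread = cycle % n_threads  # slot ownership rotates round-robin
        slots.append(thread if ready(thread, cycle) else None)
    return slots


# CDC 6600 PPUs: 100 ns cycle, 10 virtual processors
print(per_thread_period_ns(100, 10))          # 1000

# MTA: 260 MHz clock, one instruction per thread every 21 cycles
print(round(single_thread_mips(260, 21), 2))  # 12.38

# Four threads (numbered 0-3 here), thread 2 never ready: its slots become bubbles
print(fixed_interleave(4, 8, ready=lambda t, c: t != 2))
```

Under these assumptions a single MTA thread sustains roughly 12.4 million instructions per second. That is why the machine relies on many active threads per processor to keep the pipeline full.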
