CS 240A: Shared Memory & Multicore Programming with Cilk++

Slide 1: Overview
• Multicore and NUMA architectures
• Multithreaded Programming
• Cilk++ as a concurrency platform
• Work and Span
Thanks to Charles E. Leiserson for some of these slides.

Slide 2: Multicore Architecture
[Figure: a Chip Multiprocessor (CMP) — cores with private caches ($) connected by an on-chip network to memory and I/O.]

Slide 3: cc-NUMA Architectures
AMD 8-way Opteron Server ([email protected])
[Figure: each processor (CMP) has 2/4 cores, a memory bank local to that processor, and a point-to-point interconnect between processors.]

Slide 4: cc-NUMA Architectures
• No Front Side Bus
• Integrated memory controller
• On-die interconnect among CMPs
• Main memory is physically distributed among CMPs (i.e.
each piece of memory has an affinity to a CMP)
• NUMA: Non-Uniform Memory Access
For multi-socket servers only. Your desktop is safe (well, for now at least). Triton nodes are not NUMA either.

Slide 5: Desktop Multicores Today
This is your AMD Barcelona or Intel Core i7!
[Figure: on-die interconnect; private caches, so cache coherence is required.]

Slide 6: Multithreaded Programming
• POSIX Threads (Pthreads) is a set of threading interfaces developed by the IEEE.
• The "assembly language" of shared-memory programming.
• The programmer has to manually:
  create and terminate threads,
  wait for threads to complete,
  manage interaction between threads using mutexes, condition variables, etc.

Slide 7: Concurrency Platforms
• Programming directly on Pthreads is painful and error-prone.
• With Pthreads, you either sacrifice memory usage or load balance among processors.
• A concurrency platform provides linguistic support and handles load balancing.
• Examples: Threading Building Blocks (TBB), OpenMP, Cilk++.

Slide 8: Cilk vs Pthreads
How will the following code execute in Pthreads? In Cilk?

for (i=1; i<1000000000; i++) {
    spawn-or-fork foo(i);
}
sync-or-join;

What if foo contains code that waits (e.g., spins) on a variable being set by another instance of foo? The two have different liveness properties:
• Cilk threads are spawned lazily ("may" parallelism).
• Pthreads are spawned eagerly ("must" parallelism).

Slide 9: Cilk vs OpenMP
• Cilk++ guarantees space bounds: on P processors, Cilk++ uses no more than P times the stack space of a serial execution.
• Cilk++ has a solution for global variables (called "reducers" or "hyperobjects").
• Cilk++ has nested parallelism that works and provides guaranteed speedup; indeed, the Cilk scheduler is provably optimal.
• Cilk++ has a race detector (cilkscreen) for debugging and software release.
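Slide 6's point about manual thread management can be made concrete. Below is a minimal sketch, assuming a toy task (summing 0..n-1) split across workers; the names `Task`, `worker`, and `parallel_sum` are invented for the illustration, not from the slides. Everything here is bookkeeping that Cilk++ would automate: creating threads, joining them, and locking the shared accumulator.

```cpp
#include <pthread.h>
#include <vector>

// Hypothetical example: sum 0..n-1 across nthreads POSIX threads.
struct Task {
    long lo, hi;            // half-open range [lo, hi) to sum
    long long* total;       // shared accumulator
    pthread_mutex_t* lock;  // protects *total
};

static void* worker(void* arg) {
    Task* t = static_cast<Task*>(arg);
    long long partial = 0;
    for (long i = t->lo; i < t->hi; ++i) partial += i;
    pthread_mutex_lock(t->lock);   // manage interaction via a mutex
    *t->total += partial;
    pthread_mutex_unlock(t->lock);
    return nullptr;
}

long long parallel_sum(long n, int nthreads) {
    long long total = 0;
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    std::vector<pthread_t> tid(nthreads);
    std::vector<Task> tasks(nthreads);
    long chunk = n / nthreads;
    for (int k = 0; k < nthreads; ++k) {   // create threads by hand
        long hi = (k == nthreads - 1) ? n : (k + 1) * chunk;
        tasks[k] = Task{k * chunk, hi, &total, &lock};
        pthread_create(&tid[k], nullptr, worker, &tasks[k]);
    }
    for (int k = 0; k < nthreads; ++k)     // wait for completion by hand
        pthread_join(tid[k], nullptr);
    return total;
}
```

Compare with the slide-8 loop: here each spawned task costs a real OS thread created eagerly ("must" parallelism), which is why forking a billion of them is a non-starter in Pthreads.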
• Keep in mind that platform comparisons are (and always will be) subject to debate.

Slide 10: Complexity Measures
TP = execution time on P processors
T1 = work
T∞ = span (also called critical-path length or computational depth)
Work Law: TP ≥ T1/P
Span Law: TP ≥ T∞

Slide 11: Series Composition
Work: T1(A∪B) = T1(A) + T1(B)
Span: T∞(A∪B) = T∞(A) + T∞(B)

Slide 12: Parallel Composition
Work: T1(A∪B) = T1(A) + T1(B)
Span: T∞(A∪B) = max{T∞(A), T∞(B)}

Slide 13: Speedup
Def. T1/TP = speedup on P processors.
If T1/TP = Θ(P), we have linear speedup; if T1/TP = P, perfect linear speedup; if T1/TP > P, superlinear speedup, which is not possible in this performance model because of the Work Law TP ≥ T1/P.

Slide 14: Scheduling
• Cilk++ allows the programmer to express potential parallelism in an application.
• The Cilk++ scheduler maps strands onto processors dynamically at runtime.
• Since on-line schedulers are complicated, we'll explore the ideas with an off-line scheduler.
A strand is a sequence of instructions that doesn't contain any parallel constructs.

Slides 15-17: Greedy Scheduling
IDEA: Do as much as possible on every step.
Definition: A strand is ready if all its predecessors have executed.
Complete step (≥ P strands ready): run any P.
Incomplete step (< P strands ready): run all of them.
(The example figures use P = 3.)

Slide 18: Analysis of Greedy
Theorem: Any greedy scheduler achieves TP ≤ T1/P + T∞.
Proof.
• # complete steps ≤ T1/P, since each complete step performs P work.
• # incomplete steps ≤ T∞, since each incomplete step reduces the span of the unexecuted dag by 1.
■ (The figure uses P = 3.)

Slide 19: Optimality of Greedy
Corollary. Any greedy scheduler achieves within a factor of 2 of optimal.
Proof. Let TP* be the execution time produced by the optimal scheduler. Since TP* ≥ max{T1/P, T∞} by the Work and Span Laws, we have
TP ≤ T1/P + T∞ ≤ 2·max{T1/P, T∞} ≤ 2TP*. ■

Slide 20: Linear Speedup
Corollary. Any greedy scheduler achieves near-perfect linear speedup whenever P ≪ T1/T∞.
Proof. Since P ≪ T1/T∞ is equivalent to T∞ ≪ T1/P, the Greedy Scheduling Theorem gives us TP ≤ T1/P + T∞ ≈ T1/P. Thus, the speedup is T1/TP ≈ P. ■
Definition. The quantity T1/(P·T∞) is called the parallel slackness.

Slide 21: Cilk++ Runtime System
Each worker (processor) maintains a work deque of ready strands, and it manipulates the bottom of the deque like a stack.
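The deque discipline described on this slide can be sketched in a few lines. This is a simplified, single-threaded illustration of the idea only (the real Cilk++ runtime uses a lock-free protocol with concurrent thieves); the class and method names are made up for the example, and strands are stood in for by ints.

```cpp
#include <deque>

// Sketch of one worker's ready deque. The owning worker treats the
// bottom like a stack (LIFO); a thief steals the oldest strand from
// the opposite end, the top (FIFO).
class WorkDeque {
    std::deque<int> strands;
public:
    // Owner pushes newly spawned strands on the bottom.
    void push_bottom(int s) { strands.push_back(s); }

    // Owner pops its own work from the bottom, stack-style.
    bool pop_bottom(int* s) {
        if (strands.empty()) return false;
        *s = strands.back();
        strands.pop_back();
        return true;
    }

    // A thief steals from the top, taking the oldest strand.
    bool steal_top(int* s) {
        if (strands.empty()) return false;
        *s = strands.front();
        strands.pop_front();
        return true;
    }
};
```

Stealing from the top tends to grab large, old subcomputations, which keeps steals rare; this end-separation is central to Cilk's time and space guarantees.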
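The greedy-scheduling theorem (TP ≤ T1/P + T∞) can be sanity-checked by simulating a greedy scheduler on a small dag of unit-time strands. The function names and the dag below are invented for the illustration; `example_dag()` is a 7-strand diamond with work T1 = 7 and span T∞ = 4.

```cpp
#include <vector>
#include <algorithm>

// Greedy off-line scheduler on a dag of unit-time strands.
// preds[v] lists the predecessors of strand v. Returns the number of
// steps TP taken with P processors: each step runs up to P ready
// strands (all of them on an incomplete step).
int greedy_steps(const std::vector<std::vector<int>>& preds, int P) {
    int n = static_cast<int>(preds.size());
    std::vector<bool> done(n, false);
    int executed = 0, steps = 0;
    while (executed < n) {
        std::vector<int> ready;  // strands whose predecessors all ran
        for (int v = 0; v < n; ++v) {
            if (done[v]) continue;
            bool ok = true;
            for (int u : preds[v]) if (!done[u]) { ok = false; break; }
            if (ok) ready.push_back(v);
        }
        int run = std::min<int>(P, static_cast<int>(ready.size()));
        for (int k = 0; k < run; ++k) done[ready[k]] = true;  // "run any P"
        executed += run;
        ++steps;
    }
    return steps;
}

// Example dag: 0 fans out to 1 and 2; 1 to 3 and 4; 2 to 5;
// strand 6 joins 3, 4, 5. T1 = 7, T∞ = 4 (e.g., path 0-1-3-6).
std::vector<std::vector<int>> example_dag() {
    return { {}, {0}, {0}, {1}, {1}, {2}, {3, 4, 5} };
}
```

With P = 1 this dag takes 7 steps (= T1), with P = 2 it takes 5 steps, and with P = 3 it takes 4 steps (= T∞), all within the T1/P + T∞ bound.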