CS 240A: Shared Memory & Multicore Programming with Cilk++

Slide 1: Overview
• Multicore and NUMA architectures
• Multithreaded Programming
• Cilk++ as a concurrency platform
• Work and Span
Thanks to Charles E. Leiserson for some of these slides.

Slide 2: Multicore Architecture
[Figure: a Chip Multiprocessor (CMP) — cores with private caches ($) connected by an on-chip network to memory and I/O.]

Slide 3: cc-NUMA Architectures
AMD 8-way Opteron Server ([email protected])
[Figure: each processor (CMP) has 2/4 cores, a memory bank local to that processor, and a point-to-point interconnect between processors.]

Slide 4: cc-NUMA Architectures
• No Front Side Bus
• Integrated memory controller
• On-die interconnect among CMPs
• Main memory is physically distributed among CMPs (i.e.
each piece of memory has an affinity to a CMP)
• NUMA: Non-Uniform Memory Access
For multi-socket servers only. Your desktop is safe (well, for now at least). Triton nodes are not NUMA either.

Slide 5: Desktop Multicores Today
This is your AMD Barcelona or Intel Core i7!
[Figure: on-die interconnect; private caches, so cache coherence is required.]

Slide 6: Multithreaded Programming
• POSIX Threads (Pthreads) is a set of threading interfaces developed by the IEEE.
• The "assembly language" of shared-memory programming.
• The programmer has to manually:
  create and terminate threads,
  wait for threads to complete,
  manage interaction between threads using mutexes, condition variables, etc.

Slide 7: Concurrency Platforms
• Programming directly on Pthreads is painful and error-prone.
• With Pthreads, you either sacrifice memory usage or load balance among processors.
• A concurrency platform provides linguistic support and handles load balancing.
• Examples: Threading Building Blocks (TBB), OpenMP, Cilk++.

Slide 8: Cilk vs Pthreads
How will the following code execute in Pthreads? In Cilk?

for (i=1; i<1000000000; i++) {
    spawn-or-fork foo(i);
}
sync-or-join;

What if foo contains code that waits (e.g., spins) on a variable being set by another instance of foo? The two have different liveness properties:
• Cilk threads are spawned lazily ("may" parallelism).
• Pthreads are spawned eagerly ("must" parallelism).

Slide 9: Cilk vs OpenMP
• Cilk++ guarantees space bounds: on P processors, Cilk++ uses no more than P times the stack space of a serial execution.
• Cilk++ has a solution for global variables (called "reducers" or "hyperobjects").
• Cilk++ has nested parallelism that works and provides guaranteed speedup; indeed, the Cilk scheduler is provably optimal.
• Cilk++ has a race detector (cilkscreen) for debugging and software release.
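Slide 6's point about manual thread management can be made concrete. Below is a minimal sketch, assuming a toy task (summing 0..n-1) split across workers; the names `Task`, `worker`, and `parallel_sum` are invented for the illustration, not from the slides. Everything here is bookkeeping that Cilk++ would automate: creating threads, joining them, and locking the shared accumulator.

```cpp
#include <pthread.h>
#include <vector>

// Hypothetical example: sum 0..n-1 across nthreads POSIX threads.
struct Task {
    long lo, hi;            // half-open range [lo, hi) to sum
    long long* total;       // shared accumulator
    pthread_mutex_t* lock;  // protects *total
};

static void* worker(void* arg) {
    Task* t = static_cast<Task*>(arg);
    long long partial = 0;
    for (long i = t->lo; i < t->hi; ++i) partial += i;
    pthread_mutex_lock(t->lock);   // manage interaction via a mutex
    *t->total += partial;
    pthread_mutex_unlock(t->lock);
    return nullptr;
}

long long parallel_sum(long n, int nthreads) {
    long long total = 0;
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    std::vector<pthread_t> tid(nthreads);
    std::vector<Task> tasks(nthreads);
    long chunk = n / nthreads;
    for (int k = 0; k < nthreads; ++k) {   // create threads by hand
        long hi = (k == nthreads - 1) ? n : (k + 1) * chunk;
        tasks[k] = Task{k * chunk, hi, &total, &lock};
        pthread_create(&tid[k], nullptr, worker, &tasks[k]);
    }
    for (int k = 0; k < nthreads; ++k)     // wait for completion by hand
        pthread_join(tid[k], nullptr);
    return total;
}
```

Compare with the slide-8 loop: here each spawned task costs a real OS thread created eagerly ("must" parallelism), which is why forking a billion of them is a non-starter in Pthreads.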
• Keep in mind that platform comparisons are (and always will be) subject to debate.

Slide 10: Complexity Measures
TP = execution time on P processors
T1 = work
T∞ = span (also called critical-path length or computational depth)
Work Law: TP ≥ T1/P
Span Law: TP ≥ T∞

Slide 11: Series Composition
Work: T1(A∪B) = T1(A) + T1(B)
Span: T∞(A∪B) = T∞(A) + T∞(B)

Slide 12: Parallel Composition
Work: T1(A∪B) = T1(A) + T1(B)
Span: T∞(A∪B) = max{T∞(A), T∞(B)}

Slide 13: Speedup
Def. T1/TP = speedup on P processors.
If T1/TP = Θ(P), we have linear speedup; if T1/TP = P, perfect linear speedup; if T1/TP > P, superlinear speedup, which is not possible in this performance model because of the Work Law TP ≥ T1/P.

Slide 14: Scheduling
• Cilk++ allows the programmer to express potential parallelism in an application.
• The Cilk++ scheduler maps strands onto processors dynamically at runtime.
• Since on-line schedulers are complicated, we'll explore the ideas with an off-line scheduler.
A strand is a sequence of instructions that doesn't contain any parallel constructs.

Slides 15-17: Greedy Scheduling
IDEA: Do as much as possible on every step.
Definition: A strand is ready if all its predecessors have executed.
Complete step (≥ P strands ready): run any P.
Incomplete step (< P strands ready): run all of them.
(The example figures use P = 3.)

Slide 18: Analysis of Greedy
Theorem: Any greedy scheduler achieves TP ≤ T1/P + T∞.
Proof.
• # complete steps ≤ T1/P, since each complete step performs P work.
• # incomplete steps ≤ T∞, since each incomplete step reduces the span of the unexecuted dag by 1.
■ (The figure uses P = 3.)

Slide 19: Optimality of Greedy
Corollary. Any greedy scheduler achieves within a factor of 2 of optimal.
Proof. Let TP* be the execution time produced by the optimal scheduler. Since TP* ≥ max{T1/P, T∞} by the Work and Span Laws, we have
TP ≤ T1/P + T∞ ≤ 2·max{T1/P, T∞} ≤ 2TP*. ■

Slide 20: Linear Speedup
Corollary. Any greedy scheduler achieves near-perfect linear speedup whenever P ≪ T1/T∞.
Proof. Since P ≪ T1/T∞ is equivalent to T∞ ≪ T1/P, the Greedy Scheduling Theorem gives us TP ≤ T1/P + T∞ ≈ T1/P. Thus, the speedup is T1/TP ≈ P. ■
Definition. The quantity T1/(P·T∞) is called the parallel slackness.

Slide 21: Cilk++ Runtime System
Each worker (processor) maintains a work deque of ready strands, and it manipulates the bottom of the deque like a stack.
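The deque discipline described on this slide can be sketched in a few lines. This is a simplified, single-threaded illustration of the idea only (the real Cilk++ runtime uses a lock-free protocol with concurrent thieves); the class and method names are made up for the example, and strands are stood in for by ints.

```cpp
#include <deque>

// Sketch of one worker's ready deque. The owning worker treats the
// bottom like a stack (LIFO); a thief steals the oldest strand from
// the opposite end, the top (FIFO).
class WorkDeque {
    std::deque<int> strands;
public:
    // Owner pushes newly spawned strands on the bottom.
    void push_bottom(int s) { strands.push_back(s); }

    // Owner pops its own work from the bottom, stack-style.
    bool pop_bottom(int* s) {
        if (strands.empty()) return false;
        *s = strands.back();
        strands.pop_back();
        return true;
    }

    // A thief steals from the top, taking the oldest strand.
    bool steal_top(int* s) {
        if (strands.empty()) return false;
        *s = strands.front();
        strands.pop_front();
        return true;
    }
};
```

Stealing from the top tends to grab large, old subcomputations, which keeps steals rare; this end-separation is central to Cilk's time and space guarantees.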
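The greedy-scheduling theorem (TP ≤ T1/P + T∞) can be sanity-checked by simulating a greedy scheduler on a small dag of unit-time strands. The function names and the dag below are invented for the illustration; `example_dag()` is a 7-strand diamond with work T1 = 7 and span T∞ = 4.

```cpp
#include <vector>
#include <algorithm>

// Greedy off-line scheduler on a dag of unit-time strands.
// preds[v] lists the predecessors of strand v. Returns the number of
// steps TP taken with P processors: each step runs up to P ready
// strands (all of them on an incomplete step).
int greedy_steps(const std::vector<std::vector<int>>& preds, int P) {
    int n = static_cast<int>(preds.size());
    std::vector<bool> done(n, false);
    int executed = 0, steps = 0;
    while (executed < n) {
        std::vector<int> ready;  // strands whose predecessors all ran
        for (int v = 0; v < n; ++v) {
            if (done[v]) continue;
            bool ok = true;
            for (int u : preds[v]) if (!done[u]) { ok = false; break; }
            if (ok) ready.push_back(v);
        }
        int run = std::min<int>(P, static_cast<int>(ready.size()));
        for (int k = 0; k < run; ++k) done[ready[k]] = true;  // "run any P"
        executed += run;
        ++steps;
    }
    return steps;
}

// Example dag: 0 fans out to 1 and 2; 1 to 3 and 4; 2 to 5;
// strand 6 joins 3, 4, 5. T1 = 7, T∞ = 4 (e.g., path 0-1-3-6).
std::vector<std::vector<int>> example_dag() {
    return { {}, {0}, {0}, {1}, {1}, {2}, {3, 4, 5} };
}
```

With P = 1 this dag takes 7 steps (= T1), with P = 2 it takes 5 steps, and with P = 3 it takes 4 steps (= T∞), all within the T1/P + T∞ bound.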