CS 240A: Shared Memory & Multicore Programming with Cilk++

∙ Multicore and NUMA architectures
∙ Multithreaded programming
∙ Cilk++ as a concurrency platform
∙ Work and span

Thanks to Charles E. Leiserson for some of these slides.

Full deck outline: Multicore Architecture; cc-NUMA Architectures; Desktop Multicores Today; Multithreaded Programming; Concurrency Platforms; Cilk vs PThreads; Cilk vs OpenMP; Complexity Measures; Series Composition; Parallel Composition; Speedup; Scheduling; Greedy Scheduling; Analysis of Greedy; Optimality of Greedy; Linear Speedup; Cilk++ Runtime System; Great, how do we program it?; Nested Parallelism; Cilk++ Loops; Serial Correctness; Serialization; Parallel Correctness; Race Bugs; Types of Races; Avoiding Races; Cilk++ Reducers; Hyperobjects under the covers; Cilkscreen; Parallelism; Three Tips on Parallelism; Three Tips on Overheads.

Multicore Architecture

[Figure: a chip multiprocessor (CMP): multiple cores, each with a private cache ($), connected by an on-chip network to memory and I/O.]

cc-NUMA Architectures

[Figure: an AMD 8-way Opteron server.]
∙ Each processor is a CMP with 2/4 cores
∙ Each memory bank is local to one processor
∙ Point-to-point interconnect between sockets

cc-NUMA Architectures (continued)

∙ No front-side bus
∙ Integrated memory controller
∙ On-die interconnect among CMPs
∙ Main memory is physically distributed among CMPs (i.e., each piece of memory has an affinity to a CMP)
∙ NUMA: Non-Uniform Memory Access. This applies to multi-socket servers only; your desktop is safe (well, for now at least), and the Triton nodes are not NUMA either.

Desktop Multicores Today

This is your AMD Barcelona or Intel Core i7!
∙ On-die interconnect
∙ Private caches: cache coherence is required

Multithreaded Programming

∙ POSIX Threads (Pthreads) is a set of threading interfaces developed by the IEEE
∙ It is the "assembly language" of shared-memory programming
∙ The programmer has to manually:
  - create and terminate threads
  - wait for threads to complete
  - manage interaction between threads using mutexes, condition variables, etc.

Concurrency Platforms

∙ Programming directly on Pthreads is painful and error-prone.
∙ With Pthreads, you either sacrifice memory usage or load balance among processors.
∙ A concurrency platform provides linguistic support and handles load balancing for you.
∙ Examples: Threading Building Blocks (TBB), OpenMP, Cilk++.

Cilk vs PThreads

How will the following code execute in PThreads? In Cilk? (A concrete Cilk version of this pattern appears after the next slide.)

    for (i = 1; i < 1000000000; i++) {
        spawn-or-fork foo(i);
    }
    sync-or-join;

What if foo contains code that waits (e.g., spins) on a variable being set by another instance of foo? The two models have different liveness properties:
∙ Cilk threads are spawned lazily: "may" parallelism
∙ PThreads are spawned eagerly: "must" parallelism

Cilk vs OpenMP

∙ Cilk++ guarantees space bounds: on P processors, Cilk++ uses no more than P times the stack space of a serial execution.
∙ Cilk++ has a solution for global variables (called "reducers" / "hyperobjects").
∙ Cilk++ has nested parallelism that works and provides guaranteed speed-up; indeed, the Cilk scheduler is provably optimal.
∙ Cilk++ has a race detector (cilkscreen) for debugging and software release.
∙ Keep in mind that platform comparisons are (and always will be) subject to debate.
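To make the spawn/sync pattern above concrete, here is a minimal sketch of nested fork-join parallelism in Cilk: the classic recursive fib. The header path and keyword spellings follow the later Cilk Plus/OpenCilk convention (<cilk/cilk.h>, cilk_spawn, cilk_sync), so treat it as an illustrative sketch rather than exact Cilk++ course syntax.

    #include <cstdio>
    #include <cilk/cilk.h>   // cilk_spawn / cilk_sync keywords

    // Classic recursive Fibonacci: the canonical example of
    // nested fork-join parallelism in Cilk.
    long fib(long n) {
        if (n < 2) return n;
        long x = cilk_spawn fib(n - 1);  // child MAY run in parallel
        long y = fib(n - 2);             // parent continues meanwhile
        cilk_sync;                       // join all children spawned here
        return x + y;
    }

    int main() {
        std::printf("fib(30) = %ld\n", fib(30));
        return 0;
    }

Deleting cilk_spawn and cilk_sync leaves an ordinary serial C++ program that computes the same result (the serialization). That is exactly what "may" parallelism means: the scheduler is free to run every spawn on a single worker.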
Complexity Measures

∙ T_P = execution time on P processors
∙ T_1 = work
∙ T_∞ = span (also called critical-path length or computational depth)

Work Law: T_P ≥ T_1/P
Span Law: T_P ≥ T_∞

Series Composition

[Figure: computations A and B composed in series.]
∙ Work: T_1(A ∪ B) = T_1(A) + T_1(B)
∙ Span: T_∞(A ∪ B) = T_∞(A) + T_∞(B)

Parallel Composition

[Figure: computations A and B composed in parallel.]
∙ Work: T_1(A ∪ B) = T_1(A) + T_1(B)
∙ Span: T_∞(A ∪ B) = max{T_∞(A), T_∞(B)}

Speedup

Def. T_1/T_P = speedup on P processors.
∙ If T_1/T_P = Θ(P), we have linear speedup;
∙ if T_1/T_P = P, we have perfect linear speedup;
∙ if T_1/T_P > P, we have superlinear speedup, which is not possible in this performance model because of the Work Law T_P ≥ T_1/P.

Scheduling

∙ Cilk++ allows the programmer to express potential parallelism in an application.
∙ The Cilk++ scheduler maps strands onto processors dynamically at runtime. (A strand is a sequence of instructions that doesn't contain any parallel constructs.)
∙ Since on-line schedulers are complicated, we'll explore the ideas with an off-line scheduler.
[Figure: multiple processors, each with a private cache ($), connected by a network to memory and I/O.]

Greedy Scheduling

IDEA: Do as much as possible on every step.
Definition: A strand is ready if all its predecessors have executed.
∙ Complete step: ≥ P strands are ready; run any P of them.
∙ Incomplete step: < P strands are ready; run all of them.
[Figure: a computation dag scheduled step by step with P = 3.]

Analysis of Greedy

Theorem. Any greedy scheduler achieves T_P ≤ T_1/P + T_∞.
Proof. The number of complete steps is at most T_1/P, since each complete step performs P work. The number of incomplete steps is at most T_∞, since each incomplete step reduces the span of the unexecuted dag by 1. ■

Optimality of Greedy

Corollary. Any greedy scheduler is within a factor of 2 of optimal.
Proof. Let T_P* be the execution time produced by the optimal scheduler. Since T_P* ≥ max{T_1/P, T_∞} by the Work and Span Laws, we have
    T_P ≤ T_1/P + T_∞ ≤ 2·max{T_1/P, T_∞} ≤ 2·T_P*. ■

Linear Speedup

Corollary. Any greedy scheduler achieves near-perfect linear speedup whenever P ≪ T_1/T_∞.
Proof. Since P ≪ T_1/T_∞ is equivalent to T_∞ ≪ T_1/P, the Greedy Scheduling Theorem gives us T_P ≤ T_1/P + T_∞ ≈ T_1/P. Thus, the speedup is T_1/T_P ≈ P. ■
Definition. The quantity T_1/(P·T_∞) is called the parallel slackness. (A worked numeric example applying these bounds appears at the end of these notes.)

Cilk++ Runtime System

Each worker (processor) maintains a work deque of ready strands, and it manipulates the bottom of the deque like a stack.
[Figure sequence: several workers' deques evolving over time; each Call! and each Spawn! pushes a new frame onto the bottom of the acting worker's deque.]
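The deque discipline just described can be illustrated without real threads. The toy model below is our own single-threaded sketch, not the actual runtime (the real scheduler uses a carefully engineered concurrent protocol, and the frame names here are made up): the owning worker pushes and pops frames at the bottom of its deque exactly like a call stack, while an idle worker (a thief) steals the oldest frame from the top.

    #include <cstdio>
    #include <deque>
    #include <string>

    // Toy, single-threaded model of one worker's ready deque.
    // Convention in this sketch: bottom of the deque = back, top = front.
    int main() {
        std::deque<std::string> dq;

        // Both calls and spawns push a frame onto the bottom:
        dq.push_back("main()");
        dq.push_back("fib(n-1)  [spawned]");
        dq.push_back("fib(n-2)  [called]");

        // The owning worker works at the bottom, like a call stack:
        std::printf("worker pops:  %s\n", dq.back().c_str());
        dq.pop_back();

        // An idle worker (a thief) steals the oldest frame from the top:
        std::printf("thief steals: %s\n", dq.front().c_str());
        dq.pop_front();
        return 0;
    }

Stealing from the opposite end of the deque is the key design choice: the owner and thieves rarely contend, and a thief tends to grab the largest pending chunk of work.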
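As a closing worked example (our own, not from the slides), the sketch below computes the work and span of the parallel fib dag using the series/parallel composition rules, assuming unit cost per strand, and then evaluates the greedy-scheduling bound T_P ≤ T_1/P + T_∞ for a hypothetical P = 8.

    #include <algorithm>
    #include <cstdio>

    // Count unit-cost strands in the computation dag of parallel fib.
    // work(n) = T_1: total strands (series composition adds work).
    long long work(int n) {
        if (n < 2) return 1;
        return work(n - 1) + work(n - 2) + 1;
    }
    // span(n) = T_inf: longest path (parallel branches take the max).
    long long span(int n) {
        if (n < 2) return 1;
        return std::max(span(n - 1), span(n - 2)) + 1;
    }

    int main() {
        int n = 30, P = 8;                 // illustrative values
        long long t1 = work(n), tinf = span(n);
        std::printf("T_1 = %lld, T_inf = %lld, parallelism = %lld\n",
                    t1, tinf, t1 / tinf);
        // Greedy bound: T_P <= T_1/P + T_inf
        std::printf("greedy bound for P = %d: T_P <= %lld\n",
                    P, t1 / P + tinf);
        return 0;
    }

For fib(30) the parallelism T_1/T_∞ comes out in the tens of thousands, so the parallel slackness T_1/(P·T_∞) is huge for any realistic P, and by the Linear Speedup corollary a greedy scheduler should deliver near-perfect linear speedup.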