CS 240A: Shared Memory & Multicore Programming with Cilk++

Slide outline:
∙ Multicore Architecture
∙ cc-NUMA Architectures
∙ Desktop Multicores Today
∙ Multithreaded Programming
∙ Concurrency Platforms
∙ Cilk vs PThreads
∙ Cilk vs OpenMP
∙ Complexity Measures
∙ Series Composition
∙ Parallel Composition
∙ Speedup
∙ Scheduling
∙ Greedy Scheduling
∙ Analysis of Greedy
∙ Optimality of Greedy
∙ Linear Speedup
∙ Cilk++ Runtime System
∙ Great, how do we program it?
∙ Nested Parallelism
∙ Cilk++ Loops
∙ Serial Correctness
∙ Serialization
∙ Parallel Correctness
∙ Race Bugs
∙ Types of Races
∙ Avoiding Races
∙ Cilk++ Reducers
∙ Hyperobjects under the covers
∙ Cilkscreen
∙ Parallelism
∙ Three Tips on Parallelism
∙ Three Tips on Overheads

CS 240A: Shared Memory & Multicore Programming with Cilk++
∙ Multicore and NUMA architectures
∙ Multithreaded programming
∙ Cilk++ as a concurrency platform
∙ Work and span
Thanks to Charles E. Leiserson for some of these slides.

Multicore Architecture
[Slide figure: a chip multiprocessor (CMP): multiple cores, each with a private cache ($), connected by an on-chip network to memory and I/O.]

cc-NUMA Architectures
Example: an AMD 8-way Opteron server ([email protected]).
∙ Each processor (CMP) has 2 or 4 cores.
∙ A memory bank is local to each processor.
∙ Processors communicate over a point-to-point interconnect.
∙ No front-side bus.
∙ Integrated memory controller.
∙ On-die interconnect among CMPs.
∙ Main memory is physically distributed among the CMPs (i.e., each piece of memory has an affinity to a CMP).
∙ NUMA: non-uniform memory access.
  - For multi-socket servers only; your desktop is safe (well, for now at least).
  - Triton nodes are not NUMA either.

Desktop Multicores Today
This is your AMD Barcelona or Intel Core i7!
∙ On-die interconnect.
∙ Private caches: cache coherence is required.

Multithreaded Programming
∙ POSIX Threads (Pthreads) is a set of threading interfaces developed by the IEEE.
∙ It is the "assembly language" of shared-memory programming.
∙ The programmer has to manually (see the Pthreads sketch after the Cilk vs OpenMP slide below):
  - create and terminate threads,
  - wait for threads to complete,
  - manage interaction between threads using mutexes, condition variables, etc.

Concurrency Platforms
∙ Programming directly on Pthreads is painful and error-prone.
∙ With Pthreads, you must either sacrifice memory usage or load balance among processors.
∙ A concurrency platform provides linguistic support and handles load balancing.
∙ Examples:
  - Threading Building Blocks (TBB)
  - OpenMP
  - Cilk++

Cilk vs PThreads
How will the following code execute in Pthreads? In Cilk? (A Cilk++ spelling of this loop is sketched below.)

    for (i = 1; i < 1000000000; i++) {
        spawn-or-fork foo(i);
    }
    sync-or-join;

What if foo contains code that waits (e.g., spins) on a variable being set by another instance of foo? The two have different liveness properties:
∙ Cilk threads are spawned lazily: "may" parallelism.
∙ Pthreads are spawned eagerly: "must" parallelism.

Cilk vs OpenMP
∙ Cilk++ guarantees space bounds: on P processors, Cilk++ uses no more than P times the stack space of a serial execution.
∙ Cilk++ has a solution for global variables (called "reducers" / "hyperobjects").
∙ Cilk++ has nested parallelism that works and provides guaranteed speedup; indeed, the Cilk scheduler is provably optimal.
∙ Cilk++ has a race detector (Cilkscreen) for debugging and software release.
∙ Keep in mind that platform comparisons are (and always will be) subject to debate.
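As a concrete illustration of the manual bookkeeping Pthreads requires, here is a minimal sketch; the function name foo and the iteration count are placeholders, not taken from the slides. The programmer creates every thread and waits for each one explicitly.

    #include <pthread.h>
    #include <cstdio>

    // Placeholder per-thread work, standing in for the slides' foo(i).
    void *foo(void *arg) {
        long i = reinterpret_cast<long>(arg);
        std::printf("working on %ld\n", i);
        return nullptr;
    }

    int main() {
        const long N = 4;
        pthread_t tid[N];
        // Create threads by hand ...
        for (long i = 0; i < N; ++i)
            pthread_create(&tid[i], nullptr, foo, reinterpret_cast<void *>(i));
        // ... and wait for each one to complete by hand.
        for (long i = 0; i < N; ++i)
            pthread_join(tid[i], nullptr);
        return 0;
    }

Any interaction between the threads would additionally need mutexes or condition variables, also managed by hand.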
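For comparison, here is how the spawn-or-fork loop from the Cilk vs PThreads slide reads with the Cilk++ keywords. This is a sketch only: foo(int) is assumed to be defined elsewhere, and the header name differs across dialects (Cilk++ shipped cilk.h, while the later Intel Cilk Plus spells it cilk/cilk.h).

    #include <cilk/cilk.h>   // cilk_spawn / cilk_sync keywords (Cilk Plus spelling)

    void foo(int i);         // assumed to be defined elsewhere

    void spawn_all() {
        for (int i = 1; i < 1000000000; ++i) {
            cilk_spawn foo(i);   // "may" parallelism: an idle worker may steal this
        }
        cilk_sync;               // wait for every spawned foo to finish
    }

Each cilk_spawn only records an opportunity for parallelism; if no worker steals it, foo(i) simply runs serially, which is why the lazy "may"-parallel semantics scales where a billion eagerly created Pthreads would not. (A cilk_for loop would be the more idiomatic way to write this particular pattern.)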
Complexity Measures
∙ TP = execution time on P processors.
∙ T1 = work.
∙ T∞ = span (also called critical-path length or computational depth).
∙ Work Law: TP ≥ T1/P.
∙ Span Law: TP ≥ T∞.

Series Composition
If A and B are composed in series:
∙ Work: T1(A∪B) = T1(A) + T1(B)
∙ Span: T∞(A∪B) = T∞(A) + T∞(B)

Parallel Composition
If A and B are composed in parallel:
∙ Work: T1(A∪B) = T1(A) + T1(B)
∙ Span: T∞(A∪B) = max{T∞(A), T∞(B)}

Speedup
Def. T1/TP = speedup on P processors.
∙ If T1/TP = Θ(P), we have linear speedup.
∙ If T1/TP = P, we have perfect linear speedup.
∙ If T1/TP > P, we have superlinear speedup, which is not possible in this performance model because of the Work Law TP ≥ T1/P.

Scheduling
∙ Cilk++ allows the programmer to express potential parallelism in an application.
∙ The Cilk++ scheduler maps strands onto processors dynamically at runtime. (A strand is a sequence of instructions that doesn't contain any parallel constructs.)
∙ Since on-line schedulers are complicated, we'll explore the ideas with an off-line scheduler.

Greedy Scheduling
IDEA: Do as much as possible on every step.
Definition: A strand is ready if all its predecessors have executed.
∙ Complete step: ≥ P strands are ready. Run any P of them.
∙ Incomplete step: < P strands are ready. Run all of them.
(The slides illustrate both cases with P = 3.)

Analysis of Greedy
Theorem: Any greedy scheduler achieves TP ≤ T1/P + T∞.
Proof.
∙ The number of complete steps is at most T1/P, since each complete step performs P work.
∙ The number of incomplete steps is at most T∞, since each incomplete step reduces the span of the unexecuted dag by 1. ■

Optimality of Greedy
Corollary. Any greedy scheduler achieves within a factor of 2 of optimal.
Proof. Let TP* be the execution time produced by the optimal scheduler. Since TP* ≥ max{T1/P, T∞} by the Work and Span Laws, we have
    TP ≤ T1/P + T∞ ≤ 2⋅max{T1/P, T∞} ≤ 2⋅TP*. ■

Linear Speedup
Corollary. Any greedy scheduler achieves near-perfect linear speedup whenever P ≪ T1/T∞.
Proof. Since P ≪ T1/T∞ is equivalent to T∞ ≪ T1/P, the Greedy Scheduling Theorem gives us
    TP ≤ T1/P + T∞ ≈ T1/P.
Thus, the speedup is T1/TP ≈ P. ■
Definition. The quantity T1/(P⋅T∞) is called the parallel slackness.

Cilk++ Runtime System
∙ Each worker (processor) maintains a work deque of ready strands, and it manipulates the bottom of the deque like a stack.
∙ [Slide figures: a sequence of Call! and Spawn! steps, each pushing a new frame onto the bottom of the executing worker's deque.]
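A quick worked example of the bounds above, with illustrative numbers that are not from the slides: suppose T1 = 1024 and T∞ = 8. On P = 32 processors, the Work and Span Laws give TP ≥ max{1024/32, 8} = 32, so even the optimal scheduler needs at least 32 time units, while the greedy bound guarantees TP ≤ 1024/32 + 8 = 40, within a factor of 2 of optimal and a speedup of at least 1024/40 ≈ 25.6. The parallel slackness is T1/(P⋅T∞) = 1024/(32⋅8) = 4; with more slackness, say P = 8 (slackness 16), the bound becomes TP ≤ 128 + 8 = 136 and the guaranteed speedup 1024/136 ≈ 7.5 is nearly perfect.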
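To make work and span concrete, here is the classic divide-and-conquer example commonly paired with these definitions; it does not appear in this preview, so treat it as an illustrative sketch in the same Cilk++ style as the earlier loop.

    #include <cilk/cilk.h>   // cilk_spawn / cilk_sync, as in the earlier sketch
    #include <cstdint>

    int64_t fib(int64_t n) {
        if (n < 2) return n;
        int64_t x = cilk_spawn fib(n - 1);  // child strand, may run in parallel
        int64_t y = fib(n - 2);             // the parent continues with this call
        cilk_sync;                          // both strands finish before the add
        return x + y;
    }
    // Work T1: the total number of strands, which grows exponentially in n.
    // Span T∞: the longest chain of dependent strands, which grows only linearly in n.
    // The parallelism T1/T∞ is therefore huge for modest n, so by the Linear Speedup
    // corollary a greedy scheduler gets near-perfect speedup whenever P ≪ T1/T∞.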

