Dr. Bradley Kuszmaul, MIT. 6.189 IAP 2007 MIT6.189 IAP 2007Lecture 15Cilk© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 1Design and Analysis of Dynamic Multithreaded AlgorithmsBradley C. KuszmaulMIT Computer Science and Artificial Intelligence Laboratory© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 2Shared-Memory Multiprocessor•Symmetric multiprocessor (SMP)Symmetric multiprocessor (SMP)•Cache-coherent nonuniform memory Cache-coherent nonuniform memory architecture (CC-NUMA)architecture (CC-NUMA)P P PShared Network…Memory I/O$ $ $© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 3A C language for dynamic multithreading with a provably good runtime system.CilkPlatforms•Sun UltraSPARC Enterprise•SGI Origin 2000•Compaq/Digital Alphaserver•Intel Pentium SMP’sApplications•virus shell assembly•graphics rendering•n-body simulationSocrates and CilkchessCilk automatically manages low-level aspects of parallel execution, including protocols, load balancing, and scheduling.© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 4Fibonacciint fib (int n) {if (n<2) return (n); else { int x,y; x = fib(n-1); y = fib(n-2); return (x+y); }}C elisioncilkcilk int fib (int n) { if (n<2) return (n); else { int x,y; x = spawnspawn fib(n-1); y = spawnspawn fib(n-2); sync;sync; return (x+y); }}Cilk codeCilk is a faithfulfaithful extension of C. A Cilk program’s serial elisionserial elision is always a legal implementation of Cilk semantics. Cilk provides nono new data types.© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 5cilk int fib (int n) { if (n<2) return (n); else { int x,y; x = spawnspawn fib(n-1); y = spawnspawn fib(n-2); sync;sync; return (x+y); }}Dynamic MultithreadingThe computation computation dagdag unfolds dynamically.“Processor oblivious.”© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 6Cactus StackBACEDAABACACDACEViews of stackCBADECilk supports C’s rule for pointers: A pointer to stack space can be passed from parent to child, but not from child to parent. (Cilk also supports malloc.)Cilk’s cactus stack supports several views in parallel.© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 7Advanced Features•Returned values can be incorporated into the parent frame using a delayed internal function called an inlet:•Within an inlet, the abort keyword causes all other children of the parent frame to be terminated.•The SYNCHED pseudovariable tests whether a sync would succeed.•A Cilk library provides mutex locks for atomicity.int y;inlet void foo (int x) { if (x > y) y = x;}…spawn foo(bar(z));© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 8Debugging SupportThe NondeterminatorNondeterminator debugging tool detects and localizes data-race bugs.FAILInformation localizing a data race.PASSEvery scheduling produces the same result.Input data set“Abelian”Cilk programA datadata racerace occurs whenever a thread modifies a location and another thread, holding no locks in common, accesses the location simultaneously.© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 9Outline•Theory and Practice•A Chess Lesson•Fun with Algorithms•Work Stealing•Opinion & Conclusion© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 10Algorithmic Complexity MeasuresTP = execution time on P processors© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 11Algorithmic Complexity MeasuresTP = execution time on P processorsT1 = workwork© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 12Algorithmic Complexity MeasuresTP = execution time on P processorsT1 = workworkT∞ = critical pathcritical path© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 13Algorithmic Complexity MeasuresTP = execution time on P processors•TP ≥ T1/P•TP ≥ T∞Lower BoundsT1 = workworkT∞ = critical pathcritical path© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 14Algorithmic Complexity MeasuresTP = execution time on P processors•TP ≥ T1/P•TP ≥ T∞Lower BoundsT1 = workworkT∞ = critical pathcritical pathT1/TP = speedupspeedupT1/T∞ = parallelismparallelism© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 15Greedy SchedulingTheorem [Graham & Brent]: There exists an execution with TP ≤ T1/P + T∞ .© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 16Greedy SchedulingTheorem [Graham & Brent]: There exists an execution with TP ≤ T1/P + T∞ .Proof. At each time step, … P = 3© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 17Greedy SchedulingTheorem [Graham & Brent]: There exists an execution with TP ≤ T1/P + T∞ .Proof. At each time step, if at least P tasks are ready, …P = 3© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 18Greedy SchedulingTheorem [Graham & Brent]: There exists an execution with TP ≤ T1/P + T∞ .Proof. At each time step, if at least P tasks are ready, execute P of them. P = 3© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 19Greedy SchedulingTheorem [Graham & Brent]: There exists an execution with TP ≤ T1/P + T∞ .If fewer than P tasks are ready, …Proof. At each time step, if at least P tasks are ready, execute P of them. P = 3© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 20Greedy SchedulingTheorem [Graham & Brent]: There exists an execution with TP ≤ T1/P + T∞ .If fewer than P tasks are ready, execute all of them.Proof. At each time step, if at least P tasks are ready, execute P of them. P = 3© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 21Greedy SchedulingTheorem [Graham & Brent]: There exists an execution with TP ≤ T1/P + T∞ .If fewer than P tasks are ready, execute all of them.Proof. At each time step, if at least P tasks are ready, execute P of them. P = 3Corollary: Linear speed-up when P ≤ T1/T∞ .© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 22Cilk PerformanceCilk’s “work-stealing”“work-stealing” scheduler achieves•TP = T1/P + O(T∞) expected time (provably);•TP ≈ T1/P + T∞ time (empirically).Near-perfect linear speedup if P ≤ T1/T∞.Instrumentation in Cilk provides accurate measures of T1 and T∞ to the user.The average cost of a spawnspawn in Cilk-5 is only 2–6 times the cost of an ordinary C function call, depending on the platform.© 2001–4 by Charles E. Leiserson and
View Full Document