MIT 6 189 - Design and Analysis of Dynamic Multithreaded Algorithms

Unformatted text preview:

Dr. Bradley Kuszmaul, MIT. 6.189 IAP 2007 MIT6.189 IAP 2007Lecture 15Cilk© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 1Design and Analysis of Dynamic Multithreaded AlgorithmsBradley C. KuszmaulMIT Computer Science and Artificial Intelligence Laboratory© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 2Shared-Memory Multiprocessor•Symmetric multiprocessor (SMP)Symmetric multiprocessor (SMP)•Cache-coherent nonuniform memory Cache-coherent nonuniform memory architecture (CC-NUMA)architecture (CC-NUMA)P P PShared Network…Memory I/O$ $ $© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 3A C language for dynamic multithreading with a provably good runtime system.CilkPlatforms•Sun UltraSPARC Enterprise•SGI Origin 2000•Compaq/Digital Alphaserver•Intel Pentium SMP’sApplications•virus shell assembly•graphics rendering•n-body simulationSocrates and CilkchessCilk automatically manages low-level aspects of parallel execution, including protocols, load balancing, and scheduling.© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 4Fibonacciint fib (int n) {if (n<2) return (n); else { int x,y; x = fib(n-1); y = fib(n-2); return (x+y); }}C elisioncilkcilk int fib (int n) { if (n<2) return (n); else { int x,y; x = spawnspawn fib(n-1); y = spawnspawn fib(n-2); sync;sync; return (x+y); }}Cilk codeCilk is a faithfulfaithful extension of C. A Cilk program’s serial elisionserial elision is always a legal implementation of Cilk semantics. Cilk provides nono new data types.© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 5cilk int fib (int n) { if (n<2) return (n); else { int x,y; x = spawnspawn fib(n-1); y = spawnspawn fib(n-2); sync;sync; return (x+y); }}Dynamic MultithreadingThe computation computation dagdag unfolds dynamically.“Processor oblivious.”© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 6Cactus StackBACEDAABACACDACEViews of stackCBADECilk supports C’s rule for pointers: A pointer to stack space can be passed from parent to child, but not from child to parent. (Cilk also supports malloc.)Cilk’s cactus stack supports several views in parallel.© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 7Advanced Features•Returned values can be incorporated into the parent frame using a delayed internal function called an inlet:•Within an inlet, the abort keyword causes all other children of the parent frame to be terminated.•The SYNCHED pseudovariable tests whether a sync would succeed.•A Cilk library provides mutex locks for atomicity.int y;inlet void foo (int x) { if (x > y) y = x;}…spawn foo(bar(z));© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 8Debugging SupportThe NondeterminatorNondeterminator debugging tool detects and localizes data-race bugs.FAILInformation localizing a data race.PASSEvery scheduling produces the same result.Input data set“Abelian”Cilk programA datadata racerace occurs whenever a thread modifies a location and another thread, holding no locks in common, accesses the location simultaneously.© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 9Outline•Theory and Practice•A Chess Lesson•Fun with Algorithms•Work Stealing•Opinion & Conclusion© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 10Algorithmic Complexity MeasuresTP = execution time on P processors© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 11Algorithmic Complexity MeasuresTP = execution time on P processorsT1 = workwork© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 12Algorithmic Complexity MeasuresTP = execution time on P processorsT1 = workworkT∞ = critical pathcritical path© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 13Algorithmic Complexity MeasuresTP = execution time on P processors•TP ≥ T1/P•TP ≥ T∞Lower BoundsT1 = workworkT∞ = critical pathcritical path© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 14Algorithmic Complexity MeasuresTP = execution time on P processors•TP ≥ T1/P•TP ≥ T∞Lower BoundsT1 = workworkT∞ = critical pathcritical pathT1/TP = speedupspeedupT1/T∞ = parallelismparallelism© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 15Greedy SchedulingTheorem [Graham & Brent]: There exists an execution with TP ≤ T1/P + T∞ .© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 16Greedy SchedulingTheorem [Graham & Brent]: There exists an execution with TP ≤ T1/P + T∞ .Proof. At each time step, … P = 3© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 17Greedy SchedulingTheorem [Graham & Brent]: There exists an execution with TP ≤ T1/P + T∞ .Proof. At each time step, if at least P tasks are ready, …P = 3© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 18Greedy SchedulingTheorem [Graham & Brent]: There exists an execution with TP ≤ T1/P + T∞ .Proof. At each time step, if at least P tasks are ready, execute P of them. P = 3© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 19Greedy SchedulingTheorem [Graham & Brent]: There exists an execution with TP ≤ T1/P + T∞ .If fewer than P tasks are ready, …Proof. At each time step, if at least P tasks are ready, execute P of them. P = 3© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 20Greedy SchedulingTheorem [Graham & Brent]: There exists an execution with TP ≤ T1/P + T∞ .If fewer than P tasks are ready, execute all of them.Proof. At each time step, if at least P tasks are ready, execute P of them. P = 3© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 21Greedy SchedulingTheorem [Graham & Brent]: There exists an execution with TP ≤ T1/P + T∞ .If fewer than P tasks are ready, execute all of them.Proof. At each time step, if at least P tasks are ready, execute P of them. P = 3Corollary: Linear speed-up when P ≤ T1/T∞ .© 2001–4 by Charles E. Leiserson and Bradley C. Kuszmaul 22Cilk PerformanceCilk’s “work-stealing”“work-stealing” scheduler achieves•TP = T1/P + O(T∞) expected time (provably);•TP ≈ T1/P + T∞ time (empirically).Near-perfect linear speedup if P ≤ T1/T∞.Instrumentation in Cilk provides accurate measures of T1 and T∞ to the user.The average cost of a spawnspawn in Cilk-5 is only 2–6 times the cost of an ordinary C function call, depending on the platform.© 2001–4 by Charles E. Leiserson and


View Full Document

MIT 6 189 - Design and Analysis of Dynamic Multithreaded Algorithms

Download Design and Analysis of Dynamic Multithreaded Algorithms
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Design and Analysis of Dynamic Multithreaded Algorithms and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Design and Analysis of Dynamic Multithreaded Algorithms 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?