Motivating Problems Parallel Programming Simulating Ocean Currents Todd C Mowry CS 740 October 16 18 2000 Regular structure scientific computing Simulating the Evolution of Galaxies Irregular structure scientific computing Rendering Scenes by Ray Tracing Topics Motivating Examples Parallel Programming for High Performance Impact of the Programming Model Irregular structure computer graphics Not discussed here read in book Case Studies Ocean simulation Barnes Hut N body simulation 2 Simulating Ocean Currents CS 740 F 00 Simulating Galaxy Evolution Simulate the interactions of many stars evolving over time Computing forces is expensive O n2 brute force approach m1m2 Hierarchical Methods take advantage of force law G r2 Star on which for ces are being computed a Cross sections b Spatial discretization of a cross section Model as two dimensional grids Discretize in space and time finer spatial and temporal resolution greater accuracy Many different computations per time step set up and solve equations Concurrency across and within grid computations 3 Large group far enough away to approximate Star too close to approximate Many CS 740 F 00 time steps plenty of concurrency across stars within one 4 Page 1 Small group far enough away to approximate to center of mass CS 740 F 00 Rendering Scenes by Ray Tracing Parallel Programming Task Break up computation into tasks Shoot rays into scene through pixels in image plane Follow their paths they bounce around as they strike objects they generate new rays ray tree per input ray Result is color and opacity for that pixel Parallelism across rays assign tasks to processors Break up data into chunks assign chunks to memories Introduce synchronization for All case studies have abundant concurrency 5 mutual exclusion event ordering CS 740 F 00 6 Steps in Creating a Parallel Program CS 740 F 00 Partitioning for Performance Partitioning D e c o m p o s i t i o n Sequential computation A s s i g n m e n t Tasks p0 p1 p2 p3 Processes O r c h e s t r a t i o n p0 p1 p2 p3 Parallel program M a p p i n g P0 P1 P2 P3 Balancing the workload and reducing wait time at synch points Reducing inherent communication Reducing extra work Even these algorithmic issues trade off Minimize comm run on 1 processor extreme load imbalance Maximize load balance random assignment of tiny tasks no control over communication Good partition may imply extra work to compute or manage it Processors 4 steps Decomposition Assignment Orchestration Mapping Goal is to compromise Done by programmer or system software compiler runtime Issues are the same so assume programmer does it all explicitly 7 Fortunately often not difficult in practice CS 740 F 00 8 Page 2 CS 740 F 00 Load Balance and Synch Wait Time Limit on speedup Speedupproblem p Deciding How to Manage Concurrency Sequential Work Static versus Dynamic techniques Static Max Work on any Processor Work includes data access and other costs Not just equal work but must be busy at same time Four parts to load balance and reducing synch wait time Algorithmic assignment based on input won t change Low runtime overhead Computation must be predictable Preferable when applicable except in multiprogrammed heterogeneous environment Dynamic 1 Identify enough concurrency Adapt at runtime to balance load Can increase communication and reduce locality Can increase task management overheads 2 Decide how to manage it 3 Determine the granularity at which to exploit it 4 Reduce serialization and cost of synchronization 9 CS 740 F 00 10 Dynamic Assignment CS 740 F 00 Dynamic Tasking with Task Queues Profile based semi static Centralized versus distributed queues Task stealing with distributed queues Profile work distribution at runtime and repartition dynamically Applicable in many computations e g Barnes Hut some graphics Dynamic Tasking Deal with unpredictability in program or environment e g Raytrace computation communication and memory system interactions multiprogramming and heterogeneity used by runtime systems and OS too Pool of tasks take and add tasks until done E g self scheduling of loop iterations shared loop counter Can compromise comm and locality and increase synchronization Whom to steal from how many tasks to steal Termination detection Maximum imbalance related to size of task All pr ocesses insert tasks QQ 0 P1 inserts P2 inserts P3 inserts Q1 Q2 Q3 Others may steal All r emove tasks a Centralized task queue 11 P0 inserts CS 740 F 00 12 Page 3 P0 removes P1 removes P2 removes P3 removes b Distributed task queues one per pr ocess CS 740 F 00 Determining Task Granularity Reducing Serialization Careful about assignment and orchestration including scheduling Event synchronization Task granularity amount of work associated with a task Reduce use of conservative synchronization e g point to point instead of barriers or granularity of pt to pt But fine grained synch more difficult to program more synch ops General rule Coarse grained often less load balance Fine grained more overhead often more communication and contention Mutual exclusion Separate locks for separate data e g locking records in a database lock per process record or field lock per task in task queue not per queue finer grain less contention serialization more space less reuse Smaller less frequent critical sections don t do reading testing in critical section only modification e g searching for task to dequeue in task queue building tree Stagger critical sections in time Communication and contention actually affected by assignment not size Overhead by size itself too particularly with task queues 13 CS 740 F 00 14 Reducing Inherent Communication CS 740 F 00 Domain Decomposition Communication is expensive Measure communication to computation ratio Focus here on inherent communication Works well for scientific engineering graphics applications Exploits local biased nature of physical problems Information requirements often short range Or long range but fall off with distance Determined by assignment of tasks to processes Later see that actual communication can be greater Simple example nearest neighbor grid computation Assign tasks that access same data to same process Solving communication and load balance NP hard in general case But simple heuristic solutions work well in practice n n p P0 P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12 P13 P14 P15 n Applications have structure n p Perimeter to Area comm to comp ratio area to volume in 3D Depends on n p decreases with n
View Full Document
Unlocking...