Parallel Programming Overview Todd C Mowry CS 740 September 20 2007 Outline Motivating Problems application case studies Steps in creating a parallel program What a simple parallel program looks like In the three major programming models What primitives must a system support Later Performance issues and architectural interactions 2 CS 740 F 07 Page 1 Motivating Problems Simulating Ocean Currents Regular structure scientific computing Simulating the Evolution of Galaxies Irregular structure scientific computing Rendering Scenes by Ray Tracing Irregular structure computer graphics Data Mining Irregular structure information processing Not discussed here read in book 3 CS 740 F 07 Simulating Ocean Currents a Cross sections b Spatial discretization of a cross section Model as two dimensional grids Discretize in space and time finer spatial and temporal resolution greater accuracy Many different computations per time step set up and solve equations Concurrency across and within grid computations 4 CS 740 F 07 Page 2 Simulating Galaxy Evolution Simulate the interactions of many stars evolving over time Computing forces is expensive O n2 brute force approach m1m2 Hierarchical Methods take advantage of force law G r2 Star on which forces are being computed Star too close to approximate Many Large group far enough away to approximate Small group far enough away to approximate to center of mass time steps plenty of concurrency across stars within one 5 CS 740 F 07 Rendering Scenes by Ray Tracing Shoot rays into scene through pixels in image plane Follow their paths they bounce around as they strike objects they generate new rays ray tree per input ray Result is color and opacity for that pixel Parallelism across rays All case studies have abundant concurrency 6 CS 740 F 07 Page 3 Creating a Parallel Program Assumption Sequential algorithm is given Sometimes need very different algorithm but beyond scope Pieces of the job Identify work that can be done in parallel Partition work and perhaps data among processes Manage data access communication and synchronization Note work includes computation data access and I O Main goal Speedup plus low prog effort and resource needs Performance p Speedup p Performance 1 For a fixed problem Time 1 Speedup p Time p 7 CS 740 F 07 Steps in Creating a Parallel Program Partitioning D e c o m p o s i t i o n Sequential computation A s s i g n m e n t Tasks p0 p1 p2 p3 O r c h e s t r a t i o n Processes p0 p1 p2 p3 Parallel program M a p p i n g P0 P1 P2 P3 Processors 4 steps Decomposition Assignment Orchestration Mapping Done by programmer or system software compiler runtime Issues are the same so assume programmer does it all explicitly 8 CS 740 F 07 Page 4 Some Important Concepts Task Arbitrary piece of undecomposed work in parallel computation Executed sequentially concurrency is only across tasks E g a particle cell in Barnes Hut a ray or ray group in Raytrace Fine grained versus coarse grained tasks Process thread Abstract entity that performs the tasks assigned to processes Processes communicate and synchronize to perform their tasks Processor Physical engine on which process executes Processes virtualize machine to programmer first write program in terms of processes then map to processors 9 CS 740 F 07 Decomposition Break up computation into tasks to be divided among processes Tasks may become available dynamically No of available tasks may vary with time i e identify concurrency and decide level at which to exploit it Goal Enough tasks to keep processes busy but not too many of tasks available at a time is upper bound on achievable speedup 10 CS 740 F 07 Page 5 Limited Concurrency Amdahl s Law Most fundamental limitation on parallel speedup If fraction s of seq execution is inherently serial speedup 1 s Example 2 phase calculation sweep over n by n grid and do some independent computation sweep again and add each value to global sum Time for first phase n2 p Second phase serialized at global variable so time n2 Speedup 2n2 n2 p or at most 2 n2 Trick divide second phase into two accumulate into private sum during sweep add per process private sum into global sum Parallel time is n2 p n2 p p and speedup at best 11 p2n2 2n2 p2 CS 740 F 07 Pictorial Depiction 1 work done concurrently a n2 n2 p b 1 n2 p n2 p 1 c Time n2 p n2 p p 12 CS 740 F 07 Page 6 Concurrency Profiles Cannot usually divide into serial and parallel part 1 400 1 200 Concurrency 1 000 800 600 400 733 702 662 633 589 564 526 504 483 444 415 380 343 313 286 247 219 0 150 200 Clock cycle number Area under curve is total work done or time with 1 processor Horizontal extent is lower bound on time infinite processors Speedup is the ratio fk k k 1 k 1 fk kp base case 1 s 1 s p Amdahl s law applies to any overhead not just limited concurrency 13 CS 740 F 07 Steps in Creating a Parallel Program Partitioning D e c o m p o s i t i o n Sequential computation A s s i g n m e n t Tasks p0 p1 p2 p3 O r c h e s t r a t i o n Processes p0 p1 p2 p3 Parallel program M a p p i n g P0 P1 P2 P3 Processors 4 steps Decomposition Assignment Orchestration Mapping Done by programmer or system software compiler runtime Issues are the same so assume programmer does it all explicitly 14 CS 740 F 07 Page 7 Assignment Specifying mechanism to divide work up among processes E g which process computes forces on which stars or which rays Together with decomposition also called partitioning Balance workload reduce communication and management cost Structured approaches usually work well Code inspection parallel loops or understanding of application Well known heuristics Static versus dynamic assignment As programmers we worry about partitioning first Usually independent of architecture or prog model But cost and complexity of using primitives may affect decisions As architects we assume program does reasonable job of it 15 CS 740 F 07 Orchestration Naming data Structuring communication Synchronization Organizing data structures and scheduling tasks temporally Goals Reduce cost of communication and synch as seen by processors Preserve locality of data reference incl data structure organization Schedule tasks to satisfy dependences early Reduce overhead of parallelism management Closest to architecture and programming model language Choices depend a lot on comm abstraction efficiency of primitives Architects should provide appropriate primitives efficiently 16 CS 740 F 07 Page 8 Mapping After orchestration already have
View Full Document
Unlocking...