Unformatted text preview:

Steps in Creating a Parallel Program Partitioning D e c o m p o s i t i o n Sequential computation A s s i g n m e n t Tasks p0 p1 p2 p3 Processes O r c h e s t r a t i o n p0 p1 p2 p3 Parallel program M a p p i n g P0 P1 P2 P3 Processors 4 steps Decomposition Assignment Orchestration Mapping Performance Goal of the steps Minimize resulting execution time by Balancing computations on processors every processor does the same amount of work Minimizing communication cost and other overheads associated with each step EECC756 Shaaban 1 lec 5 Spring2002 3 Parallel Programming for Performance A process of Successive Refinement of the steps Partitioning for Performance Load Balancing and Synch Wait Time Reduction Identifying Managing Concurrency Static Vs Dynamic Assignment Determining Optimal Task Granularity Reducing Serialization Reducing Inherent Communication Minimizing communication to computation ratio Efficient Domain Decomposition Reducing Additional Overheads Orchestration Mapping for Performance Extended Memory Hierarchy View of Multiprocessors Exploiting Spatial Locality Reduce Artifactual Communication Structuring Communication Reducing Contention Overlapping Communication EECC756 Shaaban 2 lec 5 Spring2002 3 Successive Refinement of Parallel Program Performance Partitioning is often independent of architecture and may be done first View machine as a collection of communicating processors Balancing the workload across processes processors Reducing the amount of inherent communication Reducing extra work to find a good assignment Above three issues are conflicting Then deal with interactions with architecture Orchestration Mapping View machine as an extended memory hierarchy Extra communication due to architectural interactions Cost of communication depends on how it is structured This may inspire changes in partitioning EECC756 Shaaban 3 lec 5 Spring2002 3 Partitioning for Performance Balancing the workload across processes reducing wait time at synch points Reducing interprocess inherent communication Reducing extra work needed to find a good assignment These algorithmic issues have extreme trade offs Minimize communication run on 1 processor extreme load imbalance Maximize load balance random assignment of tiny tasks no control over communication Good partition may imply extra work to compute or manage it The goal is to compromise between the above extremes EECC756 Shaaban 4 lec 5 Spring2002 3 Load Balancing and Synch Wait Time Reduction Limit on speedup Speedupproblem p Sequential Work Max Work on any Processor Work includes data access and other costs Not just equal work but must be busy at same time to minimize synch wait time Four parts to load balancing and reducing synch wait time 1 Identify enough concurrency 2 Decide how to manage it 3 Determine the granularity at which to exploit it 4 Reduce serialization and cost of synchronization EECC756 Shaaban 5 lec 5 Spring2002 3 Identifying Concurrency Decomposition Concurrency may be found by Examining loop structure of sequential algorithm Fundamental data dependences Exploit the understanding of the problem to devise algorithms with more concurrency Data Parallelism versus Function Parallelism Data Parallelism Parallel operation sequences performed on elements of large data structures Such as resulting from parallization of loops Usually easy to load balance Degree of concurrency usually increase with input or problem size EECC756 Shaaban 6 lec 5 Spring2002 3 Identifying Concurrency continued Function or Task parallelism Entire large tasks procedures that can be done in parallel on the same or different data e g different independent grid computations in Ocean Software Pipelining Different functions or software stages of the pipeline performed on different data As in video encoding decoding or polygon rendering Degree usually modest and does not grow with input size Difficult to load balance Often used to reduce synch between data parallel phases Most scalable parallel programs Data parallel per this loose definition Function parallelism reduces synch between data parallel phases EECC756 Shaaban 7 lec 5 Spring2002 3 Levels of Parallelism in VLSI Routing W1 Wire Parallelism W2 W3 a Wire W2 expands to segments Segment Parallelism S21 S22 S23 S24 S25 S26 b Segment S23 expands to r outes Route Parallelism c EECC756 Shaaban 8 lec 5 Spring2002 3 Managing Concurrency Assignment Goal Obtain an assignment with a good load balance among tasks and processors in mapping step Static versus Dynamic Assignment Static Assignment Algorithmic assignment usually based on input data does not change at run time Low run time overhead Computation must be predictable Preferable when applicable Dynamic Assignment Adapt partitioning at run time to balance load on processors Can increase communication cost and reduce data locality Can increase run time task management overheads EECC756 Shaaban 9 lec 5 Spring2002 3 Dynamic Assignment Mapping Profile based semi static Profile algorithm work distribution initially at runtime and repartition dynamically Applicable in many computations e g Barnes Hut some graphics Dynamic Tasking Deal with unpredictability in program or environment e g Ray trace Computation communication and memory system interactions Multiprogramming and heterogeneity Used by runtime systems and OS too Pool of tasks take and add tasks until done E g self scheduling of loop iterations shared loop counter EECC756 Shaaban 10 lec 5 Spring2002 Dynamic Tasking with Task Queues Centralized versus distributed queues Task stealing with distributed queues Can compromise communication and data locality and increase synchronization Whom to steal from how many tasks to steal Termination detection all queues empty Maximum load imbalance possible related to size of task All pr ocesses insert tasks P0 inserts QQ 0 P1 inserts P2 inserts P3 inserts Q1 Q2 Q3 P2 removes P3 removes Others may steal All remove tasks a Centralized task queue P0 removes P1 removes b Distributed task queues one per pr ocess EECC756 Shaaban 11 lec 5 Spring2002 Impact of Dynamic Assignment On SGI Origin 2000 cache coherent shared memory 30 Origin semistatic Challenge semistatic Origin static Challenge static 25 10 Speedup Speedup 15 20 15 10 a 0 5 Origin dynamic Challenge dynamic Origin static Challenge static 25 20 30 5 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 Number of processors Barnes Hut 512k particle b 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 Number of processors


View Full Document

RIT EECC 756 - Steps in Creating a Parallel Program

Documents in this Course
Load more
Loading Unlocking...
Login

Join to view Steps in Creating a Parallel Program and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Steps in Creating a Parallel Program and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?