Unformatted text preview:

Steps in Creating a Parallel Program Partitioning Communication Abstraction Parallel Algorithm Computational Problem D e c o m p o s i t i o n A s s i g n m e n t Fine grain Parallel Computations Tasks p0 p1 p2 p3 O r c h e s t r a t i o n Mapping Scheduling Tasks Processes p0 p1 p2 p3 Processes Processors M a p p i n g Execution Order scheduling P0 P1 P2 P3 Processes Sequential Fine grain Tasks Parallel computation Computations Tasks Processes Parallel program Processors 4 steps Decomposition Assignment Orchestration Mapping Performance Goal of the steps Maximize parallel speedup minimize resulting parallel execution time by 1 Balancing computations and overheads on processors every processor does the same amount of work overheads 2 Minimizing communication cost and other overheads associated with each step Parallel Computer Architecture Chapter 3 EECC756 Shaaban 1 lec 6 Spring2008 4 8 2008 Parallel Programming for Performance A process of Successive Refinement of the steps Partitioning for Performance Load Balancing and Synchronization Wait Time Reduction Identifying Managing Concurrency Static Vs Dynamic Assignment or tasking Determining Optimal Task Granularity Reducing Serialization Synch Wait Time Waiting time as a result of data dependency Reducing Inherent Communication Minimizing communication to computation ratio C to C Ratio Efficient Domain Decomposition Reducing Additional Overheads Orchestration Mapping for Performance Extended Memory Hierarchy View of Multiprocessors Exploiting Spatial Locality Reduce Artifactual Communication Structuring Communication or Extra Reducing Contention Overlapping Communication Parallel Computer Architecture Chapter 3 EECC756 Shaaban 2 lec 6 Spring2008 4 8 2008 Successive Refinement of Parallel Program Performance Partitioning is possibly independent of architecture and may be done first initial partition View machine as a collection of communicating processors Balancing the workload across tasks processes processors Reducing the amount of inherent communication Reducing extra work to find a good assignment Above three issues are conflicting Then deal with interactions with architecture Orchestration Mapping View machine as an extended memory hierarchy Reduce artifactual extra communication due to architectural interactions Cost of communication depends on how it is structured possible overlap with computation Hardware Architecture This may inspire changes in partitioning EECC756 Shaaban 3 lec 6 Spring2008 4 8 2008 Partitioning for Performance 1 Balancing the workload across tasks processes Reducing wait time at synchronization points needed to satisfy data dependencies among tasks 2 Reduce Overheads Reducing interprocess inherent communication Reducing extra work needed to find a good assignment The above goals lead to two extreme trade offs Minimize communication run on 1 processor One large task extreme load imbalance Maximize load balance random assignment of tiny tasks no control over communication A good partition may imply extra work to compute or manage it The goal is to compromise between the above extremes EECC756 Shaaban 4 lec 6 Spring2008 4 8 2008 Load Balancing and Synch Wait Time Reduction Limit on speedup Synch wait time process task wait time as a result of data dependency on another task until the dependency is satisfied Sequential Work Max Work on any Processor on any processor Speedupproblem p Work includes computing data access and other costs Not just equal work but must be busy computing at same time to minimize synchronization wait time to satisfy dependencies Four parts to load balancing and reducing synch wait time 1 Identify enough concurrency in decomposition 2 Decide how to manage the concurrency statically or dynamically 3 Determine the granularity task grain size at which to exploit it 4 Reduce serialization and cost of synchronization EECC756 Shaaban 5 lec 6 Spring2008 4 8 2008 Identifying Concurrency Decomposition 1 2 3 Concurrency may be found by Examining loop structure of sequential algorithm Fundamental data dependencies dependency analysis graph Exploit the understanding of the problem to devise parallel algorithms with more concurrency e g ocean equation solver Software Algorithm Parallelism Types 1 Data Parallelism versus 2 Functional Parallelism 1 Data Parallelism Similar parallel operation sequences performed on elements of large data structures e g ocean equation solver pixel level image processing Such as resulting from parallelization of loops Usually easy to load balance e g ocean equation solver Degree of concurrency usually increase with input or problem size e g O n2 in equation solver example Software Algorithm Parallelism Types were also covered in lecture 3 slide 33 EECC756 Shaaban 6 lec 6 Spring2008 4 8 2008 Identifying Concurrency continued 2 Functional Parallelism Entire large tasks procedures with possibly different functionality that can be done in parallel on the same or different data e g different independent grid computations in Ocean Software Pipelining Different functions or software stages of the pipeline performed on different data As in video encoding decoding or polygon rendering Concurrency degree usually modest and does not grow with input size Difficult to load balance Often used to reduce synch wait time between data parallel phases Most scalable parallel programs more concurrency as problem size increases parallel programs Data parallel programs per this loose definition Functional parallelism can still be exploited to reduce synchronization wait time between data parallel phases EECC756 Shaaban Software Algorithm Parallelism Types were also covered in lecture 3 slide 33 7 lec 6 Spring2008 4 8 2008 Managing Concurrency Assignment Goal Obtain an assignment with a good load balance among tasks and processors in mapping step Static versus Dynamic Assignment Static Assignment e g equation solver Algorithmic assignment usually based on input data does not change at run time of computations into tasks P0 At Compilation Time Low run time overhead at compilation time P1 Computation must be predictable P2 P4 Preferable when applicable lower overheads Example 2D Ocean Equation Solver Dynamic Assignment Or tasking Needed when computation not fully predictable At Run Time Adapt partitioning at run time to balance load on processors Can increase communication cost and reduce data locality Can increase run time task management overheads Counts as extra work EECC756 Shaaban 8 lec


View Full Document

RIT EECC 756 - Steps in Creating a Parallel Program

Documents in this Course
Load more
Loading Unlocking...
Login

Join to view Steps in Creating a Parallel Program and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Steps in Creating a Parallel Program and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?