Unformatted text preview:

Parallelization of An Example Program Examine a simplified version of a piece of Ocean simulation Iterative equation solver Illustrate parallel program in low level parallel language C like pseudocode with simple extensions for parallelism Expose basic communication and synch primitives that must be supported 1 Grid Solver Example Expression for updating each interior point A i j 0 2 A i j A i j 1 A i 1 j A i j 1 A i 1 j Simplified version of solver in Ocean simulation Gauss Seidel near neighbor sweeps to convergence interior n by n points of n 2 by n 2 updated in each sweep updates done in place in grid and diff from prev value computed accumulate partial diffs into global diff at end of every sweep check if error has converged to within a tolerance parameter if so exit solver if not do another sweep 2 1 i n t n 2 fl o a t A d i ff size of matrix n 2 by n 2 elements 0 3 m a i n 4 b e g i n 5 read input parameter matrix size read n 6 A malloc a 2 d array of size n 2 by n 2 doubles 7 initialize the matrix A somehow initialize A 8 call the routine to solve equation Solve A 9 e n d m a i n 10 p r o c e d u r e S o l v e A solve the equation system 11 A is an n 2 by n 2 array fl o a t A 12 b e g i n 13 int i j done 0 14 fl o a t d i ff 0 t e m p 15 outermost loop over sweeps while done do 16 initialize maximum difference to 0 d i ff 0 17 sweep over nonborder points of grid for i 1 to n do 18 for j 1 to n do 19 save old value of element temp A i j 20 A i j 0 2 A i j A i j 1 A i 1 j 21 A i j 1 A i 1 j compute average 22 d i ff a b s A i j t e m p 23 end for 24 end for 25 i f d i ff n n T O L t h e n d o n e 1 26 end while 27 e n d p r o c e d u r e 3 Decomposition Simple way to identify concurrency is to look at loop iterations dependence analysis if not enough concurrency then look further Not much concurrency here at this level all loops sequential Examine fundamental dependences ignoring loop structure Concurrency O n along anti diagonals serialization O n along diag Retain loop structure use pt to pt synch Problem too many synch ops Restructure loops use global synch imbalance and too much synch 4 Exploit Application Knowledge Reorder grid traversal red black ordering Red point Black point Different ordering of updates may converge quicker or slower Red sweep and black sweep are each fully parallel Global synch between them conservative but convenient Ocean uses red black we use simpler asynchronous one to illustrate no red black simply ignore dependences within sweep sequential order same as original parallel program nondeterministic 5 Decomposition Only 15 while done do a sequential loop 16 diff 0 17 for all i 1 to n do a parallel loop nest 18 for all j 1 to n do 19 temp A i j 20 A i j 0 2 A i j A i j 1 A i 1 j 21 A i j 1 A i 1 j 22 diff abs A i j temp 23 end for all 24 end for all 25 if diff n n TOL then done 1 26 end while Decomposition into elements degree of concurrency n2 To decompose into rows make line 18 loop sequential degree n for all leaves assignment left to system but implicit global synch at end of for all loop 6 Assignment Static assignments given decomposition into rows i block assignment of rows Row i is assigned to process p cyclic assignment of rows process i is assigned rows i i p and so on P0 P1 P2 P4 Dynamic assignment get a row index work on the row get a new row and so on Static assignment into rows reduces concurrency from n to p Let s dig into orchestration under three programming models block assign reduces communication by keeping adjacent rows together 7 Data Parallel Solver grid size n 2 by n 2 and number of processes 1 2 int n nprocs fl oat A diff 3 4 5 6 7 8 9 main begin read input grid size and number of processes read n read nprocs A G MALLOC a 2 d array of size n 2 by n 2 doubles initialize the matrix A somehow initialize A call the routine to solve equation Solve A end main 0 solve the equation system 10 procedure Solve A A is an n 2 by n 2 array 11 fl oat A 12 begin 13 int i j done 0 14 fl oat mydif 0 temp 14a DECOMP A BLOCK nprocs outermost loop over sweeps 15 while done do initialize maximum difference to 0 16 mydif 0 sweep over non border points of grid 17 for all i 1 to n do 18 for all j 1 to n do save old value of element 19 temp A i j 20 A i j 0 2 A i j A i j 1 A i 1 j compute average 21 A i j 1 A i 1 j 22 mydiff abs A i j temp 23 end for all 24 end for all 24a REDUCE mydif dif ADD 25 if diff n n TOL then done 1 26 end while 27 end procedure 8 Shared Address Space Solver Single Program Multiple Data SPMD Processes Solve Solve Solve Solve Sweep Test Convergence Assignment controlled by values of variables used as loop bounds 9 1 2a int n nprocs fl oat A diff 2b 2c LOCKDEC dif lock BARDEC bar1 matrix dimension and number of processors to be used A is global shared array representing the grid diff is global shared maximum difference in current sweep declaration of lock to enforce mutual exclusion barrier declaration for global synchronization between sweeps 3 4 5 6 7 8a 8 8b 9 main begin 10 11 procedure Solve A fl oat A 12 13 14 14a 14b begin int i j pid done 0 fl oat temp mydif 0 int mymin 1 pid n nprocs int mymax mymin n nprocs 1 15 16 16a 17 18 19 20 21 22 23 24 25a 25b 25c 25d 25e outer loop over all diagonal elements while done do set global diff to 0 okay for all to do it mydif diff 0 ensure all reach here before anyone modifies diff BARRIER bar1 nprocs for each of my rows for i mymin to mymax do for all nonborder …


View Full Document

RIT EECC 756 - Parallelization

Documents in this Course
Load more
Loading Unlocking...
Login

Join to view Parallelization and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Parallelization and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?