Contents
• Parallelization of An Example Program
• Grid Solver Example
• Slide 3
• Decomposition
• Exploit Application Knowledge
• Decomposition Only
• Assignment
• Data Parallel Solver
• Shared Address Space Solver
• Slide 10
• Notes on SAS Program
• Need for Mutual Exclusion
• Mutual Exclusion
• Global Event Synchronization
• Point-to-point Event Synch (Not Used Here)
• Group Event Synchronization
• Message Passing Grid Solver
• Slide 18
• Notes on Message Passing Program
• Send and Receive Alternatives
• Orchestration: Summary
• Correctness in Grid Solver Program

Slide 1: Parallelization of An Example Program

• Examine a simplified version of a piece of the Ocean simulation
  – iterative equation solver
• Illustrate a parallel program in a low-level parallel language
  – C-like pseudocode with simple extensions for parallelism
  – expose the basic communication and synchronization primitives that must be supported

Slide 2: Grid Solver Example

• Simplified version of the solver in the Ocean simulation
• Gauss-Seidel (near-neighbor) sweeps to convergence
  – interior n-by-n points of the (n+2)-by-(n+2) grid updated in each sweep
  – updates done in place in the grid, and the difference from the previous value computed
  – partial diffs accumulated into a global diff at the end of every sweep
  – check whether the error has converged (to within a tolerance parameter)
  – if so, exit the solver; if not, do another sweep

Expression for updating each interior point:

  A[i,j] = 0.2 × (A[i,j] + A[i,j−1] + A[i−1,j] + A[i,j+1] + A[i+1,j])

Slide 3

1.  int n;                  /*size of matrix: (n+2)-by-(n+2) elements*/
2.  float **A, diff = 0;
3.  main()
4.  begin
5.    read(n);              /*read input parameter: matrix size*/
6.    A ← malloc(a 2-d array of size n+2 by n+2 doubles);
7.    initialize(A);        /*initialize the matrix A somehow*/
8.    Solve(A);             /*call the routine to solve equation*/
9.  end main
10. procedure Solve(A)      /*solve the equation system*/
11.   float **A;            /*A is an (n+2)-by-(n+2) array*/
12. begin
13.   int i, j, done = 0;
14.   float diff = 0, temp;
15.   while (!done) do      /*outermost loop over sweeps*/
16.     diff = 0;           /*initialize maximum difference to 0*/
17.     for i ← 1 to n do   /*sweep over nonborder points of grid*/
18.       for j ← 1 to n do
19.         temp = A[i,j];  /*save old value of element*/
20.         A[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                  A[i,j+1] + A[i+1,j]);   /*compute average*/
22.         diff += abs(A[i,j] - temp);
23.       end for
24.     end for
25.     if (diff/(n*n) < TOL) then done = 1;
26.   end while
27. end procedure
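For concreteness, here is one way the sequential pseudocode above might be written as real C. This is a minimal sketch: the grid size, the tolerance value, and the boundary initialization are made-up placeholders, not part of the original program.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define TOL 1e-5f                        /* convergence tolerance (placeholder) */

/* Gauss-Seidel sweeps over the interior n-by-n points of the (n+2)-by-(n+2)
   grid, updating in place and accumulating the total absolute change,
   until the average change per point falls below TOL. */
void solve(float **A, int n)
{
    int done = 0;
    while (!done) {
        float diff = 0;
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= n; j++) {
                float temp = A[i][j];    /* save old value of element */
                A[i][j] = 0.2f * (A[i][j] + A[i][j-1] + A[i-1][j] +
                                  A[i][j+1] + A[i+1][j]);  /* compute average */
                diff += fabsf(A[i][j] - temp);
            }
        if (diff / (n * n) < TOL)
            done = 1;
    }
}

int main(void)
{
    int n = 64;                          /* matrix size (placeholder) */
    float **A = malloc((n + 2) * sizeof *A);
    for (int i = 0; i < n + 2; i++)
        A[i] = calloc(n + 2, sizeof **A);  /* interior and borders start at 0 */
    for (int j = 0; j < n + 2; j++)
        A[0][j] = 1.0f;                  /* arbitrary nonzero boundary condition */
    solve(A, n);
    printf("A[1][1] = %f\n", A[1][1]);
    return 0;
}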
Slide 4: Decomposition

• A simple way to identify concurrency is to look at loop iterations
  – dependence analysis; if there is not enough concurrency there, look further
• Not much concurrency here at this level (all loops sequential)
• Examine the fundamental dependences, ignoring loop structure
• Concurrency O(n) along anti-diagonals, serialization O(n) along the diagonal
• Retain loop structure, use point-to-point synch; problem: too many synch operations
• Restructure loops, use global synch; imbalance and still too much synch

Slide 5: Exploit Application Knowledge

• Reorder grid traversal: red-black ordering

[Figure: checkerboard grid of alternating red and black points]

• Different ordering of updates: may converge quicker or slower
• Red sweep and black sweep are each fully parallel (a two-phase sketch appears at the end of this handout)
• Global synch between them (conservative but convenient)
• Ocean uses red-black; we use a simpler, asynchronous method to illustrate
  – no red-black; simply ignore dependences within a sweep
  – sequential order same as original; parallel program nondeterministic

Slide 6: Decomposition Only

15. while (!done) do            /*a sequential loop*/
16.   diff = 0;
17.   for_all i ← 1 to n do     /*a parallel loop nest*/
18.     for_all j ← 1 to n do
19.       temp = A[i,j];
20.       A[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                A[i,j+1] + A[i+1,j]);
22.       diff += abs(A[i,j] - temp);
23.     end for_all
24.   end for_all
25.   if (diff/(n*n) < TOL) then done = 1;
26. end while

• Decomposition into elements: degree of concurrency n²
• To decompose into rows, make the line 18 loop sequential; degree of concurrency n
• for_all leaves assignment to the system
  – but implicit global synch at the end of each for_all loop
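The slides do not prescribe an implementation of for_all, but its semantics (iterations that may run in parallel, assignment left to the system, implicit global synchronization at loop end) map naturally onto an OpenMP parallel loop. Below is a minimal sketch of one sweep under that assumption; parallelizing only the outer loop corresponds to the row decomposition of degree n. Like the asynchronous method on the Exploit Application Knowledge slide, it ignores dependences within the sweep, so the result is nondeterministic.

#include <math.h>

/* One asynchronous sweep over the interior points, in the spirit of the
   for_all loop nest above. OpenMP chooses the assignment of rows to
   threads; the implicit barrier at the end of the parallel loop plays
   the role of for_all's implicit global synchronization. */
float sweep(float **A, int n)
{
    float diff = 0;
    /* reduction(+:diff) gives each thread a private partial diff and
       adds the partials into the shared total when the loop ends */
    #pragma omp parallel for reduction(+:diff)
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= n; j++) {
            float temp = A[i][j];
            A[i][j] = 0.2f * (A[i][j] + A[i][j-1] + A[i-1][j] +
                              A[i][j+1] + A[i+1][j]);
            diff += fabsf(A[i][j] - temp);
        }
    return diff;   /* caller tests diff/(n*n) < TOL, as in line 25 */
}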
Slide 7: Assignment

• Static assignments (given the decomposition into rows)
  – block assignment of rows: row i is assigned to process ⌊i/(n/p)⌋, i.e., contiguous blocks of n/p rows per process
  – cyclic assignment of rows: process i is assigned rows i, i+p, and so on

[Figure: the n rows divided into p contiguous blocks, assigned top to bottom to processes P0, P1, P2, ...]

• Dynamic assignment
  – get a row index, work on the row, get a new row, and so on
• Static assignment into rows reduces concurrency (from n to p)
  – block assignment reduces communication by keeping adjacent rows together
• Let's dig into orchestration under three programming models

Slide 8: Data Parallel Solver

1.  int n, nprocs;        /*grid size (n+2)-by-(n+2) and number of processes*/
2.  float **A, diff = 0;
3.  main()
4.  begin
5.    read(n); read(nprocs);  /*read input grid size and number of processes*/
6.    A ← G_MALLOC(a 2-d array of size n+2 by n+2 doubles);
7.    initialize(A);       /*initialize the matrix A somehow*/
8.    Solve(A);            /*call the routine to solve equation*/
9.  end main
10. procedure Solve(A)     /*solve the equation system*/
11.   float **A;           /*A is an (n+2)-by-(n+2) array*/
12. begin
13.   int i, j, done = 0;
14.   float mydiff = 0, temp;
14a.  DECOMP A[BLOCK,*, nprocs];
15.   while (!done) do     /*outermost loop over sweeps*/
16.     mydiff = 0;        /*initialize maximum difference to 0*/
17.     for_all i ← 1 to n do  /*sweep over non-border points of grid*/
18.       for_all j ← 1 to n do
19.         temp = A[i,j];     /*save old value of element*/
20.         A[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21.                  A[i,j+1] + A[i+1,j]);   /*compute average*/
22.         mydiff += abs(A[i,j] - temp);
23.       end for_all
24.     end for_all
24a.    REDUCE (mydiff, diff, ADD);
25.     if (diff/(n*n) < TOL) then done = 1;
26.   end while
27. end procedure

Slide 9: Shared Address Space Solver

• Single Program Multiple Data (SPMD)

[Figure: processes fork into parallel Solve, Solve, Solve, Solve; each iteration is a Sweep followed by Test Convergence]

• Assignment controlled by values of variables used as loop bounds

Slide 10

1.  int n, nprocs;         /*matrix dimension and number of processors to be used*/
2a. float **A, diff;       /*A is global (shared) array representing the grid*/
                           /*diff is global (shared) maximum difference in current sweep*/
2b. LOCKDEC(diff_lock);    /*declaration of lock to enforce mutual exclusion*/
2c. BARDEC(bar1);          /*barrier declaration for global synchronization between sweeps*/
3.  main()
4.  begin
5.    read(n); read(nprocs);  /*read input matrix size and number of processes*/
6.    A ← G_MALLOC(a 2-d array of size n+2 by n+2 doubles);
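The excerpt breaks off here, but lines 2b and 2c already introduce the two orchestration primitives the shared address space program relies on: a lock for mutually exclusive accumulation into the shared diff, and a barrier for global synchronization between sweeps. As an illustration of how those primitives behave, here is a sketch of one process's sweep using POSIX threads; the block row assignment and the names pid, mymin, and mymax are assumptions in the spirit of the Assignment slide, not code from the original deck.

#include <pthread.h>
#include <math.h>

float **A;                       /* shared (n+2)-by-(n+2) grid */
int n, nprocs;                   /* grid size, process count (nprocs divides n) */
float diff;                      /* shared maximum-difference accumulator */
pthread_mutex_t diff_lock = PTHREAD_MUTEX_INITIALIZER;  /* plays the role of
                                    LOCKDEC(diff_lock) */
pthread_barrier_t bar1;          /* plays the role of BARDEC(bar1); initialize
                                    once with
                                    pthread_barrier_init(&bar1, NULL, nprocs) */

/* One sweep by process pid over its block of rows. */
void sweep_sas(int pid)
{
    /* block assignment: pid gets n/nprocs contiguous rows */
    int mymin = 1 + pid * (n / nprocs);
    int mymax = mymin + n / nprocs - 1;

    float mydiff = 0;                      /* private partial diff */
    for (int i = mymin; i <= mymax; i++)
        for (int j = 1; j <= n; j++) {
            float temp = A[i][j];
            A[i][j] = 0.2f * (A[i][j] + A[i][j-1] + A[i-1][j] +
                              A[i][j+1] + A[i+1][j]);
            mydiff += fabsf(A[i][j] - temp);
        }

    pthread_mutex_lock(&diff_lock);        /* mutual exclusion around the */
    diff += mydiff;                        /* read-modify-write of shared diff */
    pthread_mutex_unlock(&diff_lock);

    pthread_barrier_wait(&bar1);           /* global synch: every partial diff
                                              is in before anyone tests
                                              convergence */
}

Accumulating into the private mydiff and taking the lock once per sweep, rather than locking around every addition on line 22, limits the mutual-exclusion cost to one short critical section per process per sweep.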
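Finally, the two-phase red-black sweep promised on the Exploit Application Knowledge slide. This is again an illustrative OpenMP sketch, not the Ocean code: taking "red" to mean points where i+j is even, every neighbor of a red point is black, so each phase is fully parallel, and the barrier at the end of each parallel loop is the conservative global synchronization between the two sweeps.

#include <math.h>

/* One red-black sweep: phase 0 updates red points (i+j even), phase 1
   updates black points (i+j odd). Within a phase, no updated point reads
   a value written by another iteration of the same phase. */
float red_black_sweep(float **A, int n)
{
    float diff = 0;
    for (int color = 0; color <= 1; color++) {
        /* the implicit barrier at the end of this parallel loop separates
           the red sweep from the black sweep */
        #pragma omp parallel for reduction(+:diff)
        for (int i = 1; i <= n; i++) {
            int jstart = 1 + ((i + 1 + color) & 1);  /* first point of this
                                                        color in row i */
            for (int j = jstart; j <= n; j += 2) {
                float temp = A[i][j];
                A[i][j] = 0.2f * (A[i][j] + A[i][j-1] + A[i-1][j] +
                                  A[i][j+1] + A[i+1][j]);
                diff += fabsf(A[i][j] - temp);
            }
        }
    }
    return diff;
}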