Unformatted text preview:

5 5 Ordered sections Sometimes the only way to avoid races is to execute the code serially Consider the looP do i 1 n a i b i 1 c i sin c i 1 1 d i c i d i 1 2 end do Although there is no clear way to avoid races in this loop we could execute in parallel the first statement In fact we can in this case transform the loop into c omp parallel do do i 1 n a i b i 1 end do do i 1 n c i sin c i 1 1 d i c i d i 1 2 end do 1 of 36 However there is a way to improve the performance of the whole loop with the ordered directive whose syntax is as follows c omp ordered name block c omp end ordered name The interleaving of the statements in the ordered sections of different iterations are identical to that of the sequential program Ordered sections without a name are all assumed to have the same name Thus the previous loop can be rewritten as do i 1 n a i b i 1 c omp ordered x c i sin c i 1 1 c omp end ordered x c omp ordered y d i c i d i 1 2 c omp end ordered y end do 2 of 36 Thus we have tto ways of executing the loop in parallel Assuming n 12 and four processors we would have the following time lines in each case a a a a a a a a a a a a a a a a c d c d c 3 of 36 Notice that now no races exist because accesses to the same memory location are always performed in the same order Ordered sections may need to include more than one statement For example in the loop do i 1 n a i b i 1 1 b i a i c i end do the possibility of races would not be avoided unless both statements are made part of the same ordered section 4 of 36 It is important to make the ordered sections as small as possible because the overall execution time depends on the size of the longest ordered section 5 of 36 5 6 Execution time of a parallel do when ordered sections have constant execution time Consider the loop c omp parallel do do i 1 n c omp ordered a aa c omp end ordered a c omp ordered b c omp ordered c c omp ordered d c omp ordered e end do 6 of 36 Assume its execution time lines have the following form a b a c d b c a e d b a c b e d c e d e which in terms of performance is equivalent to the 7 of 36 following time lines a b D c a d b D D D D e c a d b D e c a d b c e d e where a constant delay D between the start of consecutive iterations is evident This delay is equal to the time of the longest ordered section i e D T c in this case The execution time of the previous loop using n processors is 8 of 36 T a T b nT c T d T e as can be seen next T a T b nT c nD T d T e a b c a d e b c a d b c e d e In general the execution time when there are as many processors as iterations is nD B D n 1 D B where B is the execution time of the whole loop body 9 of 36 5 6 Critical Regions and Reductions Consider the following loop do i 1 n do i 1 m ia i j b i j d i j isum isum ia i j end end Here we have a race due to isum This race cannot be removed by the techniques discussed above However the operation used to compute isum is associative and isum only appears in the statement that computes its value The integer addition operation is not really associative but in practice we can assume it is if the numbers are small enough so there is never any overflow 10 of 36 Under these circumstances the loop can be transformed into the following form c omp c omp c omp c omp c omp c omp parallel private local isum local isum 0 pdo do i 1 n do j 1 m local isum local isum ia j i end do end do end pdo nowait critical isum isum local isum end critical end parallel 11 of 36 Here we use the critical directive to avoid the following problem The statement isum isum local isum will be translated the following load load add store into a machine language sequence similar to register 1 isum register 2 local isum register 3 register 1 register 2 register 3 isum Assume now there are two tasks executing the statement isum isum local isum simultaneously In one local sum is 10 and in the other 15 Assume isum is 0 when both tasks start executing the statement Consider the following sequence of events 12 of 36 time task 1 isum task 2 1 load r1 local isum 0 2 load r2 isum 0 load r1 local isum 3 add r3 r2 r1 0 load r2 isum 4 store r3 isum 10 add r3 r2 r1 15 store r3 isum As can be seen interleaving the instructions between the two tasks produces incorrect results The critical directive precludes this interleaving Only one task at a time can execute a critical region with the same name The assumption is that it does not matter in which order the tasks enter a critical region as long as they are never inside a critical region of the same name at the same time 13 of 36 An alternative way of writing the above parallel loop is c omp parallel do reduction isum do i 1 n do j 1 n isum isum ia j i end do end do The reduction clause can be applied to a number of operations and intrinsic functions 14 of 36 Chapter 6 Parallel Vector Algorithms 15 of 36 6 1 Introduction We will now study several algorithms where the parallelism can be easily expressed in terms of vector operations Simplistic timing figures will be given in some cases for pipelined machines array machines and multiprocessors In these timings parallelism overhead subscript computations and memory access communications costs will be ignored 16 of 36 6 2 Time to execute a vector operation Let us start with the simplest possible situation Consider the following generic vector operation a 1 n b 1 n First let us assume a pipelined arithmetic unit with s stages for operation Each stage takes units of time The time to execute the vector operation under these assumptions is tpip eline s n 1 Compare this with the …


View Full Document
Loading Unlocking...
Login

Join to view Lecture notes and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Lecture notes and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?