Parallel Vector Algorithms 1 of 34 1 Introduction We will now study several algorithms where the parallelism can be easily expressed in terms of vector operations Simplistic timing figures will be given in some cases for pipelined machines array machines and multiprocessors In these timings parallelism overhead subscript computations and memory access communications costs will be ignored 2 of 34 2 Time to execute a vector operation Let us start with the simplest possible situation Consider the following generic vector operation a 1 n b 1 n First let us assume a pipelined arithmetic unit with s stages for operation Each stage takes units of time The time to execute the vector operation under these assumptions is t pip eline s n 1 Compare this with the serial time when no pipelining takes place t serial s n 3 of 34 Consider now an array machine with P arithmetic units or a multiprocessor with P processors The execution time is t parallel n t P where t is the time to execute one operation In a system where each processor contains an arithmentic pipeline the execution time would be t parallel pipeline s 1 n P 4 of 34 3 Time to Execute a Reduction s a 1 a 2 a 3 a n A sequence of log2n vector operations of length n 2 n 4 1 suffices to compute the reduction assuming associativity Therefore assuming n 2m log n t pip e line i 1 n s 1 i s 1 log n n 1 2 5 of 34 In the case of an array machine if the final reduction is done in logarithmic time the execution time is t parallel n t log P t P If P n the time is t parallel log n t 6 of 34 4 Parallel Prefix Consider the following loop A 0 0 DO I 1 N A I A I 1 B I END DO The loop seems sequential because each iteration needs information on the value computed in the preceding iteration However we can use a parallel prefix approach to compute the value of vector A in parallel as follows B 1 B 2 B 3 B 4 B N B 1 B 1 B 2 B 2 B 3 B 3 B 4 B N 1 B N B 1 B 1 B 2 B 1 B 2 B 1 B 2 B N 3 B N 2 B 3 B 3 B 4 B N 1 B N 7 of 34 A parallel program implementing this strategy under the assumption that N 2k is A 1 N B 1 N DO I 1 K 1 A 1 2 I 1 N A 1 2 I 1 N A 1 N 2 I 1 END DO For an array machine with the number of procesing units P n t parallel t log n 8 of 34 5 Examples of Speedup and Efficiency Consider a 1 n b 1 n The speedup efficiency and redundancy on a pipelined unit are s n sn S s s s n 1 s n 1 s n n E s 1 s s n 1 s n 1 Os ns R s 1 O1 ns 9 of 34 In an array machine or multiprocessor t parallel n t P nt n S P n n t P P The value of SP is P if n is a multiple of P nt n E P P nP n t P P EP is 1 if n is a multiple of P Otherwise it is 1 RP 1 10 of 34 The speedup efficiency and redundancy of the parallel prefix example on an array machine or multiprocessor with P n are nt n S n log n t log n nt 1 E n n log n t log n n n 1 n 2 n On 2 n log n 1 1 R n log n O1 n n 1 11 of 34 6 Matrix Vector Multiplication In mathematical notation n A1i Vi A 11 A 12 A 1n A 21 A 22 A 2n A m1 A m2 A mn i 1 n V1 V2 Vn A2i Vi i 1 n Ami Vi i 1 In Fortran do i 1 m R i 0 do j 1 n R i R i A i j V j end do end do 12 of 34 do i 1 m R i DOT PRODUCT A i 1 n V 1 n end do The dot product is a vector multiplication of length n in this case followed by a reduction Time in a pipelined machine for a dot product s n 1 s 1 log n n 1 s 1 log n s 2 n 1 The total time for the matrix vector multiplication is then m s 1 log n s 2 n 1 In an array machine or in a multiprocessor the time if P n is t m log n 1 13 of 34 Alternatively by interchanging the loop headers the program could be written as follows do j 1 n do i 1 m R i R i A i j V j end do end do This leads to the following sequence of vector operations do j 1 n R 1 m R 1 m A 1 m j V j end do The time for this loop in a pipelined machine is n s m 1 In an array machine or in a multiprocessor the time if P m is t 2n 14 of 34 7 Matrix Matrix Multiplication 1 Inner product method Matrix multiplication is usually written do i 1 n do j 1 n do k 1 n C i j C i j A i k B k j end do end do end do The most direct translation of this program into vector form is do i 1 n do j 1 n C i j DOT PRODUCT A i 1 n B 1 n j end do end do 15 of 34 The time on a pipelined machine is n 2 s 1 log n s 2 n 1 The time on an array machine or multiprocessor if P n is t log n t n 2 16 of 34 2 Middle product method n parallelism This is obtained by interchanging the headers in the original matrix multiplication loop do j 1 n do k 1 n do i 1 n C i j C i j A i k B k j end do end do end do The direct translation of this loop into vector form is do j 1 n do k 1 n C 1 n j C 1 n j A 1 n k B k j end do end do 17 of 34 Alternatively the headers could have been exchanged in a different order to obtain the loop do j 1 n do k 1 n C i 1 n C i 1 n A i k B k 1 n end do end do The time on a pipelined machine is 2n 2 s n …
View Full Document
Unlocking...