Carnegie Mellon
CS243, Lecture 12: Parallelization (M. Lam)

I. Basic parallelization
II. Data dependence analysis
III. Interprocedural parallelization
Reading: Chapter 11.1-11.1.4

Parallelization of Numerical Applications
• DoAll loop parallelism
– Find loops whose iterations are independent
– The number of iterations typically scales with the problem size
– Usually much larger than the number of processors in a machine
– Divide the iterations up across processors

Basic Parallelism
Examples:
  FOR i = 1 to 100
    A[i] = B[i] + C[i]

  FOR i = 11 TO 20
    a[i] = a[i-1] + 3

  FOR i = 11 TO 20
    a[i] = a[i-10] + 3
• Does there exist a data dependence edge between two different iterations?
• A data dependence edge is loop-carried if it crosses an iteration boundary
• DoAll loops: loops without loop-carried dependences

Recall: Data Dependences
• True dependence:
    a = ...
    ... = a
• Anti-dependence:
    ... = a
    a = ...
• Output dependence:
    a = ...
    a = ...

Affine Array Accesses
• Common patterns of data accesses (i, j, k are loop indexes):
    A[i], A[j], A[i-1], A[0], A[i+j], A[2*i], A[2*i+1]
    A[i,j], A[i-1,j+1]
• Array indexes are affine expressions of the surrounding loop indexes
– Loop indexes: i_n, i_{n-1}, ..., i_1
– Integer constants: c_n, c_{n-1}, ..., c_0
– Array index: c_n*i_n + c_{n-1}*i_{n-1} + ... + c_1*i_1 + c_0
– An affine expression is a linear expression plus a constant term (c_0)

II. Formulating Data Dependence Analysis
  FOR i := 2 to 5 do
    A[i-2] = A[i] + 1;
• Between the read access A[i] and the write access A[i-2] there is a dependence if:
– there exist two iterations ir and iw within the loop bounds, s.t.
– iterations ir and iw read and write the same array element, respectively:
    ∃ integers iw, ir
      2 ≤ iw, ir ≤ 5
      ir = iw - 2
• Between the write access A[i-2] and the write access A[i-2] there is an (output) dependence if:
    ∃ integers iw, iv
      2 ≤ iw, iv ≤ 5
      iw - 2 = iv - 2
– To rule out the case where the same instance depends on itself, add the constraint iw ≠ iv

Memory Disambiguation Is Undecidable at Compile Time
  read(n)
  FOR i = ...
    a[i] = a[n]

Domain of Data Dependence Analysis
• Only handle loop bounds and array indexes that are affine functions of loop variables
  for i = 1 to n
    for j = 2i to 100
      a[i+2j+3][4i+2j][i*i] = ...
      ... = a[1][2i+1][j]
• Assume a data dependence between the read and write operations if there exist:
– a read instance with indexes ir, jr, and
– a write instance with indexes iw, jw, such that
    ∃ integers ir, jr, iw, jw
      1 ≤ iw, ir ≤ n
      2iw ≤ jw ≤ 100
      2ir ≤ jr ≤ 100
      iw + 2jw + 3 = 1
      4iw + 2jw = 2ir + 1
• Equate each dimension of the array accesses; ignore the non-affine ones
– No solution ⇒ no data dependence
– Solution ⇒ there may be a dependence

Complexity of Data Dependence Analysis
• Equivalent to integer linear programming:
    ∃ integer vector i
      A1 i ≤ b1
      A2 i = b2
• Integer linear programming is NP-complete
– O(size of the coefficients) or O(n^n)
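For constant loop bounds like those in the A[i-2] = A[i] + 1 example, the existential formulation can be checked directly by brute force. Below is a minimal sketch; the helper name has_dependence is an illustrative choice, not from the lecture.

```python
# FOR i := 2 to 5 do  A[i-2] = A[i] + 1
# Iteration iw writes A[iw-2]; iteration ir reads A[ir].
# A dependence exists if some iw, ir in [2, 5] touch the same element,
# i.e. if iw - 2 == ir.
def has_dependence():
    for iw in range(2, 6):
        for ir in range(2, 6):
            if iw - 2 == ir:
                return (iw, ir)   # witness pair
    return None

print(has_dependence())   # (4, 2): iteration 4 writes A[2], iteration 2 reads it
```

Real dependence analyzers cannot enumerate like this (bounds may be symbolic, e.g. n), which is why the problem is cast as integer linear programming.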
For every pair of accesses (F1, f1) and (F2, f2), not necessarily distinct, at least one of which is a write operation:
Let B1 i1 + b1 ≥ 0 and B2 i2 + b2 ≥ 0 be the corresponding loop bound constraints. There is a dependence if:
    ∃ integers i1, i2
      B1 i1 + b1 ≥ 0, B2 i2 + b2 ≥ 0
      F1 i1 + f1 = F2 i2 + f2
If the accesses are not distinct, add the constraint i1 ≠ i2.

Data Dependence Analysis Algorithm
• Typically solving many tiny, repeated problems
– Integer linear programming packages optimize for large problems
– Use memoization to remember the results of simple tests
• Apply a series of relatively simple tests
– GCD test: e.g. 2*i vs. 2*i+1; GCD for simultaneous equations
– Test whether the index ranges overlap
• Backed up by a more expensive algorithm
– Use Fourier-Motzkin elimination to test whether there is a real solution
• Keep eliminating variables to see if a solution remains
• If there is no real solution, then there is no integer solution

Fourier-Motzkin Elimination
• Eliminates a variable from a set of linear inequalities
• To eliminate a variable x1:
– Rewrite all constraints as lower or upper bounds on x1
– Create a transitive constraint for each pair of a lower and an upper bound
• Example: let L and U denote lower and upper bounds, respectively. To eliminate x1 from
    L1(x2, ..., xn) ≤ x1 ≤ U1(x2, ..., xn)
    L2(x2, ..., xn) ≤ x1 ≤ U2(x2, ..., xn)
  replace the system with
    L1(x2, ..., xn) ≤ U1(x2, ..., xn)
    L1(x2, ..., xn) ≤ U2(x2, ..., xn)
    L2(x2, ..., xn) ≤ U1(x2, ..., xn)
    L2(x2, ..., xn) ≤ U2(x2, ..., xn)

Example
  FOR i = 1 to 5
    FOR j = i+1 to 5
      A[i,j] = f(A[i,i], A[i-1,j])
Loop bound constraints (primed variables for the second access):
    1 ≤ i,   i ≤ 5,   i+1 ≤ j,   j ≤ 5
    1 ≤ i',  i' ≤ 5,  i'+1 ≤ j', j' ≤ 5

1: Data dependence between A[i,j] and A[i,i]?
    i = i', j = i'
    Combined with i+1 ≤ j, this gives i'+1 ≤ i', a contradiction: no dependence.

2: Data dependence between A[i,j] and A[i-1,j]?
    i = i' - 1, i.e. i+1 = i'
    j = j'
  Substituting into the primed bounds:
    1 ≤ i+1,  i+1 ≤ 5,  i+2 ≤ j,  j ≤ 5
  Combining with the unprimed bounds:
    1 ≤ i;  i ≤ 4;  i ≤ j-2;  j ≤ 5
  Eliminating i:
    1 ≤ 4;  1 ≤ j-2;  j ≤ 5
    i.e. 3 ≤ j;  j ≤ 5
  Eliminating j:
    3 ≤ 5
  The system is satisfiable, so there may be a dependence.
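The elimination steps just performed can be sketched in code. The following is a minimal sketch that keeps every constraint in "sum of coefficients times variables ≤ bound" form with integer coefficients; the function name fm_eliminate and the (coeffs, bound) dict encoding are illustrative choices, not from the lecture.

```python
def fm_eliminate(constraints, var):
    """One Fourier-Motzkin step: eliminate `var` from a system of
    constraints, each encoded as (coeffs, bound) meaning
    sum(coeffs[v] * v) <= bound, with integer coefficients."""
    lowers, uppers, rest = [], [], []
    for coeffs, bound in constraints:
        c = coeffs.get(var, 0)
        if c > 0:
            uppers.append((coeffs, bound))   # var <= (bound - rest) / c
        elif c < 0:
            lowers.append((coeffs, bound))   # var >= (bound - rest) / c
        else:
            rest.append((coeffs, bound))
    # Transitive constraint for each (lower, upper) pair: L <= U,
    # cross-multiplied to stay in integer arithmetic.
    for lco, lb in lowers:
        cl = lco[var]                        # negative coefficient
        for uco, ub in uppers:
            cu = uco[var]                    # positive coefficient
            new = {}
            for v in set(lco) | set(uco):
                if v == var:
                    continue
                a = cu * lco.get(v, 0) - cl * uco.get(v, 0)
                if a:
                    new[v] = a
            rest.append((new, cu * lb - cl * ub))
    return rest

# The example above: 1 <= i, i <= 4, i <= j-2, j <= 5 (moved to "<= bound" form).
cons = [({'i': -1}, -1),          # 1 <= i
        ({'i': 1}, 4),            # i <= 4
        ({'i': 1, 'j': -1}, -2),  # i <= j - 2
        ({'j': 1}, 5)]            # j <= 5
cons = fm_eliminate(cons, 'i')    # leaves 1 <= 4, 3 <= j, j <= 5
cons = fm_eliminate(cons, 'j')    # leaves 1 <= 4, 3 <= 5
feasible = all(bound >= 0 for coeffs, bound in cons if not coeffs)
print(feasible)   # True: a real solution exists, so there may be a dependence
```

After all variables are eliminated, only constant constraints 0 ≤ bound remain; the system has a real solution iff every remaining bound is nonnegative, mirroring the 3 ≤ 5 conclusion above.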
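The GCD test listed among the simple tests earlier can also be sketched. This is a minimal one-equation version (the function name gcd_test is an illustrative choice): the equation a1*i1 + ... + ak*ik = c has an integer solution iff the gcd of the coefficients divides c.

```python
from math import gcd

def gcd_test(coeffs, c):
    """Return True if sum(a * x) = c can have an integer solution,
    i.e. if the gcd of the coefficients divides c."""
    g = 0
    for a in coeffs:
        g = gcd(g, abs(a))
    return g != 0 and c % g == 0

# The slide's example: a[2*i] written in one iteration, a[2*i+1] read
# in another.  Same element requires 2*iw = 2*ir + 1, i.e. 2*iw - 2*ir = 1.
print(gcd_test([2, -2], 1))   # False: gcd(2, 2) = 2 does not divide 1
```

A False result proves independence cheaply; only when such quick tests are inconclusive does the analysis fall back on the more expensive Fourier-Motzkin machinery.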