1Programming for Parallelismand Locality withHierarchically Tiled ArraysProgram Optimization Fall 07University of Illinois,Urbana-ChampaignPeople working in this project:Ganesh Bikshandi, Jia Guo, Daniel Hoeflinger,Gheorghe Almasi*, Basilio B.Fraguela+, Maria J.Garzaran, David Padua, Christoph von Praun*University of Illinois,Urbana-Champaign,*IBM T. J. Watson Research Center,+Universidade de Coruna, Spain210/26/07 HTA 3Motivation Memory Hierarchy Determines performance Increasingly complex Registers – Cache (L1..L3) – Memory - Distributed Memory Given the importance of memory hierarchy,programming constructs that allow the control oflocality are necessary.10/26/07 HTA 4Contributions Design of a programming paradigm that facilitates control of locality (in sequential and parallel programs) communication (in parallel programs) Implementations in MATLAB and C++ Distributed Memory (Clusters) Shared Memory (Multi-core) Implementation of NAS benchmarks in our paradigm Evaluation of productivity and performance310/26/07 HTA 5Motivation Of Our ResearchHierarchically Tiled ArraysExamples and Additional FeaturesEvaluationConclusion & Future workOutline of the talk10/26/07 HTA 6 Tiled Array: Array partitioned into tiles Hierarchically Tiled Array: Tiled Array whosetiles are in-turn HTAs or Leaf HTAs.Leaf HTA: HTA with no further levels of tilingHierarchically Tiled Arrays (HTA)410/26/07 HTA 7Hierarchically Tiled ArraysAn example tiling2 X 2 tiles4 X 4 tiles2 X 2 tiles10/26/07 HTA 8Hierarchically Tiled Arrays2 X 2 tilessize = 64 X 64 elements4 X 4 tilessize = 32 X 32 elements2 X 2 tilessize = 8 X 8 elementsAn example size510/26/07 HTA 9Hierarchically Tiled Arrays2 X 2 tilessize = 64 X 64 elementsmap to distinct modulesof a cluster4 X 4 tilessize = 32 X 32 elementsmap to L1-cache2 X 2 tilessize = 8 X 8 elementsmap to registersAn example mapping10/26/07 HTA 10Why tiles as first class objects? Many algorithms are formulated in terms oftiles e.g. linear algebra algorithms (in LAPACK,ATLAS) What is special about HTAs, in particular? HTA provides a convenient way for theprogrammer to express these types of algorithms Its recursive nature makes it easy to fit in to anymemory hierarchy.610/26/07 HTA 11hta(MATRIX,{[delim1],[delim2],…},[processor mesh])P1P3P2P4aha = hta(a,{[1,4],[1,3]},[2,2]);haprocessormappingHTA Construction141310/26/07 HTA 12HTA Addressing710/26/07 HTA 13h{1,1:2} h{2,1} hierarchical HTA Addressingtiles10/26/07 HTA 14h{1,1:2} h{2,1} hierarchical h(3,4) flattenedHTA Addressingelements810/26/07 HTA 15h{1,1:2} h{2,1} hierarchical h{2,2}(1,2) hybridh(3,4) flattenedorHTA Addressing10/26/07 HTA 16F90 Conformabilitydim(lhs) == dim(rhs).and.size(lhs) == size(rhs)size(rhs) == 1910/26/07 HTA 17 HTA ConformabilityAll conformable HTAs can be operated using the primitive operations(add, subtract, etc) 10/26/07 HTA 18HTA Assignmenth{:,:} = t{:,:}h{1,:} = t{2,:}h{1,:}(2,:) = t{2,:}(1,:);orh(2,:) = t(3,:); htimplicit communication1010/26/07 HTA 19Data parallel functions in F90sinsin( )sin( )sin( ) sin( )sin( )sin( )sin( )sin( )sin( )10/26/07 HTA 20Mapr = map (@sin, h)Recursive@sin1110/26/07 HTA 21@sin @sin@sinMapr = map (@sin, h)@sinRecursive@sin10/26/07 HTA 22@sin@sin@sin@sin @sin@sin@sin@sin @sin @sin@sin@sinMapr = map (@sin, h)Recursive@sin@sin@sin@sin1210/26/07 HTA 23@max@max@max@max @max@max@max@max @max @max@max@maxReducer = reduce (@max, h)Recursive@max10/26/07 HTA 24Reducer = reduce (@max, h)@max@max@maxRecursive@max@max@max@max @max@max@max@max @max @max@max@max@max1310/26/07 HTA 25Reducer = reduce (@max, h)Recursive@max@max@max@max@max10/26/07 HTA 26repmat(h, [1, 3])circshift( h, [0, -1] )transpose(h)Higher level operations1410/26/07 HTA 27Motivation Of Our ResearchHierarchically Tiled ArraysExamples and Additional FeaturesEvaluationConclusion & Future workOutline of the talk10/26/07 HTA 28Blocked Recursive Matrix Multiplicationfunction c = matmul (A, B, C)if (level(A) == 0)C = matmul_leaf (A, B, C)else for i = 1:size(A, 1)for k = 1:size(A, 2)for j = 1:size(B, 2)C{i, j} = matmul(A{i, k}, B{k, j}, C{i,j});endendendendBlocked-Recursive implementationA{i, k}, B{k, j} and C{i, j} are sub-matrices of A, B and C.1510/26/07 HTA 29Matrix Multiplicationmatmul (AA{i, k}, BB{k, j}, CC{i, j})matmul_leaf (AAA{i, k}, BBB{k, j}, CCC{i, j})matmul (A, B, C)matmul (A{i, k}, B{k, j}, C{i, j})10/26/07 HTA 30SUMMA Matrix MultiplicationOuter-product methodfor k=1:n for i=1:n for j=1:n C(i,j)=C(i,j)+A(i,k)*B(k,j); end end endfor k=1:n C(:,:)=C(:, :)+A(:,k) * B(k, :);end1610/26/07 HTA 31SUMMA Matrix MultiplicationT2BT1A*function C = summa (A, B, C)for k=1:m T1 = repmat(A{:, k}, 1, m); T2 = repmat(B{k, :}, m, 1); C = C + T1 * T2; endHere, C{i,j}, A{i,k}, B{k,j}represent submatrices.The * operator representstile-by-tile matrixmultiplicationrepmatrepmat10/26/07 HTA 32Stencil Computations A common computation is a stencil – wherean element’s value is based on those of itselfand its neighbors. Let element X equal the average of itself andits neighbors A(X) = 1/3 * ( A(X-1) + A(X) + A(X+1) )1710/26/07 HTA 33Stencil Computations If tiles are distributed, communication isnecessary to read the value of neighbors atthe tile boundaries One solution: Shadow Regions10/26/07 HTA 34Shadow Regions Insert “padding” cells at the borders of tiles tohold the adjacency data Requires manual updates by the programmerto remain consistent The programmer overhead of this approachleads to a feature of HTAs, Overlapped Tiling1810/26/07 HTA 35Overlapped Tiling Overlapped HTAs are regular HTAs with afew exceptions: At creation, the programmer specifies the amountof overlap in each dimension Overlapped HTAs provide the functionality ofshadow regions without any manual updating bythe programmer The library handles any communication in thecorrect manner10/26/07 HTA 36StencilWith Explicit Shadow RegionsDo X=1,MA{1:N-1}(6) = A{2:N}(2)A{2:N}{1} = A{1:N-1}(5)A(X) = 1/3 * ( A(X-1) +A(X) +A(X+1) )EnddoWith Overlapped TilingDo X=1,MA(X) = 1/3 * ( A(X-1) +A(X) +A(X+1) )Enddo1910/26/07 HTA 37Dynamic Partitioning Another feature of HTAs is dynamicpartitioning Normally, the tiling structure of a HTA is setat creation; however, the partitioning forsome problems might be input dependent. Dynamic partitioning allows the programmerto: Add tiling to flat arrays Turn tiled arrays into
View Full Document