U of I CS 498 - Programming for Parallelism

Programming for Parallelism and Locality with Hierarchically Tiled Arrays
Program Optimization, Fall 07 (10/26/07)
University of Illinois, Urbana-Champaign

People working on this project:
Ganesh Bikshandi, Jia Guo, Daniel Hoeflinger, Gheorghe Almasi*, Basilio B. Fraguela+, Maria J. Garzaran, David Padua, Christoph von Praun*
University of Illinois, Urbana-Champaign; *IBM T. J. Watson Research Center; +Universidade de Coruna, Spain

Motivation
- The memory hierarchy determines performance and is increasingly complex: Registers – Cache (L1..L3) – Memory – Distributed Memory.
- Given the importance of the memory hierarchy, programming constructs that allow the control of locality are necessary.

Contributions
- Design of a programming paradigm that facilitates control of locality (in sequential and parallel programs) and of communication (in parallel programs).
- Implementations in MATLAB and C++, for distributed memory (clusters) and shared memory (multi-core).
- Implementation of the NAS benchmarks in our paradigm.
- Evaluation of productivity and performance.

Outline of the talk
- Motivation of our research
- Hierarchically Tiled Arrays
- Examples and additional features
- Evaluation
- Conclusion & future work

Hierarchically Tiled Arrays (HTA)
- Tiled Array: an array partitioned into tiles.
- Hierarchically Tiled Array: a tiled array whose tiles are in turn HTAs or leaf HTAs.
- Leaf HTA: an HTA with no further levels of tiling.

An example tiling: 2 x 2 tiles, each containing 4 x 4 tiles, each containing 2 x 2 tiles.

An example size: the outer 2 x 2 tiles are 64 x 64 elements, the 4 x 4 tiles are 32 x 32 elements, and the inner 2 x 2 tiles are 8 x 8 elements.

An example mapping: the outer 2 x 2 tiles (64 x 64 elements) map to distinct modules of a cluster, the 4 x 4 tiles (32 x 32 elements) map to the L1 cache, and the inner 2 x 2 tiles (8 x 8 elements) map to registers.

Why tiles as first-class objects?
- Many algorithms are formulated in terms of tiles, e.g.
  linear algebra algorithms (in LAPACK, ATLAS).
- What is special about HTAs in particular? HTAs provide a convenient way for the programmer to express these types of algorithms, and their recursive nature makes them easy to fit to any memory hierarchy.

HTA Construction
hta(MATRIX, {[delim1],[delim2],...}, [processor mesh])
ha = hta(a, {[1,4],[1,3]}, [2,2]);
The delimiter lists give the row and column boundaries of the tiles; the optional last argument is the processor mesh, which here maps the tiles of ha onto processors P1..P4.

HTA Addressing
- h{1,1:2}, h{2,1}: hierarchical addressing, which selects tiles.
- h(3,4): flattened addressing, which selects elements.
- h{2,2}(1,2): hybrid addressing, which selects an element within a tile.

F90 Conformability
dim(lhs) == dim(rhs) .and. size(lhs) == size(rhs), or size(rhs) == 1

HTA Conformability
All conformable HTAs can be operated on using the primitive operations (add, subtract, etc.).

HTA Assignment
h{:,:} = t{:,:}
h{1,:} = t{2,:}
h{1,:}(2,:) = t{2,:}(1,:);  or, equivalently,  h(2,:) = t(3,:);
When h and t are distributed, such assignments imply communication, performed implicitly by the library.

Data-parallel functions in F90
sin applied to an array applies sin to every element.

Map
r = map (@sin, h)
map works recursively: @sin is mapped over the tiles at each level of the hierarchy until it reaches the elements of the leaf tiles.

Reduce
r = reduce (@max, h)
reduce also works recursively: @max is first applied within the leaf tiles, and the partial results are then combined level by level.

Higher-level operations
repmat(h, [1, 3])
circshift(h, [0, -1])
transpose(h)

Blocked Recursive Matrix Multiplication
function C = matmul (A, B, C)
  if (level(A) == 0)
    C = matmul_leaf (A, B,
 C)
  else
    for i = 1:size(A, 1)
      for k = 1:size(A, 2)
        for j = 1:size(B, 2)
          C{i, j} = matmul(A{i, k}, B{k, j}, C{i, j});
        end
      end
    end
  end
This is a blocked-recursive implementation: A{i, k}, B{k, j} and C{i, j} are sub-matrices of A, B and C.

Matrix Multiplication
The recursion descends one level of tiling per call:
matmul (A, B, C)
matmul (A{i, k}, B{k, j}, C{i, j})
matmul (AA{i, k}, BB{k, j}, CC{i, j})
matmul_leaf (AAA{i, k}, BBB{k, j}, CCC{i, j})

SUMMA Matrix Multiplication (outer-product method)
for k=1:n
  for i=1:n
    for j=1:n
      C(i,j) = C(i,j) + A(i,k) * B(k,j);
    end
  end
end
is equivalent to
for k=1:n
  C(:,:) = C(:,:) + A(:,k) * B(k,:);
end

SUMMA Matrix Multiplication with HTAs
function C = summa (A, B, C)
for k=1:m
  T1 = repmat(A{:, k}, 1, m);
  T2 = repmat(B{k, :}, m, 1);
  C = C + T1 * T2;
end
Here C{i,j}, A{i,k} and B{k,j} represent submatrices, and the * operator represents tile-by-tile matrix multiplication.

Stencil Computations
- A common computation is a stencil, where an element's value is based on those of itself and its neighbors.
- Let element X equal the average of itself and its neighbors: A(X) = 1/3 * ( A(X-1) + A(X) + A(X+1) )
- If tiles are distributed, communication is necessary to read the value of neighbors at the tile boundaries. One solution: shadow regions.

Shadow Regions
- Insert "padding" cells at the borders of tiles to hold the adjacency data.
- Requires manual updates by the programmer to remain consistent.
- The programmer overhead of this approach leads to a feature of HTAs: overlapped tiling.

Overlapped Tiling
Overlapped HTAs are regular HTAs with a few exceptions:
- At creation, the programmer specifies the amount of overlap in each dimension.
- Overlapped HTAs provide the functionality of shadow regions without any manual updating by the programmer.
- The library handles any communication in the correct manner.

Stencil with explicit shadow regions:
Do X=1,M
  A{1:N-1}(6) = A{2:N}(2)
  A{2:N}(1) = A{1:N-1}(5)
  A(X) = 1/3 * ( A(X-1) + A(X) + A(X+1) )
Enddo
With
 overlapped tiling, the stencil needs no explicit boundary exchange:
Do X=1,M
  A(X) = 1/3 * ( A(X-1) + A(X) + A(X+1) )
Enddo

Dynamic Partitioning
- Another feature of HTAs is dynamic partitioning.
- Normally, the tiling structure of an HTA is set at creation; however, the partitioning for some problems might be input dependent.
- Dynamic partitioning allows the programmer to add tiling to flat arrays and to turn tiled arrays into ...
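The addressing scheme described above (curly braces h{i,j} select tiles, parentheses h(i,j) select underlying elements) can be imitated for a two-level case with NumPy views. This is an illustrative sketch only, not the HTA library, which exists for MATLAB and C++; the reshape/transpose layout below is an assumption for the demonstration.

```python
import numpy as np

# A minimal sketch of two-level tiling: a 64x64 array viewed as
# 2x2 outer tiles of 32x32 elements each.
a = np.arange(64 * 64, dtype=np.float64).reshape(64, 64)

# tiles[ti, tj] is the (ti, tj) outer tile, a 32x32 block of a.
tiles = a.reshape(2, 32, 2, 32).transpose(0, 2, 1, 3)  # shape (2, 2, 32, 32)

# "Hierarchical" addressing: select a tile (analogous to h{2,1} in
# 1-based HTA notation -> tiles[1, 0] with 0-based indexing).
tile = tiles[1, 0]
assert tile.shape == (32, 32)

# "Flattened" addressing: select an element of the underlying array
# (analogous to h(3,4) -> a[2, 3]).
elem = a[2, 3]

# "Hybrid" addressing: element (1,2) of tile {2,2} -> tiles[1, 1][0, 1],
# which is the same storage as a[32, 33].
assert tiles[1, 1][0, 1] == a[32, 33]
```

Because reshape and transpose produce views, tile reads and writes go straight to the storage of `a`, which is what makes flattened and hierarchical addressing interchangeable in this sketch.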

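The recursive semantics of map and reduce described in the slides can be sketched in Python, with nested lists standing in for levels of tiling and NumPy arrays as leaf tiles. `hta_map` and `hta_reduce` are hypothetical helper names for this sketch, not part of the HTA library.

```python
import numpy as np

def hta_map(f, h):
    # Recurse through the levels of tiling; apply f at the leaf tiles.
    if isinstance(h, list):
        return [hta_map(f, t) for t in h]
    return f(h)

def hta_reduce(f, h):
    # Reduce each leaf tile, then combine the partial results level by level.
    if isinstance(h, list):
        parts = [hta_reduce(f, t) for t in h]
        return f(np.array(parts))
    return f(h)

# A 2 x 2 "HTA" whose leaf tiles are 2x2 arrays.
h = [[np.ones((2, 2)), np.full((2, 2), 2.0)],
     [np.full((2, 2), 3.0), np.full((2, 2), 4.0)]]

r = hta_map(np.sin, h)     # r has the same tiling structure as h
m = hta_reduce(np.max, h)  # the maximum over all tiles: 4.0
```

The structure-preserving result of `hta_map` mirrors the slides' r = map(@sin, h), and `hta_reduce` mirrors the two-phase picture for reduce: local reductions in the leaves followed by a combine at each enclosing level.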

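The equivalence between the triple loop and the outer-product (rank-1 update) form of SUMMA can be checked directly in NumPy. This single-node sketch omits the tile broadcast that repmat performs in the distributed HTA version; the dimensions and random inputs are assumptions for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

# SUMMA's outer-product formulation: C is accumulated as a sum of
# rank-1 updates, one per value of k.
C = np.zeros((n, n))
for k in range(n):
    # Add the outer product of column k of A and row k of B.
    C += np.outer(A[:, k], B[k, :])

assert np.allclose(C, A @ B)
```

In the HTA version, each iteration broadcasts one tile column of A and one tile row of B to all processors, so every processor can perform its local rank-1 (tile-by-tile) update independently.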

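The shadow-region exchange that overlapped tiling automates can be written out by hand for the 3-point stencil above. In the sketch below, each tile carries one ghost cell per side; `stencil_step` is a hypothetical helper, and the zero values at the array ends are an assumed boundary condition.

```python
import numpy as np

def stencil_step(tiles):
    # tiles: list of 1-D arrays of length w + 2 (w interior + 2 ghost cells).
    # 1. Refresh ghost cells from neighbors (the "shadow region" update
    #    that the programmer must do manually without overlapped tiling).
    for i in range(len(tiles) - 1):
        tiles[i][-1] = tiles[i + 1][1]   # right ghost <- neighbor's first interior
        tiles[i + 1][0] = tiles[i][-2]   # left ghost  <- neighbor's last interior
    # 2. Local update on the interior: A(X) = 1/3 * (A(X-1) + A(X) + A(X+1)).
    for t in tiles:
        t[1:-1] = (t[:-2] + t[1:-1] + t[2:]) / 3.0

# A flat array of 8 elements split into two tiles of 4 interior cells,
# each padded with a zero-initialized ghost cell on either side.
a = np.arange(8, dtype=np.float64)
tiles = [np.concatenate(([0.0], a[:4], [0.0])),
         np.concatenate(([0.0], a[4:], [0.0]))]
stencil_step(tiles)
```

After the exchange, the update of each tile matches the global stencil exactly, which is the guarantee the HTA library provides automatically for overlapped HTAs.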