Programming for Parallelism and Locality with Hierarchically Tiled Arrays Ganesh Bikshandi Jia Guo Daniel Hoeflinger Gheorghe Almasi Basilio B Fraguela Maria J Garzaran David Padua Christoph von Praun University of Illinois Urbana Champaign IBM T J Watson Research Center Universidade de Coruna Motivation Memory Hierarchy Determines performance Increasingly complex Registers Cache L1 L3 Memory Distributed Memory Gap between the programs and their mapping into the memory hierarchy Given the importance of memory hierarchy programming constructs that allow the control of locality are necessary 01 15 19 HTA 2 Contributions Design of a programming paradigm that facilitates control of locality in sequential and parallel programs communication in parallel programs Implementations in MATLAB C and X10 Implementation of NAS benchmarks in our paradigm Evaluation of productivity and performance 01 15 19 HTA 3 Outline of the talk Motivation Of Our Research Hierarchically Tiled Arrays Examples Evaluation Conclusion Future work 01 15 19 HTA 4 Hierarchically Tiled Arrays HTA Tiled Array Array partitioned into tiles Hierarchically Tiled Array Tiled Array whose tiles are in turn HTAs 01 15 19 HTA 5 Hierarchically Tiled Arrays An example tiling 2 X 2 tiles 4 X 4 tiles 2 X 2 tiles 01 15 19 HTA 6 Hierarchically Tiled Arrays An example size 2 X 2 tiles size 64 X 64 elements 4 X 4 tiles size 32 X 32 elements 2 X 2 tiles size 8 X 8 elements 01 15 19 HTA 7 Hierarchically Tiled Arrays An example mapping 2 X 2 tiles size 64 X 64 elements map to distinct modules of a cluster 4 X 4 tiles size 32 X 32 elements map to L1 cache 2 X 2 tiles size 8 X 8 elements map to registers 01 15 19 HTA 8 Why tiles as first class objects Many algorithms are formulated in terms of tiles e g linear algebra algorithms in LAPACK ATLAS What is special about HTAs in particular HTA provides a convenient way for the programmer to express these types of algorithms Its recursive nature makes it easy to fit in to any memory hierarchy 01 15 19 HTA 9 HTA Construction hta MATRIX delim1 delim2 processor mesh a 1 1 3 4 01 15 19 P1 P3 P2 P4 processor mapping ha ha hta a 1 4 1 3 2 2 HTA 10 HTA Addressing 01 15 19 HTA 11 HTA Addressing tiles h 1 1 2 h 2 1 hierarchical 01 15 19 HTA 12 HTA Addressing elements h 1 1 2 h 3 4 h 2 1 flattened hierarchical 01 15 19 HTA 13 HTA Addressing h 1 1 2 h 3 4 h 2 1 flattened or hierarchical h 2 2 1 2 hybrid 01 15 19 HTA 14 F90 Conformability size rhs 1 dim lhs dim rhs and size lhs size rhs 01 15 19 HTA 15 HTA Conformability All conformable HTAs can be operated using the primitive operations add subtract etc 01 15 19 HTA 16 HTA Assignment h t h t h 1 t 2 h 1 2 t 2 1 implicit communication 01 15 19 HTA 17 Data parallel functions in F90 sin sin sin sin sin sin sin sin sin sin 01 15 19 HTA 18 Map r map sin h sin Recursive 01 15 19 HTA 19 Map r map sin h sin sin sin sin sin Recursive 01 15 19 HTA 20 Map r map sin h sin sin sin sin sin sin sin sin sin sin sin sin sin sin sin sin Recursive 01 15 19 HTA 21 Reduce r reduce max h max max max max max max max max max max max max max Recursive 01 15 19 HTA 22 Reduce r reduce max h max max max max max max max max max max max max max max max max Recursive 01 15 19 HTA 23 Reduce r reduce max h max max max max max Recursive 01 15 19 HTA 24 Higher level operations repmat h 1 3 circshift h 0 1 transpose h 01 15 19 HTA 25 Outline of the talk Motivation Of Our Research Hierarchically Tiled Arrays Examples Evaluation Conclusion Future work 01 15 19 HTA 26 Blocked Recursive Matrix Multiplication Blocked Recursive implementation A i k B k j and C i j are sub matrices of A B and C function c matmul A B C if level A 0 C matmul leaf A B C else for i 1 size A 1 for k 1 size A 2 for j 1 size B 2 C i j matmul A i k B k j C i j end end end end 01 15 19 HTA 27 Matrix Multiplicationmatmul A B C matmul A i k B k j C i j matmul AA i k BB k j CC i j matmul leaf AAA i k BBB k j CCC i j 01 15 19 HTA 28 Matrix Multiplicationmatmul A B C matmul A i k B k j C i j matmul AA i k BB k j CC i j matmul leaf AAA i k BBB k j CCC i j 01 15 19 for i 1 size A 1 for k 1 size A 2 for j 1 size B 2 C i j C i j A i k B k j end end end HTA 29 SUMMA Matrix Multiplication Outer product method for k 1 n for i 1 n for j 1 n C i j C i j A i k B k j end end end for k 1 n C C A k B k end 01 15 19 HTA 30 SUMMA Matrix Multiplication function C summa A B C for k 1 m T1 repmat A k 1 m T2 repmat B k m 1 C C T1 T2 end Here C i j A i k B k j represent submatrices The operator represents tile by tile matrix multiplication B A 01 15 19 repmat repmat T1 HTA T2 31 Outline of the talk Motivation Of Our Research Hierarchically Tiled Arrays Examples Evaluation Conclusion Future work 01 15 19 HTA 32 Implementation MATLAB Extension with HTA X10 Extension Global view single threaded Front end is pure MATLAB MPI calls using MEX interface MATLAB library routines at the leaf level Emulation only no experimental results C extension with HTA Under progress Communication using MPI UPC 01 15 19 HTA 33 Evaluation Implemented the full set of NAS benchmarks except SP Pure MATLAB MATLAB HTA Performance metrics running time relative speedup on 1 128 processors Productivity metric source lines of code Compared with hand optimized F77 MPI version Tiles only used for parallelism Matrix matrix multiplication using HTAs in C Tiles used for locality 01 15 19 HTA 34 Illustrations MG 3D Stencil convolution primitive HTA operations and assignments to implement communication restrict 01 15 19 interpolate HTA 36 MG r 1 2 r 2 1 communication r 2 n 1 2 n 1 0 5 u 2 n 1 2 n 1 0 25 u 1 n 2 2 n 1 u 3 n 2 n 1 u 2 n 1 1 n 2 u 2 …
View Full Document
Unlocking...