UCLA COMSCI 239 - Effective Automatic Parallelization of Stencil Computations - D1674218

Home> Schools> University of California, Los Angeles> Computer Science (COMSCI) > COMSCI 239> Effective Automatic Parallelization of Stencil Computations

DOC PREVIEW

UCLA COMSCI 239 - Effective Automatic Parallelization of Stencil Computations

School name University of California, Los Angeles

Course Comsci 239- Current Topics in Computer Science: Programming Languages and Systems

Pages 10

This preview shows page 1-2-3 out of 10 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 10 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 10 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 10 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 10 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

Effective Automatic Parallelization of Stencil ComputationsSriram Krishnamoorthy1Muthu Baskaran1Uday Bondhugula1J. Ramanujam2Atanas Rountev1P. Sadayappan11Dept. of Computer Science and Engineering2Dept. of Electrical & Computer Engg. andThe Ohio State University Center for Computation & Technology2015 Neil Ave. Columbus, OH, USA Louisiana State University{krishnsr,baskaran,bondhugu,rountev,saday}@cse.ohio-state.edu [email protected] optimization of stencil computations has beenwidely studied in the literature, since they occur in manycomputationally intensive scientific and engineering appli-cations. Compiler frameworks have also been developed thatcan transform sequential stencil codes for optimization ofdata locality and parallelism. However, loop skewing is typ-ically required in order to tile stencil codes along the timedimension, resulting in load imbalance in pipelined parallelexecution of the tiles. In this paper, we develop an approachfor automatic parallelization of stencil codes, that explicitlyaddresses the issue of load-balanced execution of tiles. Ex-perimental results are provided that demonstrate the effec-tiveness of the approach.Categories and Subject Descriptors D.3.4 [ProgrammingLanguages]: Processors—Compilers, OptimizationGeneral Terms Algorithms, PerformanceKeywords Stencil computations, Tiling, Automatic paral-lelization, Load balance1. IntroductionStencil computations represent a practically important classof computations that arise in many scientific/engineeringcodes. Computational domains that involve stencils in-clude those that use explicit time-integration methods fornumerical solution of partial differential equations (e.g.,climate/weather/ocean modeling [23], computational elec-tromagnetics codes using the Finite Difference Time Do-main method [27]), and multimedia/image-processing ap-plications that perform smoothing and other neighbor pixelbased computations [13]. There has been some prior workfrom the computer science community that has addressedPermission to make digital or hard copies of all or part of this work for personal orclassroom use is granted without fee provided that copies are not made or distributedfor profit or commercial advantage and that copies bear this notice and the full citationon the first page. To copy otherwise, to republish, to post on servers or to redistributeto lists, requires prior specific permission and/or a fee.PLDI’07Jun 11–13, 2007, San Diego, California, USA.Copyrightc 2007 ACM 978-1-59593-633-2/07/0006. . . $5.00performance optimization of stencil computations (e.g.,[24, 19, 18, 10]). Since stencil computations are character-ized by a regular computational structure, they are amenableto automatic compile-time analysis and transformation forexploitation of parallelism and data locality optimization.However, as elaborated later through an example, existingcompiler frameworks have limitations in generating efficientcode optimized for parallelism and data locality.Loop tiling is the key transformation to enable paral-lelization and data-locality optimization of stencil codes.Much research has been published on tiling of iterationspaces [17, 29, 28, 26, 8, 25, 21, 22, 14, 7, 15, 9, 16, 3].With few exceptions (e.g. work of Griebl [11, 12]), re-search on performance optimization with tiling has gener-ally focused on one or the other of the two complemen-tary aspects: (a) data locality optimization [2, 3, 28, 26, 8];or (b) tile size/shape optimization for parallel execution[25, 21, 6, 14, 7, 15, 9, 16]. Tiling for data locality optimiza-tion involves maximization of data reuse, i.e., tiling alongdirections of the data dependence vectors. But such tilingmay result in inter-tile dependences that inhibit concurrentexecution of tiles on different processors. To the best ofour knowledge, no prior work has addressed in an integratedfashion, the issues of tiling for data locality optimization andload balancing for parallel execution. We first use the simpleexample of a one-dimensional Jacobi code to illustrate theproblem and introduce two approaches we propose to avoidthe problem: overlapped tiles and split tiles. As an exampleof a stencil computation, let us consider the one-dimensionalJacobi code shown in Figure 1. Optimizing this stencil com-putation for reduction of cache misses requires loop fusionand tiling; in order to fuse the two inner loops, loop skew-ing is needed. Frameworks have been previously proposedfor data locality optimizations of imperfectly nested loops.Using an approach proposed by Ahmed et al. [3, 4], theloop nest can be transformed into the one shown in Figure 2by first embedding the iterations in the imperfectly-nestedloops into a perfectly-nested iteration space. Loop trans-formations and tiling can then applied in the transformedperfectly-nested iteration space. The transformed iterationspace can subsequently translated into efficient code by re-ducing/eliminating the control overhead [20]. In this pa-per, we focus on load-balanced parallel execution of tiled235iteration spaces that have already been embedded into aperfectly-nested iteration space using a technique such asthat developed in [4].Figure 3 shows a single-statement form of the 1-D Jacobicode obtained by adding an additional dimension to array A.The flow dependences in this code are the same as that of thepreviously shown version, but there are no anti-dependences.Hence a single statement is sufficient in the loop body in-stead of a sequence of two statements, for update and copy,respectively, as seen in Figures 1 and 2. Although such amemory-inefficient code would not be used in practice, itis more convenient to use a single-statement iteration spacein explaining the main ideas in this paper. However, the de-veloped approach is not restricted to such single-statementloops, but is applicable to general multi-statement stencilcodes such as the one in Figure 1. The generalization ofthe approach for the more memory-efficient multi-statementversions of code is explained in the Appendix. The experi-mental results presented later also use the memory-efficientmulti-statement versions.The perfect loop nest of Figure 3 has constant depen-dences (1, 0), (1, 1), and (1, −1). Tiling for data reuse op-timization (e.g. using the approach presented in [2]) resultsin tiles of shape as shown in Figure 4. The horizontal axiscorresponds to the spatial dimension, with time along thevertical dimension. Using a sufficiently large

View Full Document