UCLA COMSCI 239 - Automatic Support for Irregular Computations in a High-Level Language

Automatic Support for Irregular Computations in a High-Level Language

Jimmy Su and Katherine Yelick
Computer Science Division, University of California at Berkeley
{jimmysu,yelick}@cs.berkeley.edu

Abstract

The problem of writing high performance parallel applications becomes even more challenging when irregular, sparse or adaptive methods are employed. In this paper we introduce compiler and runtime support for programs with indirect array accesses into Titanium, a high-level language that combines an explicit SPMD parallelism model with implicit communication through a global shared address space. By combining the well-known inspector-executor technique with high level multi-dimensional array constructs, compiler analysis and performance modeling, we demonstrate optimizations that are entirely hidden from the programmer. The global address space makes the programs easier to write than in message passing, with remote array accesses used in place of explicit messages with data packing and unpacking. The programs are also faster than message passing programs: Using sparse matrix-vector multiplication programs, we show that the Titanium code is an average of 21% faster across several matrices and machines, with the best case speedup more than a factor of 2x. The performance advantages are due to both the lightweight RDMA (Remote Direct Memory Access) communication model that underlies the Titanium implementation and automatic optimization selection that adapts the communication to the machine and workload, in some cases using different communication models for different processors within a single computation.

1. Introduction

Application scientists increasingly employ irregular data structures such as unstructured grids, particle-mesh structures, adaptive meshes and sparse matrices in an effort to obtain more computationally efficient methods.
These methods are not well supported by popular high performance programming models, because they lead to indirect array accesses, pointer-based data structures, and communication that is unpredictable in both timing and volume. Titanium [23] is a Java(TM)-based language designed for high performance computing, with a shared address space to ease programming of pointer-based data structures and a powerful multidimensional array abstraction that has proven useful in both adaptive and non-adaptive block-structured algorithms.

In this paper we propose new compiler and runtime extensions for the Titanium implementation to support programs with indirect array accesses, such as A[B[i]], where the A array may live in a remote processor's memory. These computations arise in sparse iterative solvers, particle-mesh methods, and elsewhere. We add compiler support for an inspector-executor execution model, which optimizes communication by performing runtime optimization based on the dynamic pattern of indices in the indirection array, which is B in the previous example. There are several possible transformations that can be done on the communication, and we consider three in this paper: sending the entire remote array, sending exactly those elements that are needed, and sending a bounding box of the required values. While packing is guaranteed to send the minimal number of actual values, it has a higher metadata overhead and is therefore not necessarily optimal. One of the challenges is to select the best communication transformation, because it depends on properties of the application and the machine. We introduce a simple analytical performance model into the compiler, which selects optimizations automatically.

We analyze the benefits of the automated inspector-executor transformation using sparse matrix-vector multiplication on a set of matrices from real applications.
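To make the transformation concrete, the following is a minimal Python sketch of the inspector-executor pattern for the access A[B[i]], with the three communication strategies above and a toy linear cost model standing in for the compiler's analytical performance model. All names and the cost constants (alpha, beta, gamma) are illustrative assumptions, not the paper's actual implementation; the remote fetch is modeled as a local slice or copy.

```python
def inspect(B, remote_len):
    """Inspector: scan the indirection array once to learn which
    elements of the remote array A are actually needed."""
    needed = sorted(set(B))
    return {
        "needed": needed,                 # exact indices, for packing
        "bbox": (needed[0], needed[-1]),  # bounding box of required values
        "remote_len": remote_len,         # size of the whole remote array
    }

def choose_strategy(info, alpha=10.0, beta=1.0, gamma=2.0):
    """Pick the cheapest transformation with a toy cost model:
    alpha = per-message overhead, beta = cost per transferred value,
    gamma = per-element metadata overhead of packing."""
    n_exact = len(info["needed"])
    n_bbox = info["bbox"][1] - info["bbox"][0] + 1
    costs = {
        "whole_array":  alpha + beta * info["remote_len"],
        "pack":         alpha + (beta + gamma) * n_exact,
        "bounding_box": alpha + beta * n_bbox,
    }
    return min(costs, key=costs.get)

def execute(A_remote, B, info, strategy):
    """Executor: 'fetch' the data with the chosen strategy, then run
    the original indirect loop A[B[i]] over the local buffer."""
    if strategy == "whole_array":
        return [A_remote[j] for j in B]
    if strategy == "bounding_box":
        lo, hi = info["bbox"]
        local = A_remote[lo:hi + 1]       # one contiguous transfer
        return [local[j - lo] for j in B]
    # pack: transfer exactly the needed elements plus their indices
    local = {j: A_remote[j] for j in info["needed"]}
    return [local[j] for j in B]

A = [10, 20, 30, 40, 50, 60, 70, 80]
B = [6, 5, 6, 7]                          # clustered indices
info = inspect(B, len(A))
strategy = choose_strategy(info)
print(strategy, execute(A, B, info, strategy))
```

With these particular constants the clustered index set makes the bounding-box transfer cheapest; a scattered index set or a tiny remote array would tip the model toward packing or sending the whole array, which is why the selection must be made at runtime, per machine and per index pattern.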
Our results show that although the program is significantly simpler when written in Titanium, because it avoids the explicit communication code and the pack and unpack code, the performance is almost always superior to a popular MPI message passing code. The speedup relative to MPI on a suite of over 20 matrices averages 21% on three different machines, with a maximum speedup of more than 2x. The model-based optimization selection is critical to both programmability and performance. Not only does the model select optimizations that differ by matrix and machine, but they also differ between processors within a single matrix-vector multiplication. This runtime complexity is entirely hidden from the programmer, making the application both cleaner and faster.

2. Related Work

The idea of inspector-executor optimizations for scientific codes is not new. Walker proposed and implemented the idea of using a pre-computed communication schedule for indirect array accesses to distributed arrays in a Particle-In-Cell application [20]. The idea is widely used in application codes today, where it is programmed manually. Use of the technique with compiler and library support was developed by Berryman and Saltz. The PARTI runtime library [2] and its successor CHAOS [14] provided primitives for application programmers to apply the inspector-executor optimization at the source level. The same research group provided the dataflow framework to determine where a communication schedule can be generated, where communication operations are placed, and when schedules can be combined [9]. As an experimental result, they manually carried out the optimizations that would have been suggested by the dataflow framework. The ARF [18] and KALI [11] compilers were able to automatically generate inspector-executor pairs for simply nested loops. Slicing analysis was developed to extend the inspector-executor paradigm to multiple levels of indirection [6].
More recently, the inspector-executor technique was used to develop runtime reordering of data and computation that enhances memory locality in applications with sparse data structures [15]. Benkner [3]

