UCLA COMSCI 239 - Automatic Support for Irregular Computations in a High-Level Language

Automatic Support for Irregular Computations in a High-Level Language

Jimmy Su and Katherine Yelick
Computer Science Division, University of California at Berkeley
{jimmysu,yelick}@cs.berkeley.edu

Abstract

The problem of writing high performance parallel applications becomes even more challenging when irregular, sparse or adaptive methods are employed. In this paper we introduce compiler and runtime support for programs with indirect array accesses into Titanium, a high-level language that combines an explicit SPMD parallelism model with implicit communication through a global shared address space. By combining the well-known inspector-executor technique with high level multi-dimensional array constructs, compiler analysis and performance modeling, we demonstrate optimizations that are entirely hidden from the programmer. The global address space makes the programs easier to write than in message passing, with remote array accesses used in place of explicit messages with data packing and unpacking. The programs are also faster than message passing programs: Using sparse matrix-vector multiplication programs, we show that the Titanium code is an average of 21% faster across several matrices and machines, with the best case speedup more than a factor of 2x. The performance advantages are due to both the lightweight RDMA (Remote Direct Memory Access) communication model that underlies the Titanium implementation and automatic optimization selection that adapts the communication to the machine and workload, in some cases using different communication models for different processors within a single computation.

1. Introduction

Application scientists increasingly employ irregular data structures such as unstructured grids, particle-mesh structures, adaptive meshes and sparse matrices in an effort to obtain more computationally efficient methods.
These methods are not well supported by popular high performance programming models, because they lead to indirect array accesses, pointer-based data structures, and communication that is unpredictable in both timing and volume. Titanium [23] is a Java(TM)-based language designed for high performance computing, with a shared address space to ease programming of pointer-based data structures and a powerful multidimensional array abstraction that has proven useful in both adaptive and non-adaptive block-structured algorithms.

In this paper we propose new compiler and runtime extensions for the Titanium implementation to support programs with indirect array accesses, such as A[B[i]], where the A array may live in a remote processor's memory. These computations arise in sparse iterative solvers, particle-mesh methods, and elsewhere. We add compiler support for an inspector-executor execution model, which optimizes communication by performing runtime optimization based on the dynamic pattern of indices in the indirection array, which is B in the previous example. There are several possible transformations that can be done on the communication, and we consider three in this paper: sending the entire remote array, sending exactly those elements that are needed, and sending a bounding box of the required values. While packing is guaranteed to send the minimal number of actual values, it has a higher metadata overhead and is therefore not necessarily optimal. One of the challenges is to select the best communication transformation, because it depends on properties of the application and the machine. We introduce a simple analytical performance model into the compiler, which selects optimizations automatically.

We analyze the benefits of the automated inspector-executor transformation using sparse matrix-vector multiplication on a set of matrices from real applications.
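To make the transformation concrete, the following is a minimal Python sketch of the inspector-executor pattern for the access A[B[i]], with the three communication strategies above and a toy linear cost model standing in for the compiler's analytical performance model. All names and the cost constants (alpha, beta, gamma) are illustrative assumptions, not the paper's actual implementation; the remote fetch is modeled as a local slice or copy.

```python
def inspect(B, remote_len):
    """Inspector: scan the indirection array once to learn which
    elements of the remote array A are actually needed."""
    needed = sorted(set(B))
    return {
        "needed": needed,                 # exact indices, for packing
        "bbox": (needed[0], needed[-1]),  # bounding box of required values
        "remote_len": remote_len,         # size of the whole remote array
    }

def choose_strategy(info, alpha=10.0, beta=1.0, gamma=2.0):
    """Pick the cheapest transformation with a toy cost model:
    alpha = per-message overhead, beta = cost per transferred value,
    gamma = per-element metadata overhead of packing."""
    n_exact = len(info["needed"])
    n_bbox = info["bbox"][1] - info["bbox"][0] + 1
    costs = {
        "whole_array":  alpha + beta * info["remote_len"],
        "pack":         alpha + (beta + gamma) * n_exact,
        "bounding_box": alpha + beta * n_bbox,
    }
    return min(costs, key=costs.get)

def execute(A_remote, B, info, strategy):
    """Executor: 'fetch' the data with the chosen strategy, then run
    the original indirect loop A[B[i]] over the local buffer."""
    if strategy == "whole_array":
        return [A_remote[j] for j in B]
    if strategy == "bounding_box":
        lo, hi = info["bbox"]
        local = A_remote[lo:hi + 1]       # one contiguous transfer
        return [local[j - lo] for j in B]
    # pack: transfer exactly the needed elements plus their indices
    local = {j: A_remote[j] for j in info["needed"]}
    return [local[j] for j in B]

A = [10, 20, 30, 40, 50, 60, 70, 80]
B = [6, 5, 6, 7]                          # clustered indices
info = inspect(B, len(A))
strategy = choose_strategy(info)
print(strategy, execute(A, B, info, strategy))
```

With these particular constants the clustered index set makes the bounding-box transfer cheapest; a scattered index set or a tiny remote array would tip the model toward packing or sending the whole array, which is why the selection must be made at runtime, per machine and per index pattern.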
Our results show that although the program is significantly simpler when written in Titanium, because it avoids the explicit communication code and the pack and unpack code, the performance is almost always superior to a popular MPI message passing code. The speedup relative to MPI on a suite of over 20 matrices averages 21% on three different machines, with a maximum speedup of more than 2x. The model-based optimization selection is critical to both programmability and performance. Not only does the model select optimizations that differ by matrix and machine, but they also differ between processors within a single matrix-vector multiplication. This runtime complexity is entirely hidden from the programmer, making the application both cleaner and faster.

2. Related Work

The idea of inspector-executor optimizations for scientific codes is not new. Walker proposed and implemented the idea of using a pre-computed communication schedule for indirect array accesses to distributed arrays in a Particle-In-Cell application [20]. The idea is widely used in application codes today, where it is programmed manually. Use of the technique with compiler and library support was developed by Berryman and Saltz. The PARTI runtime library [2] and its successor CHAOS [14] provided primitives for application programmers to apply the inspector-executor optimization at the source level. The same research group provided the dataflow framework to determine where a communication schedule can be generated, where communication operations are placed, and when schedules can be combined [9]. As an experimental result, they manually carried out the optimizations that would have been suggested by the dataflow framework. The ARF [18] and KALI [11] compilers were able to automatically generate inspector-executor pairs for simply nested loops. Slicing analysis was developed to extend the inspector-executor paradigm to multiple levels of indirection [6].
More recently, the inspector-executor technique was used to develop runtime reordering of data and computation that enhances memory locality in applications with sparse data structures [15]. Benkner [3]

