Berkeley COMPSCI 252 - Implementing a Global Address Space Language

This preview shows pages 1–2 and 24–25 of 25.


Slide titles (full deck): Outline; Unified Parallel C (UPC); Shared Arrays and Pointers in UPC; Accessing Shared Memory in UPC; UPC Programming Model Features; Overview of the Berkeley UPC Compiler; A Layered Design; The Cray X1 Architecture; GASNet Communication System: Architecture; GASNet Extended API: Remote Memory Operations; GASNet and Cray X1 Remote Memory Operations; GASNet/X1 Performance; Shared Pointer Representations; Cost of Shared Pointer Arithmetic and Accesses; Serial Performance; Livermore Loop Kernels; Evaluating Communication Optimizations on Cray X1; NAS CG: OpenMP Style vs. MPI Style; More Optimizations; Conclusion

Implementing a Global Address Space Language on the Cray X1: the Berkeley UPC Experience
Christian Bell and Wei Chen
CS252 Class Project
December 10, 2003
Unified Parallel C at LBNL/UCB

Outline
• An overview of UPC and the Berkeley UPC compiler
• Overview of the Cray X1
• Implementing the GASNet layer on the X1
• Implementing the runtime layer on the X1
• Serial performance
• Evaluation of compiler optimizations

Unified Parallel C (UPC)
• UPC is an explicitly parallel global address space language with SPMD parallelism
- An extension of ISO C
- User-level shared memory, partitioned by threads
- One-sided (bulk and fine-grained) communication through reads/writes of shared variables
[Figure: global address space with a shared region holding X[0], X[1], …, X[P] above per-thread private regions]

Shared Arrays and Pointers in UPC
• Cyclic: shared int A[n];
• Block cyclic: shared [2] int B[n];
• Indefinite: shared [0] int * C = (shared [0] int *) upc_alloc(n);
• Use pointer-to-shared to access shared data
- Block size is part of the pointer type
- A generic pointer-to-shared contains: address, thread id, phase
- Cyclic and indefinite pointers are phaseless
[Figure: layout across two threads — T0 holds A[0], A[2], A[4], …; B[0], B[1], B[4], B[5], …; C[0], C[1], C[2], …; T1 holds A[1], A[3], A[5], …; B[2], B[3], B[6], B[7], …]

Accessing Shared Memory in UPC
[Figure: a pointer-to-shared with fields (address, thread, phase) selecting an element within a block of a shared array; diagram labels: start of array object, start of block, block size, phase, shared memory on threads 0, 1, …, N-1]

UPC Programming Model Features
• Block-cyclically distributed arrays
• Shared and private pointers
• Global synchronization: barriers
• Pair-wise synchronization: locks
• Parallel loops
• Dynamic shared memory allocation
• Bulk shared memory accesses
• Strict vs. relaxed memory consistency models

Overview of the Berkeley UPC Compiler
Two goals: portability and high performance
• Translator: lowers UPC code into ANSI C code (platform- and network-independent)
• Berkeley UPC Runtime System: shared memory management and pointer operations (compiler-independent)
• GASNet Communication System: uniform get/put interface over the underlying network hardware (language-independent)

A Layered Design
• Portable:
- C is our intermediate language
- Can run on top of MPI (with a performance penalty)
- GASNet has a layered design with a small core
• High-performance:
- Native C compiler optimizes serial code
- Translator can perform high-level communication optimizations
- GASNet can access network hardware directly and provides a rich set of communication/synchronization primitives

The Cray X1 Architecture
• New line of vector architecture
• Two modes of operation:
- SSP: up to 16 CPUs/node
- MSP: multistreams long loops
• Single-node UMA, multi-node NUMA (no caching of remote data)
• Global pointers with low latency and high bandwidth
• All gets/puts must be loads/stores (directly or via the shmem interface)
• Only puts are "non-blocking"; gets are blocking
• Vectorization is crucial:
- Vector pipeline is 2x faster than scalar
- Needed to utilize memory bandwidth
- Strided accesses, scatter/gather, reductions, etc.

GASNet Communication System: Architecture
Two-level architecture to ease implementation:
• Core API
- Based heavily on Active Messages
- Compatibility layer
- Ported to the X1 in 2 days; new algorithm to manipulate queues in shared memory
• Extended API
- Wider interface that includes more complicated operations (puts, gets)
- A reference implementation of the extended API in terms of the core API is provided
- Current revision is tuned especially for the X1, with shared memory as the primary focus (minimal overhead)
[Figure: software stack — compiler-generated code, compiler-specific runtime system, GASNet extended API, GASNet core API, network hardware]

GASNet Extended API: Remote Memory Operations
• GASNet offers expressive put/get primitives
- All gets/puts can be blocking or non-blocking
- Non-blocking can be explicit (handle-based)
- Non-blocking can be implicit (global or region-based)
- Synchronization can poll or block
- Paves the way for complex split-phase communication (compiler optimizations)
• Cray X1 uses exclusively shared memory
- All gets/puts must be loads/stores
- Only puts are "non-blocking"; gets are blocking
- Very limited synchronization mechanisms
- Efficient communication only through vectors (one order of magnitude between scalar and vector communication)
- Vectorization instead of split-phase?

GASNet and Cray X1 Remote Memory Operations
| GASNet operation                          | Cray X1 instruction | Comment                                    |
| Bulk operations                           | Vector bcopy()      | Fully vectorized, suitable for GASNet/UPC  |
| Non-bulk blocking puts                    | Store + gsync       | No vectorization                           |
| Non-bulk blocking gets                    | Load                |                                            |
| Non-bulk non-blocking explicit puts/gets  | Store/load + gsync  | No vectorization if sync done in the loop  |
| Non-bulk non-blocking implicit puts/gets  | Store/load + gsync  | No vectorization if sync done in the loop  |
• Flexible communication provides no benefit without vectorization (factor of 10 between vector and scalar)
• Difficult to expose vectorization through a layered software stack: the native C compiler now has to optimize parallel code!
• The Cray X1's "big hammer" gsync() prevents interesting communication optimizations

