Implementing a Global Address Space Language on the Cray X1:
the Berkeley UPC Experience

Christian Bell and Wei Chen
CS252 Class Project
December 10, 2003

Unified Parallel C at LBNL/UCB

Outline
• An Overview of UPC and the Berkeley UPC Compiler
• Overview of the Cray X1
• Implementing the GASNet layer on the X1
• Implementing the runtime layer on the X1
• Serial performance
• Evaluation of compiler optimizations

Unified Parallel C (UPC)
• UPC is an explicitly parallel global address space language with SPMD parallelism
  - An extension of ISO C
  - User-level shared memory, partitioned by threads
  - One-sided (bulk and fine-grained) communication through reads/writes of shared variables
[Figure: a global address space whose shared region holds X[0], X[1], …, X[P], with a private region per thread]

Shared Arrays and Pointers in UPC
• Cyclic: shared int A[n];
• Block cyclic: shared [2] int B[n];
• Indefinite: shared [0] int *C = (shared [0] int *) upc_alloc(n);
• Use pointer-to-shared to access shared data
  - Block size is part of the pointer type
  - A generic pointer-to-shared contains: address, thread id, phase
  - Cyclic and indefinite pointers are phaseless

Layout with two threads:
  T0: A[0] A[2] A[4] …   B[0] B[1] B[4] B[5] …   C[0] C[1] C[2] …
  T1: A[1] A[3] A[5] …   B[2] B[3] B[6] B[7] …

Accessing Shared Memory in UPC
[Figure: a pointer-to-shared (address, thread, phase) resolving an element of a block-cyclic array in shared memory: the thread id selects the owning thread's segment, the address locates the start of the block within that segment, and the phase gives the offset from the start of the block]

UPC Programming Model Features
• Block cyclically distributed arrays
• Shared and private pointers
• Global synchronization: barriers
• Pair-wise synchronization: locks
• Parallel loops
• Dynamic shared memory allocation
• Bulk shared memory accesses
• Strict vs. relaxed memory consistency models

Overview of the Berkeley UPC Compiler
Two goals: portability and high performance
  UPC Code
    → Translator (platform-independent): lowers UPC code into ANSI C
  Translator-Generated C Code
    → Berkeley UPC Runtime System (network-independent, compiler-independent): shared memory management and pointer operations
    → GASNet Communication System (language-independent): uniform get/put interface for underlying networks
  Network Hardware

A Layered Design
• Portable:
  - C is our intermediate language
  - Can run on top of MPI (with a performance penalty)
  - GASNet has a layered design with a small core
• High-performance:
  - Native C compiler optimizes serial code
  - Translator can perform high-level communication optimizations
  - GASNet can access network hardware directly and provides a rich set of communication/synchronization primitives

The Cray X1 Architecture
• New line of vector architecture
• Two modes of operation
  - SSP: up to 16 CPUs/node
  - MSP: multistreams long loops
• Single-node UMA, multi-node NUMA (no caching of remote data)
• Global pointers
• Low latency, high bandwidth
• All gets/puts must be loads/stores (directly or via the shmem interface)
• Only puts are "non-blocking"; gets are blocking
• Vectorization is crucial
  - Vector pipeline 2x faster than scalar
  - Utilization of memory bandwidth
  - Strided accesses, scatter-gather, reductions, etc.

GASNet Communication System – Architecture
Two-level architecture to ease implementation:
• Core API
  - Based heavily on Active Messages
  - Compatibility layer
  - Ported to the X1 in 2 days; new algorithm to manipulate queues in shared memory
• Extended API
  - Wider interface that includes more complicated operations (puts, gets)
  - A reference implementation of the extended API in terms of the core API is provided
  - Current revision is tuned especially for the X1, with shared memory as the primary focus (minimal overhead)
[Figure: compiler-generated code → compiler-specific runtime system → GASNet extended API → GASNet core API → network hardware]

GASNet Extended API – Remote Memory Operations
• GASNet offers expressive put/get primitives
  - All gets/puts can be blocking or non-blocking
  - Non-blocking can be explicit (handle-based)
  - Non-blocking can be implicit (global or region-based)
  - Synchronization can poll or block
  - Paves the way for complex split-phase communication (compiler optimizations)
• The Cray X1 uses shared memory exclusively
  - All gets/puts must be loads/stores
  - Only puts are "non-blocking"; gets are blocking
  - Very limited synchronization mechanisms
  - Efficient communication only through vectors (one order of magnitude between scalar and vector communication)
  - Vectorization instead of split-phase?

GASNet and Cray X1 Remote Memory Operations
  GASNet operation                          | Cray X1 instruction | Comment
  Bulk operations                           | Vector bcopy()      | Fully vectorized; suitable for GASNet/UPC
  Non-bulk blocking puts                    | Store + gsync       | No vectorization
  Non-bulk blocking gets                    | Load                |
  Non-bulk non-blocking explicit puts/gets  | Store/load + gsync  | No vectorization if sync is done in the loop
  Non-bulk non-blocking implicit puts/gets  | Store/load + gsync  | No vectorization if sync is done in the loop

• Flexible communication provides no benefit without vectorization (factor of 10 between vector and scalar)
• Difficult to expose vectorization through a layered software stack: the native C compiler now has to optimize parallel code!
• The Cray X1's "big hammer" gsync() prevents interesting communication optimizations