CMSC 714, Fall 2007, Lecture 5: MPI vs. OpenMP and Titanium
Alan Sussman, Jeffrey K. Hollingsworth

Notes
- MPI project due Friday, 6 PM
- Questions on debugging MPI programs
- Need volunteers to present papers, starting with the Sisal programming language paper one week from today

OpenMP + MPI
- Some applications can take advantage of both message passing and threads
  - The question is what to do to obtain the best overall performance without too much programming difficulty
  - The choices are all MPI, all OpenMP, or both
  - For both, a common option is an outer loop parallelized with message passing and an inner loop with directives to generate threads (a sketch of this pattern follows the CGWAVE slide below)
- Applications studied:
  - Hydrology: CGWAVE
  - Computational chemistry: GAMESS
  - Linear algebra: matrix multiplication and QR factorization
  - Seismic processing: SPECseis95
  - Computational fluid dynamics: TLNS3D
  - Computational physics: CRETIN

Types of parallelism in the codes
- For message-passing parallelism (MPI):
  - Parametric: coarse-grained outer loop, essentially task parallel
  - Structured domains: domain decomposition with local operations, on structured and unstructured grids
  - Direct solvers: linear algebra with lots of communication and load balancing required; message passing works well for large systems of equations
- For shared-memory parallelism (OpenMP):
  - Statically scheduled parallel loops: one large loop or several smaller, non-nested parallel loops
  - Parallel regions: merge loops into one parallel region to reduce the overhead of directives
  - Dynamic load balancing: used when static scheduling leads to load imbalance from irregular task sizes

CGWAVE
- Finite elements: MPI parameter-space evaluation in the outer loop, OpenMP sparse linear equation solver in the inner loops
- Speedup from using two levels of parallelism makes modeling larger bodies of water possible in a reasonable amount of time
- Master-worker strategy for dynamic load balancing in the MPI (component) part
- The solver for each component solves a large sparse linear system, parallelized with OpenMP
- On the SGI Origin 2000, a distributed shared memory machine, the first-touch rule is used to migrate the data for each component to the processor that uses it
- Performance results show that the best performance is obtained using both MPI and OpenMP, with a combination of MPI workers and OpenMP threads that depends on the problem (grid) size, and, for load balancing, a lot fewer MPI workers than components
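
The outer-MPI / inner-OpenMP pattern above, combined with CGWAVE's master-worker load balancing, can be sketched roughly as follows. This is a minimal, hypothetical example, not code from the paper: the component count, message tags, and the placeholder "solver" are all assumptions, and the real CGWAVE inner solver is a sparse finite-element solver rather than the dummy reduction used here.

/*
 * Hypothetical two-level sketch: an MPI master hands out "components"
 * (coarse-grained outer parallelism, dynamically load balanced) and each
 * worker processes its component with an OpenMP-parallel inner loop.
 *
 * Build (typical):  mpicc -fopenmp hybrid_sketch.c -o hybrid_sketch
 * Run   (typical):  OMP_NUM_THREADS=4 mpirun -np 3 ./hybrid_sketch
 */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N_COMPONENTS 16   /* outer-loop work units (assumed count)      */
#define N 100000          /* size of each component's dummy "system"    */
#define TAG_WORK 1
#define TAG_DONE 2

/* Placeholder for the per-component solve: OpenMP parallelizes the inner loop. */
static double solve_component(int comp)
{
    static double x[N];
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        x[i] = (double)(comp + 1) / (i + 1);   /* fake local work */
        sum += x[i];
    }
    return sum;
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                 /* master: dynamic load balancing */
        int next = 0, active = 0, stop = -1;
        double result;
        MPI_Status st;

        /* prime every worker: real work if available, otherwise stop right away */
        for (int w = 1; w < size; w++) {
            if (next < N_COMPONENTS) {
                MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                next++; active++;
            } else {
                MPI_Send(&stop, 1, MPI_INT, w, TAG_DONE, MPI_COMM_WORLD);
            }
        }

        /* hand out the remaining components as workers finish */
        while (active > 0) {
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_WORK,
                     MPI_COMM_WORLD, &st);
            printf("result from rank %d: %g\n", st.MPI_SOURCE, result);
            if (next < N_COMPONENTS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&stop, 1, MPI_INT, st.MPI_SOURCE, TAG_DONE, MPI_COMM_WORLD);
                active--;
            }
        }
    } else {                         /* worker: OpenMP threads inside each task */
        int comp;
        MPI_Status st;
        for (;;) {
            MPI_Recv(&comp, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_DONE) break;
            double r = solve_component(comp);
            MPI_Send(&r, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}

The point of the structure is that the master assigns components on demand (dynamic load balancing across MPI workers), while each worker exploits the shared memory of its node through the OpenMP loop, which is why fewer MPI workers than components are used.
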
GAMESS
- Computational chemistry (molecular dynamics): MPI across the cluster, OpenMP within each node
- Built on top of the Global Arrays package for distributed array operations, which in turn uses MPI (the paper says PVM) and OpenMP
- The linear algebra solvers mainly use OpenMP, for dynamic scheduling and load balancing
- MPI versions of parts of the code are complex, but can provide higher performance for large problems
- Performance results on a medium-sized problem from SPEC (Standard Performance Evaluation Corp.) are for a small system: 4 8-processor Alpha machines connected by Memory Channel

Linear algebra
- Hybrid parallelism, with MPI for scalability and OpenMP for load balancing, for matrix multiplication and QR factorization
- On an IBM SP system with multiple 4-processor nodes
- Studies the tradeoffs of the hybrid approach for linear algebra algorithms vs. using only MPI, running 4 MPI processes per node
- Uses OpenMP for load balancing and for decreasing communication costs within a node
  - Also helps to hide communication latency behind other operations, which is important for overall performance
- QR factorization results on medium-sized matrices show that adaptive load balancing is better than dynamic loop scheduling within a node

SPECseis95
- For gas and oil exploration; uses FFTs and finite difference solvers
- The original message-passing version, in PVM, is SPMD; the OpenMP version starts serial, then starts an SPMD parallel section (see the sketch at the end of these notes)
  - In the OpenMP version the only shared data is the boundaries; everything else is local, as in the PVM version
  - The OpenMP calls are all in Fortran, no C: the OpenMP compiler caused difficulties for privatizing C global data, and there were thread issues (binding to processors, OS calls)
- The code scales equally well for PVM and OpenMP on an SGI Power Challenge, a DSM machine
  - This is a weak argument because of likely poor PVM message-passing performance in general, and especially on DSM systems

TLNS3D
- CFD code in Fortran 77; uses MPI across grids and OpenMP to parallelize each grid
- Multiple non-overlapping grids (blocks) that exchange data at boundaries periodically
- Static block assignment to processors: divide the blocks into groups with about an equal number of grid points for each processor
- Master-worker execution model at the MPI level, then parallelize the 3D loops for each block with OpenMP
  - Many loops, so need to be careful about the affinity of data objects to processors across loops
- Hard to balance MPI workers vs. OpenMP threads per block: a tradeoff between minimizing load imbalance and minimizing communication and synchronization cost
- Seems to work best on DSMs, but can be done well on distributed-memory systems
- No performance results

CRETIN
- Physics application with multiple levels of message-passing and thread parallelism
- Ported onto both a distributed-memory system (1464 4-processor nodes) and a DSM (a large SGI Origin 2000)
- Complex structure, with 2 parts discussed:
  - Atomic kinetics: multiple zones with lots of computation per zone; maps to either MPI or OpenMP
    - Load balancing across zones is the problem; it requires a complex dynamic algorithm that benefits both versions
  - Radiation transport: mesh sweep across multiple zones, suitable for both MPI and OpenMP
    - Two MPI options to parallelize; which one works best depends on problem size, and one needs a transpose operation for the MPI version
- No performance results

Titanium

Titanium features
- Based on Java, so object oriented
  - Easy to extend, since relatively small and clean
  - Easy to learn if you know C/C++/Java
  - Safe language: better for the programmer and for the compiler
- Explicitly parallel, SPMD execution model
- Global address space
- Zone-based memory management
- Runs on both shared-memory and distributed-memory parallel architectures, with different language features performing better on different architectures
- Compiler implementation
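
The SPMD-style OpenMP structure mentioned for SPECseis95 (and the "merge loops into one parallel region" idea from the types-of-parallelism slide) looks roughly like the sketch below. It is a hypothetical, simplified example: the 1-D three-point update stands in for the real FFT and finite-difference kernels, the domain size and sweep count are made up, and the real code does this in Fortran rather than C.

/*
 * Minimal SPMD-style OpenMP sketch (assumptions noted above): the program
 * starts serially, then enters one long parallel region in which each
 * thread owns a contiguous block of the domain and only the block
 * boundaries are read from other threads.
 *
 * Build (typical): gcc -fopenmp spmd_region_sketch.c -o spmd_region_sketch
 */
#include <omp.h>
#include <stdio.h>

#define N     1024   /* global domain size (assumed) */
#define STEPS 100    /* number of sweeps   (assumed) */

int main(void)
{
    double u[N], unew[N];
    for (int i = 0; i < N; i++) u[i] = (double)i;   /* serial setup */

    #pragma omp parallel default(none) shared(u, unew)
    {
        /* SPMD: each thread computes its block bounds once, then iterates,
           instead of paying directive overhead on every inner loop */
        int nt  = omp_get_num_threads();
        int tid = omp_get_thread_num();
        int lo  = tid * N / nt;            /* first owned index    */
        int hi  = (tid + 1) * N / nt;      /* one past last owned  */
        if (lo == 0) lo = 1;               /* keep fixed endpoints */
        if (hi == N) hi = N - 1;

        for (int step = 0; step < STEPS; step++) {
            /* local sweep over the owned block (3-point average);
               only u[lo-1] and u[hi] come from neighboring blocks */
            for (int i = lo; i < hi; i++)
                unew[i] = 0.25 * u[i-1] + 0.5 * u[i] + 0.25 * u[i+1];

            /* all threads finish writing unew before anyone copies back */
            #pragma omp barrier
            for (int i = lo; i < hi; i++)
                u[i] = unew[i];
            /* ...and finish the copy before the next sweep reads u */
            #pragma omp barrier
        }
    }

    printf("u[N/2] after %d sweeps = %g\n", STEPS, u[N/2]);
    return 0;
}
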

