CMSC 714, Fall 2007, Lecture 5: MPI vs. OpenMP and Titanium
Alan Sussman, Jeffrey K. Hollingsworth

Notes
- MPI project due Friday, 6 PM
- Questions on debugging MPI programs
- Need volunteers to present papers, starting with the Sisal programming language paper one week from today

OpenMP + MPI
- Some applications can take advantage of both message passing and threads
  - The question is what to do to obtain the best overall performance without too much programming difficulty
  - The choices are all MPI, all OpenMP, or both
  - For both, a common option is an outer loop parallelized with message passing and an inner loop with directives to generate threads (a sketch of this pattern follows the CGWAVE slide below)
- Applications studied:
  - Hydrology: CGWAVE
  - Computational chemistry: GAMESS
  - Linear algebra: matrix multiplication and QR factorization
  - Seismic processing: SPECseis95
  - Computational fluid dynamics: TLNS3D
  - Computational physics: CRETIN

Types of parallelism in the codes
- For message-passing parallelism (MPI):
  - Parametric: coarse-grained outer loop, essentially task parallel
  - Structured domains: domain decomposition with local operations, on structured and unstructured grids
  - Direct solvers: linear algebra with lots of communication and load balancing required; message passing works well for large systems of equations
- For shared-memory parallelism (OpenMP):
  - Statically scheduled parallel loops: one large loop or several smaller, non-nested parallel loops
  - Parallel regions: merge loops into one parallel region to reduce the overhead of directives
  - Dynamic load balancing: used when static scheduling leads to load imbalance from irregular task sizes

CGWAVE
- Finite elements: MPI parameter-space evaluation in the outer loop, OpenMP sparse linear equation solver in the inner loops
- Speedup from using two levels of parallelism makes modeling larger bodies of water possible in a reasonable amount of time
- Master-worker strategy for dynamic load balancing in the MPI (component) part
- The solver for each component solves a large sparse linear system, parallelized with OpenMP
- On the SGI Origin 2000, a distributed shared memory machine, the first-touch rule is used to migrate the data for each component to the processor that uses it
- Performance results show that the best performance is obtained using both MPI and OpenMP, with a combination of MPI workers and OpenMP threads that depends on the problem (grid) size, and, for load balancing, a lot fewer MPI workers than components
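
The outer-MPI / inner-OpenMP pattern above, combined with CGWAVE's master-worker load balancing, can be sketched roughly as follows. This is a minimal, hypothetical example, not code from the paper: the component count, message tags, and the placeholder "solver" are all assumptions, and the real CGWAVE inner solver is a sparse finite-element solver rather than the dummy reduction used here.

/*
 * Hypothetical two-level sketch: an MPI master hands out "components"
 * (coarse-grained outer parallelism, dynamically load balanced) and each
 * worker processes its component with an OpenMP-parallel inner loop.
 *
 * Build (typical):  mpicc -fopenmp hybrid_sketch.c -o hybrid_sketch
 * Run   (typical):  OMP_NUM_THREADS=4 mpirun -np 3 ./hybrid_sketch
 */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N_COMPONENTS 16   /* outer-loop work units (assumed count)      */
#define N 100000          /* size of each component's dummy "system"    */
#define TAG_WORK 1
#define TAG_DONE 2

/* Placeholder for the per-component solve: OpenMP parallelizes the inner loop. */
static double solve_component(int comp)
{
    static double x[N];
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        x[i] = (double)(comp + 1) / (i + 1);   /* fake local work */
        sum += x[i];
    }
    return sum;
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                 /* master: dynamic load balancing */
        int next = 0, active = 0, stop = -1;
        double result;
        MPI_Status st;

        /* prime every worker: real work if available, otherwise stop right away */
        for (int w = 1; w < size; w++) {
            if (next < N_COMPONENTS) {
                MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                next++; active++;
            } else {
                MPI_Send(&stop, 1, MPI_INT, w, TAG_DONE, MPI_COMM_WORLD);
            }
        }

        /* hand out the remaining components as workers finish */
        while (active > 0) {
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_WORK,
                     MPI_COMM_WORLD, &st);
            printf("result from rank %d: %g\n", st.MPI_SOURCE, result);
            if (next < N_COMPONENTS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&stop, 1, MPI_INT, st.MPI_SOURCE, TAG_DONE, MPI_COMM_WORLD);
                active--;
            }
        }
    } else {                         /* worker: OpenMP threads inside each task */
        int comp;
        MPI_Status st;
        for (;;) {
            MPI_Recv(&comp, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_DONE) break;
            double r = solve_component(comp);
            MPI_Send(&r, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}

The point of the structure is that the master assigns components on demand (dynamic load balancing across MPI workers), while each worker exploits the shared memory of its node through the OpenMP loop, which is why fewer MPI workers than components are used.
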
GAMESS
- Computational chemistry (molecular dynamics): MPI across the cluster, OpenMP within each node
- Built on top of the Global Arrays package for distributed array operations, which in turn uses MPI (the paper says PVM) and OpenMP
- The linear algebra solvers mainly use OpenMP, for dynamic scheduling and load balancing
- MPI versions of parts of the code are complex, but can provide higher performance for large problems
- Performance results on a medium-sized problem from SPEC (Standard Performance Evaluation Corp.) are for a small system: 4 8-processor Alpha machines connected by Memory Channel

Linear algebra
- Hybrid parallelism, with MPI for scalability and OpenMP for load balancing, for matrix multiplication and QR factorization
- On an IBM SP system with multiple 4-processor nodes
- Studies the tradeoffs of the hybrid approach for linear algebra algorithms vs. using only MPI, running 4 MPI processes per node
- Uses OpenMP for load balancing and for decreasing communication costs within a node
  - Also helps to hide communication latency behind other operations, which is important for overall performance
- QR factorization results on medium-sized matrices show that adaptive load balancing is better than dynamic loop scheduling within a node

SPECseis95
- For gas and oil exploration; uses FFTs and finite difference solvers
- The original message-passing version, in PVM, is SPMD; the OpenMP version starts serial, then starts an SPMD parallel section (see the sketch at the end of these notes)
  - In the OpenMP version the only shared data is the boundaries; everything else is local, as in the PVM version
  - The OpenMP calls are all in Fortran, no C: the OpenMP compiler caused difficulties for privatizing C global data, and there were thread issues (binding to processors, OS calls)
- The code scales equally well for PVM and OpenMP on an SGI Power Challenge, a DSM machine
  - This is a weak argument because of likely poor PVM message-passing performance in general, and especially on DSM systems

TLNS3D
- CFD code in Fortran 77; uses MPI across grids and OpenMP to parallelize each grid
- Multiple non-overlapping grids (blocks) that exchange data at boundaries periodically
- Static block assignment to processors: divide the blocks into groups with about an equal number of grid points for each processor
- Master-worker execution model at the MPI level, then parallelize the 3D loops for each block with OpenMP
  - Many loops, so need to be careful about the affinity of data objects to processors across loops
- Hard to balance MPI workers vs. OpenMP threads per block: a tradeoff between minimizing load imbalance and minimizing communication and synchronization cost
- Seems to work best on DSMs, but can be done well on distributed-memory systems
- No performance results

CRETIN
- Physics application with multiple levels of message-passing and thread parallelism
- Ported onto both a distributed-memory system (1464 4-processor nodes) and a DSM (a large SGI Origin 2000)
- Complex structure, with 2 parts discussed:
  - Atomic kinetics: multiple zones with lots of computation per zone; maps to either MPI or OpenMP
    - Load balancing across zones is the problem; it requires a complex dynamic algorithm that benefits both versions
  - Radiation transport: mesh sweep across multiple zones, suitable for both MPI and OpenMP
    - Two MPI options to parallelize; which one works best depends on problem size, and one needs a transpose operation for the MPI version
- No performance results

Titanium

Titanium features
- Based on Java, so object oriented
  - Easy to extend, since relatively small and clean
  - Easy to learn if you know C/C++/Java
  - Safe language: better for the programmer and for the compiler
- Explicitly parallel, SPMD execution model
- Global address space
- Zone-based memory management
- Runs on both shared-memory and distributed-memory parallel architectures, with different language features performing better on different architectures
- Compiler implementation
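
The SPMD-style OpenMP structure mentioned for SPECseis95 (and the "merge loops into one parallel region" idea from the types-of-parallelism slide) looks roughly like the sketch below. It is a hypothetical, simplified example: the 1-D three-point update stands in for the real FFT and finite-difference kernels, the domain size and sweep count are made up, and the real code does this in Fortran rather than C.

/*
 * Minimal SPMD-style OpenMP sketch (assumptions noted above): the program
 * starts serially, then enters one long parallel region in which each
 * thread owns a contiguous block of the domain and only the block
 * boundaries are read from other threads.
 *
 * Build (typical): gcc -fopenmp spmd_region_sketch.c -o spmd_region_sketch
 */
#include <omp.h>
#include <stdio.h>

#define N     1024   /* global domain size (assumed) */
#define STEPS 100    /* number of sweeps   (assumed) */

int main(void)
{
    double u[N], unew[N];
    for (int i = 0; i < N; i++) u[i] = (double)i;   /* serial setup */

    #pragma omp parallel default(none) shared(u, unew)
    {
        /* SPMD: each thread computes its block bounds once, then iterates,
           instead of paying directive overhead on every inner loop */
        int nt  = omp_get_num_threads();
        int tid = omp_get_thread_num();
        int lo  = tid * N / nt;            /* first owned index    */
        int hi  = (tid + 1) * N / nt;      /* one past last owned  */
        if (lo == 0) lo = 1;               /* keep fixed endpoints */
        if (hi == N) hi = N - 1;

        for (int step = 0; step < STEPS; step++) {
            /* local sweep over the owned block (3-point average);
               only u[lo-1] and u[hi] come from neighboring blocks */
            for (int i = lo; i < hi; i++)
                unew[i] = 0.25 * u[i-1] + 0.5 * u[i] + 0.25 * u[i+1];

            /* all threads finish writing unew before anyone copies back */
            #pragma omp barrier
            for (int i = lo; i < hi; i++)
                u[i] = unew[i];
            /* ...and finish the copy before the next sweep reads u */
            #pragma omp barrier
        }
    }

    printf("u[N/2] after %d sweeps = %g\n", STEPS, u[N/2]);
    return 0;
}
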

