UMD CMSC 714 - MPI vs. OpenMP and Titanium

Slide 1: CMSC 714, Lecture 5 - MPI vs. OpenMP and Titanium
Alan Sussman
CMSC 714, Fall 2005 - Alan Sussman & Jeffrey K. Hollingsworth

Slide 2: Notes
- First programming assignment coming soon
  - Slight change from the original plan: you'll write one program, first using either OpenMP or MPI, then the other
- Lorin Hochstein will talk at the end of class today on his study of all of you writing parallel programs
- First, questions on OpenMP and UPC
  - Directives vs. language extensions

Slide 3: OpenMP + MPI
- Some applications can take advantage of both message passing and threads
  - The question is what to do to obtain the best overall performance without too much programming difficulty
  - The choices are all MPI, all OpenMP, or both
    - For both, a common option is an outer loop parallelized with message passing and an inner loop with directives to generate threads (a minimal code sketch of this pattern follows Slide 7 below)
- Applications studied:
  - Hydrology: CGWAVE
  - Computational chemistry: GAMESS
  - Linear algebra: matrix multiplication and QR factorization
  - Seismic processing: SPECseis95
  - Computational fluid dynamics: TLNS3D
  - Computational physics: CRETIN

Slide 4: Types of parallelism in the codes
- Message passing parallelism (MPI)
  - Parametric: coarse-grained outer loop, essentially task parallel
  - Structured domains: domain decomposition with local operations, on structured and unstructured grids
  - Direct solvers: linear algebra, with lots of communication and load balancing required; message passing works well for large systems of equations
- Shared memory parallelism (OpenMP)
  - Statically scheduled parallel loops: one large loop, or several smaller non-nested parallel loops
  - Parallel regions: merge loops into one parallel region to reduce the overhead of directives
  - Dynamic load balancing: when static scheduling leads to load imbalance from irregular task sizes
  (a short code sketch of these three styles appears at the end of this preview)

Slide 5: CGWAVE
- Finite elements: MPI parameter space evaluation in the outer loop, OpenMP sparse linear equation solver in the inner loops
- The speedup from using two levels of parallelism makes modeling larger bodies of water possible in a reasonable amount of time
- Master-worker strategy for dynamic load balancing in the MPI part (components are the units of work)
- The solver for each component solves a large sparse linear system, parallelized with OpenMP
- On the SGI Origin 2000 (a distributed shared memory machine), the first-touch rule is used to migrate the data for each component to the processor that uses it
- Performance results show that the best performance is obtained using both MPI and OpenMP, with a combination of MPI workers and OpenMP threads that depends on the problem/grid size
  - And, for load balancing, many fewer MPI workers than components

Slide 7: Notes
- First programming assignment is on the web page
  - Still a work in progress
  - You will get email telling you which parallel version to do first, OpenMP or MPI
  - Still some issues with MPI on the cluster, but we're working on them
- Not here on Tuesday
  - Jeff Hollingsworth will teach the class, on HPF
  - No questions for next time, since you've already sent me HPF questions
- Need volunteers to present papers
  - Starting with the Sisal programming language paper, one week from today
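
A minimal sketch of the "outer loop in MPI, inner loop in OpenMP" pattern from Slide 3, using a CGWAVE-style master-worker outer level. This is an illustration, not code from the papers: NCOMP, solve_component(), and the arithmetic inside the OpenMP loop are invented placeholders, and a real application would ship component data rather than just an index. Build with an MPI compiler wrapper plus OpenMP support (for example, mpicc -fopenmp) and run with at least two MPI processes so there is at least one worker.

    /* Hypothetical sketch: an MPI master hands out components (outer,
     * coarse-grained parallelism); each worker solves its component
     * with an OpenMP-parallelized inner loop. */
    #include <mpi.h>
    #include <stdio.h>

    #define NCOMP 64            /* number of components (made up) */
    #define TAG_WORK 1
    #define TAG_DONE 2

    static void solve_component(int c) {
        double sum = 0.0;
        /* Inner loop: the worker's OpenMP threads share one component. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < 1000000; i++)
            sum += (double)i * c;           /* placeholder for solver work */
        printf("component %d -> %g\n", c, sum);
    }

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                    /* master: dynamic load balancing */
            int next = 0, finished = 0;
            MPI_Status st;
            while (finished < size - 1) {
                int idle;
                MPI_Recv(&idle, 1, MPI_INT, MPI_ANY_SOURCE, TAG_DONE,
                         MPI_COMM_WORLD, &st);
                if (next < NCOMP) {         /* hand out the next component */
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                             MPI_COMM_WORLD);
                    next++;
                } else {                    /* no work left: tell worker to stop */
                    int stop = -1;
                    MPI_Send(&stop, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                             MPI_COMM_WORLD);
                    finished++;
                }
            }
        } else {                            /* worker: request, solve, repeat */
            int me = rank, c;
            for (;;) {
                MPI_Send(&me, 1, MPI_INT, 0, TAG_DONE, MPI_COMM_WORLD);
                MPI_Recv(&c, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                if (c < 0) break;
                solve_component(c);
            }
        }
        MPI_Finalize();
        return 0;
    }

The point the CGWAVE slide makes shows up directly in this structure: the number of MPI workers and the number of OpenMP threads per worker can be tuned independently, and dynamically assigning components to workers is what gives the MPI level its load balance.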
Slide 8: GAMESS
- Computational chemistry (molecular dynamics): MPI across the cluster, OpenMP within each node
- Built on top of the Global Arrays package for distributed array operations
  - Which in turn uses MPI (the paper says PVM) and OpenMP
- Linear algebra solvers mainly use OpenMP, for dynamic scheduling and load balancing
- MPI versions of parts of the code are complex, but can provide higher performance for large problems
- Performance results on a "medium" sized problem from SPEC (the Standard Performance Evaluation Corporation) are for a small system: four 8-processor Alpha servers connected by Memory Channel

Slide 9: Linear algebra
- Hybrid parallelism with MPI for scalability and OpenMP for load balancing, for matrix multiplication and QR factorization (a generic code sketch of this node/thread split follows Slide 12 below)
- On an IBM SP system with multiple 4-processor nodes
- Studies the tradeoffs of the hybrid approach for linear algebra algorithms vs. using only MPI (running 4 MPI processes per node)
- OpenMP is used for load balancing and for decreasing communication costs within a node
- It also helps to hide communication latency behind other operations, which is important for overall performance
- QR factorization results on "medium" sized matrices show that adaptive load balancing is better than dynamic loop scheduling within a node

Slide 10: SPECseis95
- For gas and oil exploration
  - Uses FFTs and finite-difference solvers
- The original message passing version (in PVM) is SPMD; the OpenMP version starts serial, then starts an SPMD parallel section (an SPMD-style OpenMP sketch also appears after Slide 12)
  - In the OpenMP version, only the boundaries are shared data; everything else is local (as in the PVM version)
  - OpenMP calls are all in Fortran, since there was no C OpenMP compiler; this caused difficulties privatizing C global data, plus thread issues (binding to processors, OS calls)
- The code scales equally well for PVM and OpenMP on an SGI Power Challenge (a DSM machine)
  - This is a weak argument, because of likely poor PVM message passing performance (in general, and especially on DSM systems)

Slide 11: TLNS3D
- CFD code in Fortran 77; uses MPI across grids and OpenMP to parallelize within each grid
- Multiple, non-overlapping grids/blocks that exchange data at their boundaries periodically
- Static block assignment to processors: divide the blocks into groups with about an equal number of grid points for each processor
- Master-worker execution model at the MPI level, then parallelize the 3D loops for each block with OpenMP
  - There are many loops, so care is needed about the affinity of data objects to processors across loops
- Hard to balance MPI workers vs. OpenMP threads per block: a tradeoff between minimizing load imbalance and communication/synchronization cost
- Seems to work best on DSMs, but can be done well on distributed memory systems
- No performance results!

Slide 12: CRETIN
- Physics application with multiple levels of message passing and thread parallelism
- Ported to both a distributed memory system (1464 4-processor nodes) and a DSM (a large SGI Origin 2000)
- Complex structure, ...
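
A generic sketch of the node/thread split described on the Linear algebra slide (Slide 9): MPI distributes row blocks of A across nodes, B is replicated by broadcast, and OpenMP parallelizes the local multiply. The dimensions and demo data are made up, N is assumed divisible by the number of MPI processes, and the adaptive load balancing and communication/computation overlap studied in the paper are omitted.

    /* Hypothetical hybrid matrix multiply: MPI between nodes, OpenMP within. */
    #include <mpi.h>
    #include <stdlib.h>

    #define N 512                          /* matrix dimension (made up) */

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int rows = N / size;               /* rows of A and C owned by this rank */
        double *A = malloc((size_t)rows * N * sizeof *A);  /* local row block */
        double *B = malloc((size_t)N * N * sizeof *B);     /* replicated */
        double *C = calloc((size_t)rows * N, sizeof *C);

        if (rank == 0)                                      /* demo data */
            for (size_t i = 0; i < (size_t)N * N; i++) B[i] = 1.0;
        MPI_Bcast(B, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        for (size_t i = 0; i < (size_t)rows * N; i++) A[i] = 1.0;

        /* Node-level parallelism: one OpenMP thread team per MPI rank. */
        #pragma omp parallel for
        for (int i = 0; i < rows; i++)
            for (int k = 0; k < N; k++)
                for (int j = 0; j < N; j++)
                    C[i * N + j] += A[i * N + k] * B[k * N + j];

        /* Rank 0 could now gather the C blocks (e.g. MPI_Gather); omitted. */
        free(A); free(B); free(C);
        MPI_Finalize();
        return 0;
    }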

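An SPMD-style OpenMP sketch of the structure described for SPECseis95 (Slide 10): the program starts serial, opens one long parallel region in which each thread works on its own private data, and shares only boundary values. The names (NZ, boundary, the placeholder computation) are invented, the real code is in Fortran, and the sketch assumes at most 64 threads.

    /* Hypothetical SPMD-style OpenMP region: private work arrays per thread,
     * shared storage only for boundary exchange. */
    #include <omp.h>
    #include <stdio.h>

    #define NZ 1024                        /* per-thread data size (made up) */

    static double boundary[64];            /* shared: boundary values only */

    int main(void) {
        /* ... serial setup would go here ... */
        #pragma omp parallel default(none) shared(boundary)
        {
            int me = omp_get_thread_num();
            int nthreads = omp_get_num_threads();
            double local[NZ];              /* private, like a PVM task's memory */

            for (int i = 0; i < NZ; i++)
                local[i] = me + i * 1e-3;  /* placeholder computation */

            boundary[me] = local[NZ - 1];  /* publish only the boundary value */
            #pragma omp barrier            /* neighbors may read it after this */

            double left = (me > 0) ? boundary[me - 1] : 0.0;
            printf("thread %d of %d sees left boundary %g\n",
                   me, nthreads, left);
        }
        return 0;
    }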

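A short sketch of the three OpenMP usage styles listed on the "Types of parallelism in the codes" slide (Slide 4): a statically scheduled loop, several loops merged into one parallel region to pay the fork/join cost once, and dynamic scheduling when iteration costs are irregular. The array names, sizes, and irregular_task() are invented for illustration.

    /* Hypothetical examples of the three OpenMP styles from Slide 4. */
    #define N 100000

    static double a[N], b[N], c[N];

    static double irregular_task(int i) {  /* made-up work whose cost varies */
        double s = 0.0;
        for (int k = 0; k < i % 1000; k++) s += k * 1e-6;
        return s;
    }

    int main(void) {
        /* 1. Statically scheduled parallel loop: regular, equal-cost iterations. */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < N; i++) a[i] = i * 0.5;

        /* 2. Several loops merged into one parallel region, instead of one
              "parallel for" per loop, to reduce directive (fork/join) overhead. */
        #pragma omp parallel
        {
            #pragma omp for
            for (int i = 0; i < N; i++) b[i] = a[i] + 1.0;
            #pragma omp for
            for (int i = 0; i < N; i++) c[i] = b[i] * 2.0;
        }

        /* 3. Dynamic scheduling when static chunks would be load-imbalanced. */
        #pragma omp parallel for schedule(dynamic, 64)
        for (int i = 0; i < N; i++) c[i] += irregular_task(i);

        return 0;
    }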