Performance Evaluation of Adaptive MPI

Chao Huang, Gengbin Zheng, Laxmikant Kalé
University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
{chuang10, gzheng, kale}@cs.uiuc.edu

Sameer Kumar
IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA
sameerk@us.ibm.com

Abstract

Processor virtualization via migratable objects is a powerful technique that enables the runtime system to carry out intelligent adaptive optimizations like dynamic resource management. Charm++ is an early language/system that supports migratable objects. This paper describes Adaptive MPI (or AMPI), an MPI implementation and extension that supports processor virtualization. AMPI implements virtual MPI processes (VPs), several of which may be mapped to a single physical processor. AMPI includes a powerful runtime support system that takes advantage of the degree of freedom afforded by allowing it to assign VPs onto processors. With this runtime system, AMPI supports such features as automatic adaptive overlapping of communication and computation, automatic load balancing, flexibility of running on an arbitrary number of processors, and checkpoint/restart support. It also inherits communication optimization from the Charm++ framework. This paper describes AMPI, illustrates its performance benefits through a series of benchmarks, and shows that AMPI is a portable and mature MPI implementation that offers various performance benefits to dynamic applications.

Categories and Subject Descriptors: D.1.3 [Concurrent Programming]: Parallel programming

General Terms: Performance, Experimentation, Languages

Keywords: MPI, Adaptivity, Processor Virtualization, Load Balancing, Communication Optimization

This work was supported in part by DOE Grant B341494 and B505214, and by the National Science Foundation through Grant ITR 0205611 and TeraGrid resources at NCSA and Terascale Computing System at the Pittsburgh Supercomputing Center.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
PPoPP'06, March 29-31, 2006, New York, New York, USA.
Copyright (c) 2006 ACM 1-59593-189-9/06/0003...$5.00.

1. Introduction

The new generation of parallel applications are complex, involve simulation of dynamically varying systems, and use adaptive techniques such as multiple timestepping and adaptive refinements. Typical implementations of MPI do not support the dynamic nature of these applications well. As a result, programming productivity and parallel efficiency suffer. In this paper, we present a performance evaluation of Adaptive MPI (AMPI), an adaptive implementation of MPI. Through analysis of the results from a series of benchmarks, we illustrate that AMPI, while still retaining the familiar programming model of MPI, is better suited for such new generation applications, and does not penalize the performance of applications without the dynamic nature.

The key concept behind AMPI is processor virtualization. Standard MPI programs divide the computation onto P processes, and typical MPI implementations simply execute each process on one of the P processors. In contrast, an AMPI programmer divides the computation into a number V of virtual processors (VPs), and the AMPI runtime system maps these VPs onto P physical processors. In other words, AMPI provides an effective division of labor between the programmer and the system. The programmer still programs each process with the same syntax as specified in the MPI Standard. Further, not being restricted by the physical processors, he/she is able to design more flexible partitioning that best fits the nature of the parallel problem. The runtime system, on the other hand, has the opportunity of adaptively mapping and re-mapping the programmer's virtual processors onto the physical machine.

In AMPI, the MPI processes are implemented by user-level threads embedded in migratable parallel objects, many of which can be mapped onto one physical processor. The number of virtual processors V and the number of physical processors P are independent, allowing the programmer to design a more natural expression of the algorithm. For example, algorithmic considerations often restrict the number of processors to a power of 2 or a cube, and with AMPI, V can still be a cube even though P is prime. When V = P, the program executes the same way it would with other MPI implementations, and it enjoys only part of the benefit of AMPI, such as collective communication optimization. To take full advantage of the AMPI runtime system, typically we have V significantly larger than P.

Before describing the details of the design and implementation of AMPI and the underlying Charm++ framework, we first motivate AMPI by explaining the benefits of using multiple virtual processors per physical processor.

1.1 Benefits of Virtualization

In [1], the authors have discussed in detail the benefits of processor virtualization in parallel programming. The Charm++ system has indeed taken full advantage of these productivity benefits. AMPI inherits most of the merits from Charm++, while furnishing the common MPI programming environment. The following is a list of the benefits that we will demonstrate in this paper. We will show that AMPI, with these benefits, effectively improves the performance of complex and dynamic parallel programs with virtualization, and incurs very little overhead for applications without the dynamic nature.

Adaptive overlapping of communication and computation: If one of the virtual processors is blocked on a receive, another virtual processor on the same physical processor can run. This largely eliminates the need for the programmer to manually specify some static computation/communication overlapping, as is often required in MPI.

Automatic load balancing: If some of the physical processors become overloaded, the runtime system can migrate a few of their virtual processors to relatively underloaded physical processors. Our runtime system can make such load balancing decisions based on automatic instrumentation.

Flexibility to run on an arbitrary number of processors: Since more than one VP can be executed on one physical processor, AMPI is capable of running MPI programs on any arbitrary number of processors. This feature proves to be useful in application development and debugging phases.

Optimized
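To make the virtualization idea above concrete, the following is a minimal, self-contained sketch of mapping V virtual processors onto P physical processors and then rebalancing by migration. It is not AMPI or Charm++ code: the function names, the round-robin initial placement, the greedy migration rule, and the per-VP load figures are all assumptions chosen for illustration, and the paper does not specify the strategy its runtime actually uses.

```python
# Illustrative sketch only; not AMPI/Charm++ code. All names, the round-robin
# placement, and the greedy migration rule are hypothetical.

def initial_mapping(num_vps, num_procs):
    """Place VP i on physical processor i % num_procs (round-robin)."""
    return [vp % num_procs for vp in range(num_vps)]


def processor_loads(mapping, vp_loads, num_procs):
    """Sum the measured load of the VPs currently placed on each processor."""
    loads = [0.0] * num_procs
    for vp, proc in enumerate(mapping):
        loads[proc] += vp_loads[vp]
    return loads


def rebalance(mapping, vp_loads, num_procs):
    """Greedily migrate VPs from the most loaded to the least loaded processor.

    A VP is moved only if its load is strictly smaller than the gap between
    the two processors, so every migration narrows that gap.
    """
    mapping = list(mapping)
    while True:
        loads = processor_loads(mapping, vp_loads, num_procs)
        src = max(range(num_procs), key=loads.__getitem__)
        dst = min(range(num_procs), key=loads.__getitem__)
        gap = loads[src] - loads[dst]
        candidates = [
            vp for vp, proc in enumerate(mapping)
            if proc == src and 0 < vp_loads[vp] < gap
        ]
        if not candidates:
            return mapping
        # Prefer the VP whose load is closest to half the gap.
        best = min(candidates, key=lambda vp: abs(vp_loads[vp] - gap / 2))
        mapping[best] = dst


if __name__ == "__main__":
    V, P = 8, 3  # V is a cube (2**3) while P is prime
    vp_loads = [4, 1, 1, 4, 1, 1, 4, 1]  # assumed per-VP load measurements
    before = initial_mapping(V, P)
    after = rebalance(before, vp_loads, P)
    print("before:", processor_loads(before, vp_loads, P))
    print("after: ", processor_loads(after, vp_loads, P))
```

With these assumed loads, the round-robin placement puts all three heavy VPs on the same processor, and the greedy migrations bring the maximum per-processor load down from 12 to 6. The application-level decomposition (V = 8) never changes; only the VP-to-processor mapping does, which is the division of labor between programmer and runtime that the text describes.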