
DejaVu: Transparent User-Level Checkpointing, Migration, and Recovery for Distributed Systems




Joseph F. Ruscio¹, Michael A. Heffner², Srinidhi Varadarajan¹

¹ Computing Systems Research Laboratory, Department of Computer Science, Virginia Tech, VA 24061, USA. {jruscio, srinidhi}@cs.vt.edu
² Evergrid, VA 24060, USA. mike.heffner@evergrid.com

Abstract

In this paper, we present a new fault tolerance system called DejaVu for transparent and automatic checkpointing, migration, and recovery of parallel and distributed applications. DejaVu provides a transparent parallel checkpointing and recovery mechanism that recovers from any combination of systems failures without any modification to parallel applications or the OS. It uses a new runtime mechanism for transparent incremental checkpointing that captures the least amount of state needed to maintain global consistency, and provides a novel communication architecture that enables transparent migration of existing MPI codes without source code modifications. Performance results from the production-ready implementation show less than 5% overhead in real-world parallel applications with large memory footprints.¹

¹ This material is based upon work supported in part by the National Science Foundation (NSF) under Grant No. 0325534. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF. This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC03-76SF00098.

1-4244-0910-1/07/$20.00 © 2007 IEEE

1 Introduction

Enabling the next generation of computational infrastructures, in particular the envisioned national cyberinfrastructure, requires fundamental advances in transparent fault recovery. The large component count inherent in the increasingly popular cluster-based systems increases the instability of the resource as a whole due to


