Transparent User-Level Checkpointing, Migration, and Recovery for Distributed Systems

Home> Academic Documents> Transparent User-Level Checkpointing, Migration, and Recovery for Distributed Systems

DOC PREVIEW

This preview shows page 1-2-3 out of 10 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 10 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 10 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 10 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 10 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

DejaVu: Transparent User-Level Checkpointing, Migration, and Recovery forDistributed SystemsJoseph F. Ruscio1Michael A. Heffner2Srinidhi Varadarajan11Computing Systems Research LaboratoryDepartment of Computer ScienceVirginia Tech, VA 24061, USA{jruscio,srinidhi}@cs.vt.edu2Evergrid, VA 24060, USA{mike.heffner}@evergrid.comAbstractIn this paper, we present a new fault tolerance systemcalled DejaVu for transparent and automatic checkpointing,migration, and recovery of parallel and distributed applica-tions. DejaVu provides a transparent parallel checkpointingand recovery mechanism that recovers from any combina-tion of systems failures without any modification to parallelapplications or the OS. It uses a new runtime mechanismfor transparent incremental checkpointing that captures theleast amount of state needed to maintain global consistencyand provides a novel communication architecture that en-ables transparent migration of existing MPI codes, withoutsource-code modifications. Performance results from theproduction-ready implementation show less than 5% over-head in real-world parallel applications with large memoryfootprints.11 IntroductionEnabling the next generation of computational infras-tructures, in particular the envisioned national cyberinfras-tructure, requires fundamental advances in transparent faultrecovery. The large component count inherent in the in-creasingly popular cluster-based systems increases the in-1This material is based upon work supported in part by the NationalScience Foundation(NSF) under Grant No. 0325534. Any opinions, find-ings and conclusions or recommendations expressed in this material arethose of the author(s) and do not necessarily reflect the views of the NSF.This research used resources of the National Energy Research ScientificComputing Center, which is supported by the Office of Science of the U.S.Department of Energy under Contract No. DE-AC03-76SF00098.1-4244-0910-1/07/$20.00c2007 IEEE.stability of the resource as a whole due to the combinatorialdependency of the integrated system on single-componentfailure rates. Table 1 clearly illustrates that systems withlarge numbers of cores can expect high failure rates. In suchan environment some form of Checkpoint/Restart (CP/R) isthe only mechanism that can guarantee large computationaljobs will finish. For example in December 2001 LANL rana complete nuclear explosion simulation on ASCI whichconsumed 2000 processors for 4 months. Custom applica-tion level checkpointing code recorded the job’s state every70 minutes and during the course of that 4 month period thejob was restarted over 100 times [12]. Engendering stabil-ity in ever-growing, networked, collections of cluster sys-tems needs a software solution that provides reliable accessto computing resources through transparent, efficient, andautomatic checkpointing and recovery (CPR) mechanisms,a view echoed in the recent emphasis on recovery-orientedcomputing [16].We present a new system called DejaVu for transpar-ent and automatic checkpointing, migration and recovery ofparallel and distributed applications. DejaVu provides (a)a transparent parallel checkpointing and recovery mecha-nism that recovers from any combination of systems failureswithout modification to parallel applications. (b) a novelinstrumentation and state capture mechanism that transpar-ently captures application state, (c) novel runtime mech-anisms for transparent incremental checkpointing, to effi-ciently capture the least amount of state required to main-tain global consistency, (d) a novel communications archi-tecture that enables transparent migration of existing MPIcodes without source-code modifications, and (e) recover-able IO subsystems that can be tailored to specific storageenvironments.In this paper, we concentrate on a subset of the capa-System # CPUs ReliabilityASCI 8,192 MTBI: 6.5 Hours, 114 unplanned outages/month.Q Hardware outage sources: storage, CPU, memory.ASCI 8,192 MTBF: 5 hours (2001) and 40 hours (2003).White HW outage sources: storage, CPU, 3rd-party HW.NERSC 6,566 MTBI: 14 days. MTTR: 3.3 hours.Seaborg SW is the main outage source.Table 1. Reliability and Availability of Large-Scale HPC Systems [18]. (MTBI: Mean Time BetweenInterruptions, MTBF: Mean Time Between Failures, MTTR: Mean Time To Recovery)bilities of DejaVu. We present the core algorithm behindDejaVu, proof of global consistency, and its performanceon several applications on a cluster with Infiniband inter-connect and a shared GPFS file system.The rest of the paper is organized as follows. Section 2presents an overview of related work. Section 3 describesthe high level architecture and theoretical basis of DejaVu.Section 4 describes the implementation of DejaVu on x86-based Linux systems interconnected with Infiniband. Sec-tion 5 presents the results of several experiments used tovalidate the performance of the CPR framework. Section6 concludes the paper and presents directions for ongoingwork.2 Related WorkIn this section, we present a brief overview of the re-lated work in the area of checkpointing and recovery ofdistributed systems. The survey presented in [9] providesa much more comprehensive evaluation of the large vol-ume of work in the area of rollback-recovery protocols fordistributed systems. From the perspective of our work,rollback recovery protocols can broadly be classified alongtwo dimensions — coordination model and implementationlevel.2.1 Coordination ModelCheckpoint coordination model describes how a dis-tributed CP/R system orchestrates the individual check-points taken by each process to create a consistent aggregateglobal checkpoint. CP/R system’s coordination models areclassified as either uncoordinated or coordinated. Uncoordi-nated checkpointing methods do not synchronize the check-points of individual processes, rather each process sched-ules and takes its checkpoints independently. Dependencyinformation is tracked to determine the set of individualcheckpoints that comprise a globally consistent state. Eachprocess must maintain multiple (if not all) sequential check-points, as a dependency will most likely exist from someprocess’s most recent checkpoint to an older checkpoint ofanother process. Uncoordinated checkpointing is vulnera-ble to the domino effect [9] that can cause the system to rollall the way back to its initial state. Uncoordinated check-pointing schemes were conceived when network communi-cation operations were more expensive then


School:
Email:
New Password:
Confirm Password:

This preview shows page 1-2-3 out of 10 pages.

Please select your school