Unformatted text preview:

Replicated Distributed Programs Eric C Cooper Computer Science Division EECS University of California Berkeley California 94720 Abstract A troupe is a set of replicas of a module executing on this approach was proposed by yon Neumann thirty years ago 29 The idea is to replicate each component to such a degree that the probability of all replicas failing becomes acceptably small The advent of inexpensive distributed machines that have independent failure modes Troupes are the building blocks of replicated distributed programs and the key to achieving high availability Individual members of a troupe do not communicate among themselves and a x e unaware of one another s existence this property is what distinguishes troupes from other software architectures for computing systems consisting of computers connected together by a network makes replication an attractive and practical means of tolerating hardware crashes fault tolerance The ability to vary replication on a per module basis is desirable because it allows software systems to adapt gracefully to changing characteristics of the underlying hardware Even if perfectly reliable hardware were possible there would still be periods during which hardware would be unavailable scheduled down time for preventive maintenance or reconfiguration for example The mechanisms described in this paper permit distributed programs to be reconfigured while they axe executing so that their services remain available during such periods Incorporating replication on a per module basis is more flexible than previous approaches such as providing fault tolerance in hardware or writing it into the application software The first method is too expensive because it uses reliable hardware everywhere not just for critical modules The second approach burdens the programmer with the complexity of a non transparent mechanism The fundamental mechanisms presented in this paper Replicated procedure call is introduced to handle the many to many pattern of conmmnication between troupes The semantics of replicated procedure call can be summarized as exactly once execution at all replicas An implementation of troupes and replicated procedure call is described and its performance is measured The problem of concurrency control for troupes is examined and a commit protocol for replicated atomic transactions is presented Binding and reconfiguration mechanisms for replicated distributed programs are described 1 Introduction This paper addresses the problem of constructing highly available distributed programs The adjectives highly available fault tolerant and nonstop will be used synonymously to describe a system that continues to operate despite failures of some of its components The goal is to construct programs that automatically tolerate crashes of the underlying hardware The problems posed by incorrect software or by hardware failures other than crashes are only addressed briefly The key to tolerating component failures is replication are troupes or replicated modules and replicated procedure call a generalization of remote procedure call for many to many communication between troupes The following important property is what distinguishes troupes and replicated procedure call from previous software architectures for fault tolerance individual members of a troupe do not communicate among themselves and axe unaware of one another s existence This property is also what gives these mechanisms their flexibility and power since each troupe member behaves as if it had no replicas the degree of replication of a troupe can be varied dynamically with no recompilation or relinking Author s present address Department of Computer Science Carnegie Mellon University Pittsburgh Pennsylvania 15213 Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage the ACM copyright notice and the title of the publication and its date appear and notice is given that copying is by permission of the Association for Computing Machinery To copy otherwise or to republish requires a fee and or specific permission 1985 ACM 0 89791 174 1 12 85 0063 00 75 63 Gifford s weighted voting scheme uses quorums and version numbers to provide replication transparency for files 15 Herlihy applied Gifford s quorums to replicated abstract data types 19 by taking advantage of the particular semantics of the data types Gunningberg s design of a fault tolerant message protocol based on triple modular redundancy 17 is similar to but less general than the replicated mechanisms presented in this paper A methodology known as N version programming uses multiple implementations of the same module specification to mask software faults 7 This technique can be used in conjunction with the replicated modules proposed in the present work by using independently implemented modules instead of exact replicas thereby increasing software as Previous papers presented the author s initial ideas about replicated procedure calls 10 and a description of the Circus system 11 This paper presents a portion of the author s Ph D dissertation 12 2 Background and Related Work The idea of achieving fault tolerance by using replication to mask the failures of individual components dates back to yon Neumann 29 The two architectures for faulttolerant software are primary standby systems and modular redundancy In a primary standby scheme only a single component functions normally the remaining replicas are on standby in case the primary fails With modular redundancy each component performs the same function there is some form of voting on the outputs to mask failures A classic primary standby architecture is the method of process pairs in Tandem s Guardian operating system 1 The processes in a process pair execute on different processors One process is designated as the primary the other as the standby Before each request is processed the primary sends information about its internal state to the standby in the form of a checkpoint The checkpoint enables the standby to complete the request if the primary fails The Auragen architecture combines a primary standby scheme with automatic logging of messages 6 If a primary crashes the log is used to replay the appropriate messages to a standby The Isis project at Cornell uses a primary standby architecture for replicated objects 3 In each interaction with a replicated object in Isis one replica plays the role of


View Full Document

UW-Madison CS 739 - Replicated Distributed Programs

Documents in this Course
Load more
Loading Unlocking...
Login

Join to view Replicated Distributed Programs and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Replicated Distributed Programs and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?