DOC PREVIEW
UW-Madison CS 739 - Replicated Distributed Programs

This preview shows page 1-2-3-4-5 out of 16 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 16 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 16 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 16 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 16 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 16 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 16 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Abstract Replicated Distributed Programs Eric C. Cooper Computer Science Division--EECS University of California Berkeley, California 94720 A troupe is a set of replicas of a module, executing on machines that have independent failure modes. Troupes are the building blocks of replicated distributed programs and the key to achieving high availability. Individual members of a troupe do not communicate among themselves, and axe unaware of one another's existence; this property is what distinguishes troupes from other software architectures for fault tolerance. Replicated procedure call is introduced to handle the many-to-many pattern of conmmnication between troupes. The semantics of replicated procedure call can be summa- rized as exactly-once execution at all replicas. An implementation of troupes and replicated procedure call is described, and its performance is measured. The problem of concurrency control for troupes is examined, and a commit protocol for replicated atomic transactions is presented. Binding and reconfiguration mechanisms for replicated distributed programs are described. 1 Introduction This paper addresses the problem of constructing highly available distributed programs. (The adjectives highly available, fault-tolerant, and nonstop will be used synony- mously to describe a system that continues to operate de- spite failures of some of its components.) The goal is to construct programs that automatically tolerate crashes of the underlying hardware. The problems posed by incorrect software or by hardware failures other than crashes are only addressed briefly. The key to tolerating component failures is replication; Author's present address: Department of Computer Science, Carnegie.. Mellon University, Pittsburgh, Pennsylvania 15213. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. © 1985 ACM-0-89791- 174- 1- 12/85-0063 $00.75 this approach was proposed by yon Neumann thirty years ago [29]. The idea is to replicate each component to such a degree that the probability of all replicas failing becomes acceptably small. The advent of inexpensive distributed computing systems (consisting of computers connected to- gether by a network) makes replication an attractive and practical means of tolerating hardware crashes. The ability to vary replication on a per-module basis is desirable because it allows software systems to adapt grace- fully to changing characteristics of the underlying hard- ware. Even if perfectly reliable hardware were possible, there would still be periods during which hardware would be unavailable: scheduled down-time for preventive main- tenance or reconfiguration, for example. The mechanisms described in this paper permit distributed programs to be reconfigured, while they axe executing, so that their services remain available during such periods. Incorporating replication on a per-module basis is more flexible than previous approaches, such as providing fault tolerance in hardware or writing it into the application software. The first method is too expensive because it uses reliable hardware everywhere, not just for critical modules. The second approach burdens the programmer with the complexity of a non-transparent mechanism. The fundamental mechanisms presented in this paper are: • troupes, or replicated modules, and • replicated procedure call, a generalization of remote procedure call for many-to-many communication be- tween troupes. The following important property is what distinguishes troupes and replicated procedure call from previous soft- ware architectures for fault tolerance: individual members of a troupe do not communicate among themselves, and axe unaware of one another's existence. This property is also what gives these mechanisms their flexibility and power: since each troupe member behaves as if it had no replicas, the degree of replication of a. troupe can be varied dynam- ically, with no recompilation or relinking. 63Previous papers presented the author's initial ideas about replicated procedure calls [10] and a description of the Circus system [11]. This paper presents a portion of the author's Ph.D. dissertation [12]. 2 Background and Related Work The idea of achieving fault tolerance by using replica- tion to mask the failures of individual components dates back to yon Neumann [29]. The two architectures for fault- tolerant software are primary-standby systems and modular redundancy. In a primary-standby scheme, only a single component functions normally; the remaining replicas are on standby in case the primary fails. With modular redun- dancy, each component performs the same function; there is some form of voting on the outputs to mask failures. A classic primary-standby architecture is the method of process pairs in Tandem's Guardian operating system [1]. The processes in a process pair execute on different proces- sors. One process is designated as the primary, the other as the standby. Before each request is processed, the primary sends information about its internal state to the standby, in the form of a checkpoint. The checkpoint enables the standby to complete the request if the primary fails. The Auragen architecture combines a primary-standby scheme with automatic logging of messages [6]. If a primary crashes, the log is used to replay the appropriate messages to a standby. The Isis project at Cornell uses a primary-standby architecture for replicated objects [3]. In each interaction with a replicated object in Isis, one replica plays the role of coordinator, and only it performs the operation. The coordinator then uses a two-phase commit protocol to update the other replicas. The mechanisms used in primary-standby schemes to allow a standby to take over after the primary crashes are isomorphic to crash recovery mechanisms based on stable storage. Under this isomorphism, a standby corresponds to stable storage


View Full Document

UW-Madison CS 739 - Replicated Distributed Programs

Documents in this Course
Load more
Download Replicated Distributed Programs
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Replicated Distributed Programs and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Replicated Distributed Programs 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?