DOC PREVIEW
UW-Madison CS 739 - Reliable Communication in the Presence of Failures

This preview shows page 1-2-14-15-29-30 out of 30 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 30 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 30 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 30 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 30 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 30 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 30 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 30 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Reliable Communication in the Presence of Failures KENNETH P. BIRMAN and THOMAS A. JOSEPH Cornell University The design and correctness of a communication facility for a distributed computer system are reported on. The facility provides support for fault-tolerant process groups in the form of a family of reliable multicast protocols that can be used in both local- and wide-area networks. These protocols attain high levels of concurrency, while respecting application-specific delivery ordering constraints, and have varying cost and performance that depend on the degree of ordering desired. In particular, a protocol that enforces causal delivery orderings is introduced and shown to be a valuable alternative to conventional asynchronous communication protocols. The facility also ensures that the processes belonging to a fault-tolerant process group will observe consistent orderings of events affecting the group as a whole, including process failures, recoveries, migration, and dynamic changes to group properties like member rankings. A review of several uses for the protocols in the ISIS system, which supports fault-tolerant resilient objects and bulletin boards, illustrates the significant simplification of higher level algorithms made possible by our approach. Categories and Subject Descriptors: C.2.4 [Computer-Communication Networks]: Distributed Systems--distributed applications; distributed databases; C.4 [Computer Systems Organization]: Performance of Systems-reliability, availability, and serviceability; D.4.1 [Operating Systems]: Process Management-concurrency; synchronization; D.4.5 [Operating Systems]: Reliability- fault-tolerance; H.2.2 [Database Management]: Physical Design-recouery and restart General Terms: Reliability Additional Key Words and Phrases: Atomic broadcast, fault-tolerant process groups, reliable broad- cast 1. INTRODUCTION This paper presents a set of communication primitives for supporting distributed computations in an environment where failures could occur. We are primarily concerned with halting failures, whereby a process stops executing without performing any incorrect actions. Each distributed computation is represented as a set of events operating on a process state and a partial order on those events, corresponding to the thread of control. The types of events considered include local computations by a process, broadcasts from a process to a set of processes, This work was supported by the Defense Advanced Research Projects Agency (DOD) under ARPA order 5378, contract MDA903-85-C-0124, and by the National Science Foundation under grant DCR-8412582. Authors’ address: Department of Computer Science, Cornell University, Ithaca, NY 14853. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. 0 1987 ACM 0734-2071/87/0200-0047 $00.75 ACM Transactions on Computer Systems, Vol. 5, No. 1, February 1987, Pages 47-76.48 l K. P. Birman and T. A. Joseph broadcasts subject to predetermined ordering constraints, process failures, and process recoveries. Our premise is that event orderings should be subsumed into the commu- nication layer of a distributed system. In addition, since increasing concur- rency generally improves performance in distributed systems, we ask how much communication-level concurrency can be achieved while still respecting event- ordering constraints specified by the computations. An important feature of our approach is that it enables a process to deduce the event orderings that will be observed by other processes in the system. This simplifies higher level code and permits distributed computations to be implemented with reduced risk of incon- sistent actions being taken. The approach is formulated in the context of fault- tolerant process groups, which consist of a collection of processes that are cooperating to perform a distributed computation, and interacting using our communication protocols. In particular, when the term broadcast is used below, it refers to the transmission of a message from a process to the members of a process group (and possibly some additional processes), not to all sites or processes in the system, as has often been the case in prior work on broadcast protocols. An example will illustrate the class of problems that we address here. Consider a process p that is updating a replicated data item maintained by a set of data managers. Assume that this update is performed using a reliable broadcast: If any data manager receives the broadcast and remains operational, all data managers will receive it. If p fails, a data manager could observe any of several outcomes: 1. The data manager receives the update and then detects the failure. 2. It detects the failure and receives the update later. 3. It detects the failure, and the update is not delivered (anywhere). In an asynchronous system, a data manager may not be able to differentiate between the second and third outcomes in finite time. Moreover, if some data managers experience the first outcome and others the second one, the system must still behave correctly. One way to address problems such as these is for each process to run an agreement protocol to decide on what action to take after it detects a failure [16]. This approach could be slow because it is synchronous, and expensive because each process has to run such a protocol. Another possibility is to discard messages that are received by a process after it has learned that the sender has failed. However, inconsistencies may arise if messages are


View Full Document

UW-Madison CS 739 - Reliable Communication in the Presence of Failures

Documents in this Course
Load more
Download Reliable Communication in the Presence of Failures
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Reliable Communication in the Presence of Failures and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Reliable Communication in the Presence of Failures 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?