Reliable Communication in the Presence of Failures KENNETH P BIRMAN and THOMAS A JOSEPH Cornell University The design and correctness of a communication facility for a distributed computer system are reported on The facility provides support for fault tolerant process groups in the form of a family of reliable multicast protocols that can be used in both local and wide area networks These protocols attain high levels of concurrency while respecting application specific delivery ordering constraints and have varying cost and performance that depend on the degree of ordering desired In particular a protocol that enforces causal delivery orderings is introduced and shown to be a valuable alternative to conventional asynchronous communication protocols The facility also ensures that the processes belonging to a fault tolerant process group will observe consistent orderings of events affecting the group as a whole including process failures recoveries migration and dynamic changes to group properties like member rankings A review of several uses for the protocols in the ISIS system which supports fault tolerant resilient objects and bulletin boards illustrates the significant simplification of higher level algorithms made possible by our approach Categories and Subject Descriptors C 2 4 Computer Communication Networks Distributed Systems distributed applications distributed databases C 4 Computer Systems Organization Performance of Systems reliability availability and serviceability D 4 1 Operating Systems Process Management concurrency synchronization D 4 5 Operating Systems Reliabilityfault tolerance H 2 2 Database Management Physical Design recouery and restart General Terms Reliability Additional cast Key Words and Phrases Atomic broadcast fault tolerant process groups reliable broad 1 INTRODUCTION This paper presents a set of communication primitives for supporting distributed computations in an environment where failures could occur We are primarily concerned with halting failures whereby a process stops executing without performing any incorrect actions Each distributed computation is represented as a set of events operating on a process state and a partial order on those events corresponding to the thread of control The types of events considered include local computations by a process broadcasts from a process to a set of processes This work was supported by the Defense Advanced Research Projects Agency DOD under ARPA order 5378 contract MDA903 85 C 0124 and by the National Science Foundation under grant DCR 8412582 Authors address Department of Computer Science Cornell University Ithaca NY 14853 Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage the ACM copyright notice and the title of the publication and its date appear and notice is given that copying is by permission of the Association for Computing Machinery To copy otherwise or to republish requires a fee and or specific permission 0 1987 ACM 0734 2071 87 0200 0047 00 75 ACM Transactions on Computer Systems Vol 5 No 1 February 1987 Pages 47 76 48 l K P Birman and T A Joseph broadcasts subject to predetermined ordering constraints process failures and process recoveries Our premise is that event orderings should be subsumed into the communication layer of a distributed system In addition since increasing concurrency generally improves performance in distributed systems we ask how much communication level concurrency can be achieved while still respecting eventordering constraints specified by the computations An important feature of our approach is that it enables a process to deduce the event orderings that will be observed by other processes in the system This simplifies higher level code and permits distributed computations to be implemented with reduced risk of inconsistent actions being taken The approach is formulated in the context of faulttolerant process groups which consist of a collection of processes that are cooperating to perform a distributed computation and interacting using our communication protocols In particular when the term broadcast is used below it refers to the transmission of a message from a process to the members of a process group and possibly some additional processes not to all sites or processes in the system as has often been the case in prior work on broadcast protocols An example will illustrate the class of problems that we address here Consider a process p that is updating a replicated data item maintained by a set of data managers Assume that this update is performed using a reliable broadcast If any data manager receives the broadcast and remains operational all data managers will receive it If p fails a data manager could observe any of several outcomes 1 The data manager receives the update and then detects the failure 2 It detects the failure and receives the update later 3 It detects the failure and the update is not delivered anywhere In an asynchronous system a data manager may not be able to differentiate between the second and third outcomes in finite time Moreover if some data managers experience the first outcome and others the second one the system must still behave correctly One way to address problems such as these is for each process to run an agreement protocol to decide on what action to take after it detects a failure 16 This approach could be slow because it is synchronous and expensive because each process has to run such a protocol Another possibility is to discard messagesthat are received by a process after it has learned that the sender has failed However inconsistencies may arise if messages are discarded by one process but retained by another one that learns of the failure later A third alternative representative of the general approach of this paper is to construct a broadcast protocol that orders messages relative to failure and recovery events such that these problems do not arise In the approach we develop here the data managers would form a fault tolerant process group The communication primitives ensure that every data manager experiences the same sequence of events hence a data manager can perform an update immediately upon receiving the corresponding message Likewise it can take a recovery action immediately after detecting a failure because no other data manager will observe an inconsistent ordering of events ACM Transactions on Computer
View Full Document
Unlocking...