CORNELL CS 614 - Study Notes

Using Time Instead of Timeout for Fault-Tolerant Distributed Systems

LESLIE LAMPORT
SRI International

A general method is described for implementing a distributed system with any desired degree of fault-tolerance. Instead of relying upon explicit timeouts, processes execute a simple clock-driven algorithm. Reliable clock synchronization and a solution to the Byzantine Generals Problem are assumed.

Categories and Subject Descriptors: C.2.4 [Computer-Communications Networks]: Distributed Systems--network operating systems; D.1.3 [Programming Techniques]: Concurrent Programming; D.4.1 [Operating Systems]: Process Management--synchronization; D.4.3 [Operating Systems]: File Systems Management--distributed file systems; D.4.5 [Operating Systems]: Reliability--fault-tolerance; D.4.7 [Operating Systems]: Organization and Design--distributed systems; real-time systems

General Terms: Design, Reliability

Additional Key Words and Phrases: Clocks, transaction commit, timestamps, interactive consistency, Byzantine Generals Problem

1. INTRODUCTION

In programming asynchronous multiprocess systems, the customary approach has been to make process synchronization independent of the execution rates of any components. This requires synchronization algorithms in which one process must wait for another to do something before it can proceed. In distributed systems, this means waiting for a message from the other process. These time-independent algorithms cannot be fault-tolerant because a process could fail by doing nothing, and such a failure manifests itself only as a reduction of the process's execution rate [5].

The usual method of obtaining fault-tolerant synchronization in distributed systems is to add timeouts to time-independent algorithms.
A process sets a timer whenever it begins waiting for another process, and a failure is assumed to have occurred if a certain period of time elapses without a response from the other process. A number of fault-tolerant synchronization algorithms have been proposed that use timeouts in this way. However, these algorithms provide only a limited degree of fault-tolerance. Every previously published synchronization algorithm that we know of can be defeated by the failure of a single component. The "fault-tolerant" algorithms are tolerant only of restricted kinds of failure, and can fail if a faulty process sends conflicting information to two other processes. Moreover, most of the algorithms provide only ad hoc solutions to individual problems. An exception is the method of [12], which provides a general approach to distributed synchronization that is similar to ours, but it assumes "nice" failures.

This work was supported in part by the National Science Foundation under Grant No. MCS 78-16783, and in part by the Ballistic Missile Defense Agency under Contract No. DASG60-78-C-0046.
Author's address: Computer Science Laboratory, SRI International, 333 Ravenswood Avenue, Menlo Park, CA 94025.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1984 ACM 0164-0925/84/0400-0254 $00.75
ACM Transactions on Programming Languages and Systems, Vol. 6, No. 2, April 1984, 254-280.

The use of timeouts rests upon assumptions about the real-time behavior of the system. It is assumed that a waiting process can infer from the occurrence of a timeout that a failure has occurred.
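
The timeout method described above can be sketched in a few lines of Python. This is only an illustration and not code from the paper: the function name `wait_with_timeout`, the use of a `queue.Queue` as the waiting process's inbox, and the 0.05-second timeout are all assumptions made for the example.

```python
import queue


def wait_with_timeout(inbox, timeout_seconds):
    """Wait for a reply on `inbox` (hypothetical message queue).

    Returns ("reply", message) if something arrives in time, otherwise
    ("suspected_failure", None) once the timer expires.
    """
    try:
        # Blocks until a message arrives or the timer runs out.
        reply = inbox.get(timeout=timeout_seconds)
        return ("reply", reply)
    except queue.Empty:
        # No response within the period: a failure is assumed.
        return ("suspected_failure", None)


if __name__ == "__main__":
    inbox = queue.Queue()
    inbox.put("ack")                       # the other process answered once
    print(wait_with_timeout(inbox, 0.05))  # a reply is available
    print(wait_with_timeout(inbox, 0.05))  # nothing arrives, timer expires
```

The second call reports only that no reply arrived in time. Treating that as a failure is the inference the waiting process makes, and it is valid only under an assumption about how quickly a nonfaulty process responds.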
In this paper, we show how assumptions about real-time behavior can be used to infer information other than the existence of a failure. We describe a general algorithm to achieve any desired form of synchronization with any desired degree of fault-tolerance--including the ability to tolerate totally arbitrary and malicious failures. The algorithm is based on the use of absolute times instead of timeouts, and can be considered an extension of the approach of [6], achieving fault-tolerance by using physical instead of logical clocks.

The generality of the algorithm is demonstrated by applying it to several distributed computing problems. A resource allocation problem is considered in some detail, and other applications are briefly sketched, including a robust distributed database and a reliable transaction commit protocol.

Achieving high reliability is expensive; the exact cost is discussed in the conclusion. It may therefore be impractical to implement the entire system with our algorithm. In this case, the algorithm can be used to implement a reliable synchronization kernel, which can maintain system integrity even though component failures cause the loss of some functionality. This is illustrated by a brief discussion of three examples--a distributed file system, the transaction commit protocol, and a reliable fault detection scheme.

2. THE ASSUMPTIONS

We model a distributed system as a network of processes joined by communication links (not necessarily completely connected). Each process executes an event-driven algorithm, where an event is the arrival of a message or the process's clock reaching a certain value. The use of timeouts rests upon the following assumption.

Assumption UC1.
For any event e that causes process i to send a message to process j, there is a δ such that if event e occurs at time T and processes i and j and the communication link joining them are nonfaulty, then the message arrives at process j by time T + δ--where time T and the time when the message arrives are either both measured according to process i's clock, or are both measured according to process j's clock.

In general, the value of δ may depend upon the processes i and j and the particular event e. For simplicity, we assume that there is a single constant δ that works for all i, j, and e. This saves us from having to keep track of many different δ's. The important assumption being made
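
As a rough illustration of Assumption UC1, the following Python sketch expresses the delivery bound as a deadline check. It is not taken from the paper: the names `Message`, `arrival_deadline`, `consistent_with_uc1`, and `can_rule_out_send_at`, and the value chosen for `DELTA`, are hypothetical, and both timestamps are assumed to be readings of the same clock, as UC1 requires.

```python
from dataclasses import dataclass
from typing import Optional

DELTA = 5.0  # hypothetical bound; the text only assumes some constant exists


@dataclass
class Message:
    sent_at: float               # time T of the event that caused the send
    arrived_at: Optional[float]  # arrival time on the same clock, None if never


def arrival_deadline(sent_at: float, delta: float = DELTA) -> float:
    """Latest arrival time UC1 allows when i, j, and the link are nonfaulty."""
    return sent_at + delta


def consistent_with_uc1(msg: Message, delta: float = DELTA) -> bool:
    """True if the message arrived by time T + delta."""
    return (msg.arrived_at is not None
            and msg.arrived_at <= arrival_deadline(msg.sent_at, delta))


def can_rule_out_send_at(t: float, now: float, delta: float = DELTA) -> bool:
    """Contrapositive reading of UC1 for a receiver that has seen no message
    generated at time t: once its clock has passed t + delta, no nonfaulty
    sender can have sent one over a nonfaulty link."""
    return now > t + delta


if __name__ == "__main__":
    print(consistent_with_uc1(Message(sent_at=10.0, arrived_at=14.0)))
    print(consistent_with_uc1(Message(sent_at=10.0, arrived_at=16.0)))
    print(can_rule_out_send_at(10.0, now=15.5))
```

The last function shows the kind of conclusion a bounded delay permits beyond "a failure occurred": silence past the deadline says something about what was not sent, provided the components involved are nonfaulty.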

