Unformatted text preview:

6.826—Principles of Computer Systems 2002 Handout 18. Consensus 1 18. Consensus Consensus (sometimes called ‘reliable broadcast’ or ‘atomic broadcast’) is a fundamental building block for distributed systems. Informally, we say that several processes achieve consensus if they all agree on some value. Three obvious applications are: Distributed transactions, where all the processes need to agree on whether a transaction commits or aborts. Each transaction needs a new consensus on its outcome. Membership, where a set of processes cooperating to provide a highly available service need to agree on which processes are currently functioning as members of the set. Every time a process fails or starts working again there must be a new consensus. Electing a leader of a group of processes. Replicated state machines There is a much more general way to use consensus, as the mechanism for coding a highly available state machine, which is the basic tool for building a highly available system. The way to get availability is to have either perfect components or redundancy. Perfect components are too hard, which leaves redundancy. The simplest form of redundancy is replication: have several copies or replicas of each component, and make sure that all the non-faulty components do the same thing. Since any computation can be expressed as a state machine, a replicated state machine can make any computation highly available. Recall the basic idea of a replicated state machine: If the transition relation is deterministic (in other words, is a function from (state, input) to (new state, output)), then several copies of the state machine that start in the same state and see the same sequence of inputs will do the same thing, that is, end up in the same state and produce the same outputs. So if several processes are implementing the same state machine and achieve consensus on the values and order of the inputs, they will do the same thing. In this way it’s possible to replicate an arbitrary computation and thus make it highly available. Of course we can make the order a part of the value of the input by defining some total order on the set of possible inputs.1 We have already seen one application of this replicated state machine idea, in the code for transactions; there the replication takes the form of redoing a sequence of actions that is remembered in a log. Suppose, for example, that we want to build a highly available file system. The transitions are read and write operations on the files (and rename, list, … as well). We make several copies of the file system and make sure that they process read and write operations in the same order. A 1 This approach was first proposed in a classic paper by Leslie Lamport: Time, clocks, and the ordering of events in a distributed system, Comm. ACM 21, 7, July 1978, pp 558-565. This paper is better known for its analysis of the partial ordering of events in a distributed system, which is too bad. 6.826—Principles of Computer Systems 2002 Handout 18. Consensus 2 client sends its operation to some copy, which gets consensus that it is the next operation. Then all the copies do the operation, and one of them returns the result to the client. You might think that a read could be handled by any copy with no need for consensus, since it doesn’t change the state of the file system. Without consensus, however, a read might fail to see the result of a write that finished before the read started, since the read might be handled by a copy whose state is behind the current state of the file system. In many applications the inputs are requests from clients to the replicated service. Typically different clients generate their requests independently, so it’s necessary to agree not only on what the requests are, but also on the order in which to serve them. The simplest way to do this is to number them with consecutive integers, starting at 1. This is what is done in ‘primary copy’ replication, since it’s easy for one place (the primary) to assign consecutive numbers. The literature is full of other schemes for achieving consensus on the order of requests when their total order is not derived from consecutive integers. These schemes label each input with some label from a totally ordered set (for instance, (client UID, timestamp) pairs) and then devise some way to be certain that you have seen all the inputs that can ever exist with labels smaller than a given value. They are complicated, and of doubtful utility.2 People who do it for money use primary copy.3 Unfortunately, consensus is expensive. The section on optimizations at the end of this handout explains a variety of ways to make a replicated state machine run efficiently: leases, transactions, and batching. Spec for consensus Here is the spec for consensus; we have seen it already in handout 8 on history and prophecy variables. The idea is that the outcome of consensus should be one and only one of the allowed values. In the spec there is an outcome variable initialized to nil, and an action Allow(v) that can be invoked any number of times. There is also an action Outcome to read the outcome variable; it must return either nil or a v which was the argument of some Allow action, and if it doesn’t return nil it must always return the same v. More precisely, we have two requirements: Agreement: Every non-nil result of Outcome is the same. Validity: A non-nil outcome equals some allowed value. Validity means that the outcome can’t be any arbitrary value, but must be a value that was allowed. Consensus is reached by choosing some allowed value and assigning it to outcome. This spec makes the choice on the fly as the allowed values arrive. 2 For details, see F. Schneider, Implementing fault-tolerant services using the state-machine approach: A tutorial, ACM Computing Surveys 22 (Dec 1990). This paper is reprinted in the book Distributed Systems, 2nd edition, ed. S. Mullender, Addison-Wesley, 1993, pp 169-197. 3 Jim Gray said this about using locks for concurrency control; see handout 20.6.826—Principles of Computer Systems 2002 Handout 18. Consensus 3 MODULE Consensus [V] EXPORT Allow, Outcome = % data value to agree on VAR outcome : (V + Null) := nil APROC Allow(v) = << outcome = nil => outcome := v [] SKIP >> APROC Outcome() -> (V + Null) = << RET outcome [] RET nil >> END Consensus Note that Outcome is allowed to return nil even after the choice has been made.


View Full Document

MIT 6 826 - Study Notes

Documents in this Course
Consensus

Consensus

10 pages

Load more
Download Study Notes
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Study Notes and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Study Notes 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?