DOC PREVIEW
MIT 6 852 - The ABCD’s of Paxos

This preview shows page 1-2-3-4-5 out of 16 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 16 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 16 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 16 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 16 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 16 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 16 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

1 The ABCD’s of Paxos Butler W. Lampson Microsoft 180 Lake View Ave Cambridge, MA 02138 1-617-547-9580 [email protected] Abstract We explain how consensus is used to implement replicated state machines, the general mechanism for fault-tolerance. We describe an abstract version of Lamport’s Paxos algorithm for asynchro-nous consensus. Then we derive the Byzantine, classic, and disk versions of Paxos from the abstract one, show how they are re-lated to each other, and discuss the safety, liveness, and perform-ance of each one. Categories and Subject Descriptors D.2.4 [Software] Correctness Proofs—abstraction function, in-variant, simulation; Fault Tolerance—Byzantine, Paxos, repli-cated state machine, view change. [Theory]—consensus, liveness, safety. General Terms Algorithms, Reliability, Security, Theory Keywords Paxos, asynchronous consensus, fault-tolerant, replication, Lam-port, Byzantine, state machine 1 Introduction We give an abstract version AP of Lamport’s Paxos algorithm for asynchronous consensus that captures its idea, but is not directly implementable because some of the actions touch non-local state. Then we give three implementations of AP that solve this prob-lem in different ways, together with the abstractions and invari-ants of their simulation proofs: Classic Paxos, CP, from Lamport’s original paper [9] and from Liskov and Oki [13], tolerates n/2 stopped processes and requires conditional write (compare and swap) opera-tions on persistent state variables. Disk Paxos, DP, from Gafni and Lamport’s recent paper [5], is a generalization of AP and CP that requires only read and write operations on persistent state variables. Byzantine Paxos, BP, described by Castro and Liskov in [1] and [2], tolerates n/3 processes with arbitrary faults. Their papers also describe a replicated state machine implementa-tion, based on BP, that has good performance and the same fault tolerance. AP, CP, and BP are summarized in the appendix. I’ve tried to answer all the questions I had when I read these pa-pers, about how simple the algorithms can be made, the minimum conditions for them to work, and how they are related. The role that General played in the original Paxos paper makes it especially appropriate for me to write about a Byzantine version. I don’t know whether a practical algorithm could be developed in this top-down fashion. Certainly the three that we give were not invented in this way, but our exposition does clarify the relation-ships among them and perhaps will suggest other variations.1 The top-down development often works by introducing new variables that are related to the abstract variables by an invariant, and modifying the actions so that they depend only on the new variables and not on the abstract ones. The abstract variables thus become history variables in the proof. 1.1 Replicated state machines The main application for fault-tolerant consensus is replicated state machines. This is the fundamental technique for general fault-tolerance, first described by Lamport [7]. It goes like this: Cast your problem as a deterministic state machine that takes input requests for state transitions, called steps, from the cli-ent, performs the steps, and returns the output to the client. It’s hard to find a problem that can’t be done this way. Implement the state machine and make n copies of it, often called ‘replicas’. Using consensus, feed all the replicas the same input se-quence. Then they all generate the same output sequence. If a replica fails, it can recover by starting in the initial state and replaying all the inputs. As with transaction systems [6], it can speed up this complete replay by starting with a previous state instead of at the beginning. The steps of the state machine can be arbitrarily complicated as long as they are deterministic, atomic, and strictly local to one replica. To make a big step atomic, use transactions [6]. Of course a replica can involve more than one physical machine; in fact, like any good idea in computer science, the entire method can be ap-plied recursively. Even reading the state must be done with a step, unless the cli-ent is willing to accept output based on an arbitrarily old state. If a read also returns the sequence number of the last step that affected it, the client can pay for better read performance with complexity by doing an occasional step to learn the current step, and then 1 There is a similar treatment of reliable messages in [10] and [11].2 accepting read outputs that are not too far out of date. With a sloppy notion of real time the state machine can give the client a bound on number of seconds a read might be out of date. Since fault-tolerant consensus makes all the inputs persistent, exactly-once semantics needs no extra persistent writes. The state machine does have to check that an input hasn’t been accepted already, which it can do by caching the sequence number or time-stamp of the most recent input from each client. The most common application is to data storage systems such as a file system [13]. The method is much more general, however. For instance, state machine actions can be used to change the sets of processes that form the various quorums on which consensus depends, so that no special algorithms are needed to deal with processes that arrive and depart in an orderly way. Many applications combine a replicated state machine with leases, which are locks on portions of the state. A lease differs from a lock because it times out, so the system doesn’t block in-definitely if the leaseholder fails. To keep the lock the holder must renew the lease. There is an obvious tradeoff between the cost of frequent renewals and the cost of waiting for the lease to expire. A client (or a subordinate state machine) with a lease can do arbitrary reads and writes of the leased state without taking any steps of the main state machine, except for a single step that com-bines all the writes. The most important use of leases is to allow holders to cache part of the state. Like locks, leases can be hierar-chical and can have different modes such as shared and exclusive. Consensus is also useful for group membership and transaction commit. 1.2 The idea of Paxos A consensus algorithm decides on one from a set of input values (such as the state machine inputs). It uses a set of processes, called agents in this paper. The simplest form of consensus de-cides when a majority of


View Full Document

MIT 6 852 - The ABCD’s of Paxos

Download The ABCD’s of Paxos
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view The ABCD’s of Paxos and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view The ABCD’s of Paxos 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?