Paxos Made Live - An Engineering Perspective

Tushar Chandra, Robert Griesemer, Joshua Redstone

June 26, 2007

© ACM 2007. This is a minor revision of the work that will be published in the proceedings of ACM PODC 2007.

Abstract

We describe our experience building a fault-tolerant database using the Paxos consensus algorithm. Despite the existing literature in the field, building such a database proved to be non-trivial. We describe selected algorithmic and engineering problems encountered, and the solutions we found for them. Our measurements indicate that we have built a competitive system.

1 Introduction

It is well known that fault-tolerance on commodity hardware can be achieved through replication [17, 18]. A common approach is to use a consensus algorithm [7] to ensure that all replicas are mutually consistent [8, 14, 17]. By repeatedly applying such an algorithm on a sequence of input values, it is possible to build an identical log of values on each replica. If the values are operations on some data structure, application of the same log on all replicas may be used to arrive at mutually consistent data structures on all replicas. For instance, if the log contains a sequence of database operations, and if the same sequence of operations is applied to the (local) database on each replica, eventually all replicas will end up with the same database content (provided that they all started with the same initial database state).
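As a concrete, simplified illustration of this approach, the sketch below replays an identical log of operations against a replica's local key-value store. The operation type and the std::map store are hypothetical stand-ins for exposition, not Chubby's actual database.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical log entry: one database operation agreed on via consensus.
struct Op {
  enum Kind { kPut, kErase } kind;
  std::string key;
  std::string value;  // only meaningful for kPut
};

// Apply a single operation to a replica's local state.
void Apply(std::map<std::string, std::string>* db, const Op& op) {
  if (op.kind == Op::kPut) {
    (*db)[op.key] = op.value;
  } else {
    db->erase(op.key);
  }
}

// Replay the agreed-upon log in order. Two replicas that start from the same
// initial state and replay the same log end with identical database contents.
void Replay(const std::vector<Op>& log, std::map<std::string, std::string>* db) {
  for (const Op& op : log) Apply(db, op);
}
```

The replay step is deterministic; all of the fault-tolerance work lies in agreeing on the contents of the log itself, which is exactly what the consensus algorithm provides.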
This general approach can be used to implement a wide variety of fault-tolerant primitives, of which a fault-tolerant database is just an example. As a result, the consensus problem has been studied extensively over the past two decades. There are several well-known consensus algorithms that operate within a multitude of settings and which tolerate a variety of failures. The Paxos consensus algorithm [8] has been discussed in the theoretical [16] and applied community [10, 11, 12] for over a decade.

We used the Paxos algorithm ("Paxos") as the base for a framework that implements a fault-tolerant log. We then relied on that framework to build a fault-tolerant database. Despite the existing literature on the subject, building a production system turned out to be a non-trivial task for a variety of reasons:

• While Paxos can be described with a page of pseudo-code, our complete implementation contains several thousand lines of C++ code. The blow-up is not due simply to the fact that we used C++ instead of pseudo notation, nor because our code style may have been verbose. Converting the algorithm into a practical, production-ready system involved implementing many features and optimizations – some published in the literature and some not.

• The fault-tolerant algorithms community is accustomed to proving short algorithms (one page of pseudo-code) correct. This approach does not scale to a system with thousands of lines of code. To gain confidence in the "correctness" of a real system, different methods had to be used.

• Fault-tolerant algorithms tolerate a limited set of carefully selected faults. However, the real world exposes software to a wide variety of failure modes, including errors in the algorithm, bugs in its implementation, and operator error. We had to engineer the software and design operational procedures to robustly handle this wider set of failure modes.

• A real system is rarely specified precisely. Even worse, the specification may change during the implementation phase. Consequently, an implementation should be malleable. Finally, a system might "fail" due to a misunderstanding that occurred during its specification phase.

This paper discusses a selection of the algorithmic and engineering challenges we encountered in moving Paxos from theory to practice. This exercise took more R&D effort than a straightforward translation of pseudo-code to C++ might suggest.

The rest of this paper is organized as follows. The next two sections expand on the motivation for this project and describe the general environment into which our system was built. We then provide a quick refresher on Paxos. We divide our experiences into three categories and discuss each in turn: algorithmic gaps in the literature, software engineering challenges, and unexpected failures. We conclude with measurements of our system, and some broader observations on the state of the art in our field.

2 Background

Chubby [1] is a fault-tolerant system at Google that provides a distributed locking mechanism and stores small files. Typically there is one Chubby instance, or "cell", per data center. Several Google systems – such as the Google Filesystem (GFS) [4] and Bigtable [2] – use Chubby for distributed coordination and to store a small amount of metadata.

Chubby achieves fault-tolerance through replication. A typical Chubby cell consists of five replicas, running the same code, each running on a dedicated machine. Every Chubby object (e.g., a Chubby lock or file) is stored as an entry in a database. It is this database that is replicated. At any one time, one of these replicas is considered to be the "master".

Chubby clients (such as GFS and Bigtable) contact a Chubby cell for service. The master replica serves all Chubby requests. If a Chubby client contacts a replica that is not the master, the replica replies with the master's network address. The Chubby client may then contact the master. If the master fails, a new master is automatically elected, which will then continue to serve traffic based on the contents of its local copy of the replicated database. Thus, the replicated database ensures continuity of Chubby state across master failover.

The first version of Chubby was based on a commercial, third-party, fault-tolerant database; we will refer to this database as "3DB" for the rest of this paper. This database had a history of bugs related to replication. In fact, as far as we know, the replication mechanism was not based on a proven replication algorithm and we do not know if it is correct. Given the history of problems associated with that product and the importance of Chubby, we eventually decided to replace 3DB with our own solution based on the Paxos algorithm.

3 Architecture outline

Figure 1 illustrates the architecture of a single Chubby replica. A fault-tolerant replicated log based on the Paxos algorithm sits at the bottom of the protocol stack. Each replica maintains a local copy of the log. The Paxos algorithm is run repeatedly as ...
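To make the client/master interaction described in Section 2 concrete, the sketch below shows client-side handling of the redirect: if the contacted replica is not the master, it answers with the master's address and the client retries there. The reply structure, the RPC callback, and the hop limit are hypothetical illustrations, not Chubby's actual client API.

```cpp
#include <functional>
#include <optional>
#include <string>

// Hypothetical reply from a Chubby replica: either the request was served by
// the master, or the replica points the client at the current master.
struct Reply {
  bool served_by_master;
  std::string master_address;  // meaningful only when served_by_master is false
  std::string payload;         // meaningful only when served_by_master is true
};

// Stand-in for the real RPC layer: contacts one replica and returns its reply,
// or std::nullopt if that replica is unreachable.
using SendFn = std::function<std::optional<Reply>(const std::string& address,
                                                  const std::string& request)>;

// Client-side redirect handling: if the contacted replica is not the master,
// retry at the address it reports, bounding the number of hops.
std::optional<std::string> CallChubby(std::string address,
                                      const std::string& request,
                                      const SendFn& send) {
  for (int hops = 0; hops < 5; ++hops) {
    std::optional<Reply> reply = send(address, request);
    if (!reply) return std::nullopt;        // replica unreachable; caller retries
    if (reply->served_by_master) return reply->payload;
    address = reply->master_address;        // follow the redirect to the master
  }
  return std::nullopt;  // gave up chasing redirects
}
```

Bounding the number of hops is a choice of this sketch; the point is simply that any replica can redirect a client to the current master, so clients need not know in advance which replica holds that role.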

