Unformatted text preview:

4/16/08 10:17:35 PM2pcpaxos1Brewer/Hellerstein CS262 Spring 2008: 2PC and PaxosA theme: two-phase protocolsCourtesy Jim Gray:Marriage Ceremony: "Do you?" "I do!" "I now pronounce you..."Theater: "Ready on the set?" "Ready!" "Action!"Contract Law: Offer. Signature. Deal/lawsuit.Actually these protocols are pretty simpleFussy to prove they're safe/correctEven fussier to tune them and maintain proofs, and that's where much of the sweat goes.Two Phase Commit and Logging in R*SetupRolescoordinator (transaction manager or TM)subordinate (resource manager, or RM)Goal: All or nothing agreement on commit (single subordinate veto is enough to abort).Also, integrate properly with log processing and recovery.AssumptionsUpdate in place, WALbatch-force log recordsDesired characteristicsguaranteed xact atomicityability to "forget" outcome of commit ASAPminimal log writes and message trafficoptimized performance in no-failure case (the "fast path")exploitation of completely or partially R/O xactsmaximize ability to perform unilateral abortIn order to minimize logging and comm:rare failures do not deserve extra overhead in normal processingHierarchical commit better than 2PThe basic 2PC protocol with logging (normal processing):Coordinator Log | Messages | Subordinate Log | PREPARE | | | prepare*/abort* | VOTE Y/N |commit*/abort* | | | C/A | | | commit*/abort* | ACK | end | |Rule: never need to ask something that you used to know! Log before ACKing.Since subords force abort/commit before ACKing, they never need to ask coord to remind them about final outcome.Costs:subords: 2 forced log-writes, 2 msgscoord: 1 forced log write, 1 async log write, 2 msgs per subordtotal: 4n messages, 2N+1 log writes. Delays: 4 message delays, 3 sync writes.we'll tune this down below2PC and failuresNote: 2PC systems are not available during a coordinator failure! Yuck!! (See Paxos Commit, below, for discussion)what about subordinate failure?Recovery process protocol:1 On restart, read log and accumulate committing xacts info in main mem2 if you discover a local xact in the prepared state, contact coord to find out fate3 if you discover a local xact that was not prepared, UNDO it, write abort record, forget4 if a local xact was committing (i.e. this is the coord), then send out COMMIT msgs to subords that haven't ACKed Similar for aborting.Upon discovering a failure elsewhereIf a coord discovers that a subord is unreachable...4/16/08 10:17:35 PM2pcpaxos2Brewer/Hellerstein CS262 Spring 2008: 2PC and PaxosTwo Phase Commit and Logging in R*2PC and failuresUpon discovering a failure elsewhereIf a coord discovers that a subord is unreachable...while waiting for its vote: coord aborts xact as usualwhile waiting for an ACK: coord gives xact to recovery mgrIf subord discovers that coord is unreachable...if it hasn't sent a YES vote yet, do unilateral abortif it has sent a YES vote subord gives xact to recovery mgrIf a recovery mgr receives an inquiry from a subord in prepared stateif main mem info says xact is committing or aborting, send COMMIT/ABORTif main mem info says nothing...?An aside: Hierarchical 2PCIf you have a tree-shaped process graphroot (which talks to user) is a coordleaves are subordsinterior nodes are bothafter receiving PREPARE, propagate to children.vote after children. any NO below causes a NO vote (this is like stratified aggregation!)after receiving COMMIT record, force-write log, ACK to parent, and propagate to children. similar for ABORT.Tuning approach 1: Presumed Abortrecall... if main-mem says nothing, coord says ABORTSO... coord can forget a xact immediately after deciding to abort it! (write abort record, THEN forget)abort can be async writeno ACKS required from subords on ABORTno need to remember names of subords in abort record, nor write end record after abortif coord sees subord has failed, need not pass xact to recovery system; can just ABORT.Look at R/O xacts:subords who have only read send READ VOTEs instead of YES VOTEs, release locks, write no log recordslogic is: READ & YES = YES, READ & NO = NO, READ & READ = READif all votes are READ, there’s no second phasecommit record at coord includes only YES sitesTallying up the R/O work: N+1 msgs, no disk writes. Delays: 1 msg delay.Tuning approach II: Presumed CommitShould be the fast path, can we do it fast?Inverting the logic:require ACK for ABORT, not COMMIT!subords force abort* record, not commitno info? presume commit!Problem!subord preparescoord crasheson restart, coord aborts and forgetssubord asks about the xact, coord says "no info = commit!"subord commits, but everybody else does not.Solution:coord records names of subords on stable storage before allowing them to prepare ("collecting" record)then it can tell them about aborts on restarteverything else analogous (mirror) to P.A.Tallying up R/O work: N+1 msgs, 2 diskwrites (collecting*, commit), Delays: 1 diskwrite delay, 1 msg delay.Costs of the variants2PC commit: 2N+2 writes, 4N messages. Delays: 3 write delays, 4 msg delaysPA commit: 2N+2 writes, 4N messages. Delays: 3 write delays, 4 msg delaysPC commit: 2N+2 writes, 3N messages. Delays: 3 write delays, 3 msg delays.PA always beats plain 2PCPA beats PC for R/O transactionsfor xacts with only one writer subord, PC beats PA (PA has an extra ACK from subord)for n-1 writer subords, PC much better than PA (PA forces n-1 times at subords on commits, sends n extra msgs)choice between PA and PC could be made on a xact-by-xact basis!4/16/08 10:17:35 PM2pcpaxos3Brewer/Hellerstein CS262 Spring 2008: 2PC and PaxosTwo Phase Commit and Logging in R*Costs of the variantsfor n-1 writer subords, PC much better than PA (PA forces n-1 times at subords on commits, sends n extra msgs)choice between PA and PC could be made on a xact-by-xact basis!"query" optimization? Overlog?PaxosSetup3 roles being playedA single Proposer ("Leader"), proposes "values"Leader-election protocol is well-known and predates this workAcceptor, part of protocol to decide on "choosing" valuesLearner, hears about "chosen" valuesGoal: majority agreement to "choose" a proposed valueImagine a single Consensus Box. Now emulate that with a distributed set of


View Full Document

Berkeley COMPSCI 262A - Lecture Notes

Download Lecture Notes
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Lecture Notes and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Lecture Notes 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?