CS514: Intermediate Course in Operating SystemsQuorum replicationToday’s topicA peek at the conclusionStatic membershipStatic membership exampleQuorumsVersions of replicated dataDoing a read is easyDoing a write is trickierThe sequence of eventsWhich votes got counted?What’s with the [t,pid] stuff?ExampleTimestamps seen by the clients?Where are the updates?Slide 17Slide 18When are updates performed?Slide 20What if “my vote wasn’t used?”Recovery from a crashRead requests received when updates are pending wait…Why is this “safe”?Slide 25What about our rule for votes that didn’t count?Mile high: Why this worksObservationsOur protocol is a 3PC!Risk of blockingPerformance implications?Slide 32So why even consider them?Byzantine QuorumsTypical approach?Costs? Benefits?Virtual synchronyState MachinesTake away?CS514: Intermediate Course in Operating SystemsProfessor Ken BirmanVivek Vishnumurthy: TAQuorum replicationWe developed a whole architecture based on our four-step recipeBut there is a second major approach that also yields a complete group communication framework and solutionsBased on “quorum” read and write operationsOmits notion of process group viewsToday’s topicQuorum methods from a mile highDon’t have time to be equally detailedWe’ll exploreHow the basic read/update protocol worksFailure considerationsState machine replication (a form of lock-step replication for deterministic objects)Performance issuesA peek at the conclusionThese methods areWidely known and closely tied to consensusPerhaps, easier to implementBut they have serious drawbacks:Need deterministic componentsAre drastically slower (10s-100s of events/second)Big win?Recent systems combine quorums with Byzantine Agreement for ultra-sensitive databasesStatic membershipSubsets of a known set of processesE.g. a cluster of five machines, each running replica of a database serverMachines can crash or recover but don’t depart permanently and new ones don’t join “out of the blue”In practice the dynamic membership systems can easily be made to work this way… but usually aren’tStatic membership examplepqrstclientQread = 2, Qwrite = 4read write read Write failsTo do a read, this client (or even a group member) must access at least 2 replicasTo do a write, must update at least 4 replicasThis write will fail: the client only manages to contact 2 replicas and must “abort” the operation (we use this terminology even though we aren’t doing transactions)QuorumsMust satisfy two basic rules1. A quorum read should “intersect” any prior quorum write at >= 1 processes2. A quorum write should also intersect any other quorum writeSo, in a group of size N:1. Qr + Qw > N, and2. Qw + Qw > NVersions of replicated dataReplicated data items have “versions”, and these are numberedI.e. can’t just say “Xp=3”. Instead say that Xp has timestamp [7,q] and value 3Timestamp must increase monotonically and includes a process id to break tiesThis is NOT the pid of the update source… we’ll see where it comes fromDoing a read is easySend RPCs until Qr processes replyThen use the value with the largest timestampBreak ties by looking at the pidFor example[6,x] < [9,a] (first look at the “time”)[7,p] < [7,q] (but use pid as a tie-breaker)Even if a process owns a replica, it can’t just trust it’s own data. Every “read access” must collect Qr values first…Doing a write is trickierFirst, we can’t support incremental updates (x=x+1), since no process can “trust” its own replica.Such updates require a read followed by a write.When we initiate the write, we don’t know if we’ll succeed in updating a quorum of processeswE can’t update just some subset; that could confuse a readerHence need to use a commit protocolMoreover, must implement a mechanism to determine the version number as part of the protocol. We’ll use a form of votingThe sequence of events1. Propose the write: “I would like to set X=3”2. Members “lock” the variable against reads, put the request into a queue of pending writes (must store this on disk or in some form of crash-tolerant memory), and send back:“OK. I propose time [t,pid]”Here, time is a logical clock. Pid is the member’s own pid3. Initiator collects replies, hoping to receive Qw (or more) Qw OKsCompute maximum of proposed [t,pid] pairs.Commit at that timeAbort< Qw OKsWhich votes got counted?It turns out that we also need to know which votes were “counted”E.g. suppose there are five group members, A…E and they vote:{[17,A] [19,B] [20,C] [200,D] [21,E]}But somehow the vote from D didn’t get through and the maximum is picked as [21,E]We’ll need to also remember that the votes used to make this decision were from {A,B,C,E}What’s with the [t,pid] stuff?Lamport’s suggestion: use logical clocksEach process receives an update messagePlaces it in an ordered queueAnd responds with a proposed time: [t,pid] using its own process id for the timeThe update source takes the maximumCommit message says “commit at [t,pid]”Group members who’s votes were considered deliver committed updates in timestamp orderGroup members who votes were not considered discard the update and don’t do it, at all.One message is lost…ExamplepqrstclientaProposed write…clientbProposed write…B: [0,p]A: [0,s]A: [0,t]… and a third delayed for a long time… another delayed for a little whileB: [1,s]B: [1,t]A: [1,r]A: [1,p]B: [0,r]Timestamps seen by the clients?A sees (in order):[0,p], [0,s], [1,p], [1,r]This is a write quorum (Qw=4)A picks [1,r] from {p,r,s} as the largest “time”B sees[0,r], [0,p], [1,t], [1,s]B picks [1,t] from {p,r,s,t} as the largest time.Note that [1,r] < [1,t], so A goes firstWhere are the updates?Each member has a queue of pending, uncommitted updatesEven if a member crashes and restarts, it remembers this pending queueExample: at process p the queue has{B: [0,p]}; {A: [1,p]}Neither can be delivered (acted upon) since neither time is committed yetRight now, process p can only respond to reads using the old value of the variable!ExamplepqrstclientaProposed write…clientbProposed write…Commit at [1,r]Commit at [1,t] from {p,r,s,t}ExamplepqrstclientaProposed write…clientbProposed write…Commit at [1,r]
View Full Document