Unformatted text preview:

iSYee JiunSongCornell University. CS5410 Fall 2008. Fault Tolerant Systemsy By now, probably obvious that systems re liability/availability is a key concerny Downtime is expensivey Replication is a general technique for providing fault toleranceReplicationunreplicated serviceclientserverReplicationunreplicated servicereplicated serviceclientclientserverserverreplicasreplicasReplicationy Applications as deterministic state machinesy Reduce the problem of replication to that of agreementy Ensure that replicas process requests in the same order:Sft li t b iitt bh iySafety: clients never observe inconsistent behaviory Liveness: system is always able to make progressTraditional Assumptionsy Synchronyy Bounded difference in CPU speedsy Bounded time for message deliver yy Benign/Crash faultsg/y When machines fail, they stop producing output immediately, and forever. What if these assumptions don’t hold?pAsynchronyy In the real world, systems are never quite as synchronous as we would likeAh i iii i l yAsynchrony is a pessimistic assumption to capture real world phenomenonyMessages will eventually be delivered processors will yMessages will eventually be delivered, processors will eventually complete computation. But no bound on time.y In general:y OK to assume synchrony when providing liveness( ) h f fy Dangerous (NOT OK) to assume synchrony for safetyByz antine Faultsy Crash faults are a strong assumptiony In practice, many kinds of problems can manifest:y Bit flip in memoryy Intermittent network errorsMlii tt kyMalicious attacksy Byzantine faults: strongest failure modelyCompletely arbitrary behavior of faulty nodesyCompletely arbitrary behavior of faulty nodesByz antine Agreementy Can we build systems that tolerate Byzantine failures and asynchrony? YES!U li i Bi l yUse replication + Byzantine agreement protocol to order requestsyCostyCosty At least 3t+1 replicas (5t+1 for some protocols)yCommunication overheadCommunication overheady Safety in the face of Byzantine faults and asynchronyy Liveness in periods of synchronyp yyPBFTy Castro and Liskov. “Practical Byzantine Fault Tolerance.” OSDI99.Th fi li i lih h i yThe first replication algorithm that integra tes Byzantine agreementyDemonstrates that Byzantine FaultTolerance is not yDemonstrates that Byzantine Fault‐Tolerance is not prohibitively expensiveySparked off a thread of research that led to the Sparked off a thread of research that led to the development of many Byzantine fault‐tolerant algorithms and systemsPBFT: Overviewy Servers are replicated on 3t+1 nodesy One particular server is called the primary. Also called h ld h dithe leader or the coordinatory A continuous period of time during which a server stays as the primaryis called a view or a configurationstays as the primaryis called a view, or a configurationPBFT NlOtiPBFT: Normal Operationy Fixed primary within a viewy Client submits request to primaryy Primary orders requests and sends them to all nodesy Client waits for identical replies from at least t+1 nodesreplicasclientprimaryviewClienty Waits for t+1 identical repliesy Why is this sufficient?yAt most tfailures. So at least one of the (t+1) replies must be from a correct node.yPBFT ensures that nonfaulty nodes never go into a bad yPBFT ensures that non‐faulty nodes never go into a bad state, so their responses are always valid.y Difficult: How to ensure this is the case?y If client times out before receiving sufficient replies, broadcast request to all replicasPhase 1: PrepreparePhase 1: Pre‐preparerequest : mPRE-PREPARE,v,n,m0primary = replica 0replica 1replica2replica 2replica 3failPrimary assigns the request with a sequence number nReplicas accept pre-prepare if:ii•in view v• never accepted pre-prepare for v,n with different requestPhase 2: PreparePhase 2: PreparemPREPARE,v,n,D(m),11mpreparereplica 0replica 1replica2replica 2replica 3failcollect pre-prepare and 2fmatching preparespppgp pP-certificate(m,v,n)Phase 2: Preparey Each replica collects 2f prepare msgs:y 2f msgs means that 2f+1 replicas saw the same pre‐prepare msg At least f+1 of these must be honestmsg. At least f+1 of these must be honesty Since there are only 3f+1 replicas, this means that there cannot exist more than 2f replicas that received a conflicting pre‐prepare msg or claim to have received oneprepare msg or claim to have received oney All correct replicas that receive 2f prepare msgs for a <v, n, m> tuple received consistent msgsPhase 3: CommitPhase 3: CommitmcommitCOMMIT,v,n,D(m),22repliesreplica 0replica1replica 1replica 2failreplica 3failRequest m executed after:havingCcertificate(m v n)all collect 2f+1 matching commitsC-certificate(m,v,n)•having C-certificate(m,v,n)• executing requests with sequence number less than nPhase 3: Commity If a correct replica p receives 2f+1 matching commit msgsA l f li hi yAt least f+1 correct replicas sent matching msgsy No correct replica can receive 2f+1 matching commit msgsthat contradict with the ones that psawmsgsthat contradict with the ones that psawy In addition, phase 2 ensures that correct replicas send the same commit msgs, so, tog ether with the view change protocol, correct replicas will eventually commitWhy does this work?y When a replica has collected sufficient prepared msgs, it knows that sufficient msgs cannot be collected for any other request with that sequence number in that any other request with that sequence number, in that viewyWhen a replica collects sufficient commitmsgs it When a replica collects sufficient commitmsgs, it knows that eventually at least f+1 non‐faulty replicas will also do the samey Formal proof of correctness is somewhat involved. Refer to paper. Drop by my office (320 Upson) if you need helpneed help.View Changey What if the primary fails? View change!yProvides liveness when the primary failsyProvides liveness when the primary failsy New primary = view number mod NyTriggered by timeouts Recall that the client yTriggered by timeouts. Recall that the client broadcasts the request to all replicas if it doesn’t rec eiv e sufficient consistent requests after some amount of time. This triggers a timer in the replicas.View Changey A node starts a timer if it receiv es a request that it has not executed. If the timer expires, it starts a view change protocolchange protocol.y Each node that hits the timeout broadcasts


View Full Document

CORNELL CS 5410 - Lecture Notes

Download Lecture Notes
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Lecture Notes and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Lecture Notes 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?