EpidemicsContext“This course is designed to cover some of the key ideas that haveproved useful or are expected to be useful for designing andbuilding tomorrow's distributed systems. The course focuses onfundamentals.”Some fundamentals Atomic Commit Reliable Broadcast Consensus Isis Toolkit [Birman, van Renesse et al.]Y. Amir, D. Dolev, S. Kramer, and D. Malki. Transis: A communication sub-system for high availability. In Proc. 22nd Annual International Symposium on Fault-Tolerant Computing, pages 76--84, July 1992.Scalability Long message delays Unreliable communication Network partitionsA. Demers, D. Greene, C. Hauser, W. Irish, J. Larson, S. Shenker, H. Sturgis,D. Swinehart, and D. Terry. Epidemic algorithms for replicated databasemaintenance. In Proceedings of the 6th Annual ACM Symposium on Principlesof Distributed Computing, Vancouver, BC, August 1987, pp. 1-12.Setup Database replicated at thousands of sitesacross the nation Unreliable point-to-point links Crash failure model Updates injected at a single site Updates must propagate to all other sites Want contents of all replicas to be identical ifupdates stop and system left aloneNotation S is a set of n sites (replicas) Pretend there is only one database entry v is the value of the database entry t is the timestamp associated with v Timestamps are totally ordered All sites are consistent iff s, s’ S : s.v = s’.vBroadcastIdea: If an update is injected at site s, then s mails theupdate to every other site in SUpon an update at site s: for each s’ S \ {s} dosend (Update, s.v, s.t) to s’ endloopUpon receiving (Update, v’, t’): if s.t < t’ thens.v := v’s.t := t’ endifWeakness: send is unreliable what if crashes occurAnti-entropyIdea: Every site regularly chooses another site at randomand exchanges database contents with it to resolvedifferences.Each server s periodically executes: for some s’ S \ {s} doResolveDifference(s,s’) endloopPush: ResolveDifference(s,s’) { if s.t > s’.t then s’.v := s.vs’.t = s.t endifPull: ResolveDifference(s,s’) { if s.t < s’.t then s.v := s’.vs.t := s’.t endifPush vs. pull analysisLet pi be the probability that a site still has notbeen updated by the ith try at anti-entropyFor large values of n: Push: pi+1 = pi e-1Pull: pi+1 = (pi )2Converges much faster for small piExample using pull mechanisms1s2s3s4(v, 1)Updatebroadcast(v, 1)anti-entropy(v, 1)(v, 2)broadcast(v, 2)(v, 2)(v, 2)Anti-entropy facts Guaranteed to eventually propagate updateto everyone with probability 1 Anti-entropy infects everyone in O(log n)time for uniformly chosen sites Good backup mechanism for direct mail Weakness: must go through entire databaseEpidemics Complex epidemics Sites can become “cured” Terminology: susceptible, infective, removed Strengths: sites do not mail everyone and do nothave to enumerate entire database Weakness: some may be left susceptible Resilient against unreliable communication Anti-entropy is a simple epidemicRumor mongering (informal) All sites start out susceptible When a site s receives a new update, itbecomes infective s periodically chooses another site s’ If s’ does not know the rumor, then itreceives the update and also becomesinfective If s’ already knows the rumor, then sbecomes removed with some probabilityRumor mongering protocolFor a site s:let L be a list of (initially empty) infective updatesperiodically:for some s’ S \ {s} do for each update u L send u to s’if s’ already knows about u then remove u from L with probability 1/k end loopend loopupon receiving new update u:insert u into LAnalysis of rumor mongeringi = fraction of infective sitess = fraction of susceptible sitesr = fraction of removed sitessidtds=isksidtdi)1(1+=kskkdsdi 11++=dskskkdi ++=11cksskksi +++=ln1)(kkc1+=()ksskksiln11)( ++=()()skes+=11Rumor mongering facts Expected fraction of susceptible sitess = e-(k+1)(1-s)Back up mongering with anti-entropy Redistribution of update Rumor mongering vs. broadcast Consider case when half of sites receive update Old rumors die fastDeath and its consequences Replace deleted item with a deathcertificate = (NIL, tnow) Provided no further updates, a deathcertificate eventually “deletes” all copies ofan item…but when? Problem: what if a single site is down?Death certificates Death certificate contains two values t – time of deletion t1 – threshold value, all servers discard deathcertificate after time t + t1Dormant death certificates Death certificate contains four values R – set of sites that keep a dormant death certificateafter t + t1 t – time of deletion t1 – all servers not in R discard death certificate aftertime t + t1 t2 – all servers discard the certificate after t + t2 Interaction with anti-entropy?Dormant death certificates Death certificate contains five values R – set of sites that keep a dormant death certificateafter ta + t1 t – time of deletion ta – time of activation t1 – all servers not in R discard certificate after ta + t1 t2 – all servers discard the certificate after ta + t2Bimodal multicastK.P. Birman, M. Hayden, O. Ozkasap, Z. Xiao, M. Budiu, and Y. Minsky. BimodalMulticast. ACM Transactions on Computer Systems. 17(2): 41-88. May 1999Class I – Strong reliability Properties: agreement, validity, termination, integrity Expensive and limited scalability Unpredictable performance undercongestion Degraded throughput under transientfailures Why? Full buffers and flow controlClass II – Best effort reliability “If a participating process discovers a failure,a reasonable effort is made to overcome it.” Better scalability than Class I protocols Difficult to reason about systems withoutconcrete guaranteesBimodal multicast claims Provides predictable reliability and steadythroughput under highly perturbed conditions Very small probability only a few processesdeliver High probability almost everyone delivers “Vanishingly small probability” in betweenSystem assumptions At least 75% of healthy processes willrespond to incoming messages within aknown bound 75% of messages will get through thenetwork Crash failure
View Full Document