1MembershipDongyun JinThuy NguyenCS525 PresentationA Gossip-Style Failure Detection ServiceR. v. RenesseY. MinskyM. Hayden3OutlineMotivationSystem ModelMechanism of the AlgorithmParameters to tune the AlgorithmAnalysisMulti-level GossipingDiscussionConclusion4MotivationWhy do we need Failure Detection?System ManagementReplicationLoad BalancingWhen we detect a Failure in the system,Not using itMoving responsibilityMaking another copy5Motivation26810161722File “reviews.doc”6System ModelFail-stop model (not Byzantine Fault)Minimal assumptions about the networkSome message lostMost messages are delivered within predetermined, reasonable time.Goal : detecting failures at each hostCompletenessAccuracySpeed27Characterization & Approach{Strong/Weak} COMPLETENESSFailure of any member is detected by {all/some} non-faulty membersSTRONG ACCURACYno mistakeCannot achieve bothGuarantees completeness always and accuracy with high probability.8RequirementsCOMPLETENESSSPEEDEvery failure is detected by some non-faulty member within T time unitsACCURACYThe probability that non-faulty member(not yet detected as failed) is detected as faulty by any other non-faulty member should be less than Pmistake9OutlineMotivationSystem ModelMechanism of the AlgorithmParameters to tune the AlgorithmAnalysisMulti-level GossipingDiscussionConclusion10Mechanism of the Algorithm1243AddressHeartbeat CounterTimeGossiping this list to othersAnd when a node receives it, merge this list with its list6210103263100983661012016510111411Mechanism of the Algorithm162101032631009836610120165101114243Gossiping this list to othersAnd when a node receives it, merge this list with its list6410118165101114581009036410110270101201651011147010098364101102Current time : 70 at node 212Mechanism of the AlgorithmIf the heartbeat has not increased for more than Tfailseconds, the member is considered failedAnd after Tcleanupseconds, it will delete the member from the list313Mechanism of the AlgorithmThe reason they don’t delete a host right after Tfailseconds?1621010325510098366101201651011142436610120165101114501009836410110266101201651011146410110266101201651011147510098364101102Current time : 75 at node 214OutlineMotivationSystem ModelMechanism of the AlgorithmParameters to tune the AlgorithmAnalysisMulti-level GossipingDiscussionConclusion15Parameters to tune the AlgorithmGiven Pmistake,TfailTcleanupRate of gossiping, TgossipChoose Tfailso that erroneous failure detection -> PmistakeChoose Tcleanupto be double of TfailChoose Rate of gossiping depending on network bandwidth16Parameters to tune the Algorithmfailuretimet+Tfailt+Tcleanup=t+2*Tfailt17AnalysisSimplified AnalysisIn a round, only one member can gossip to another member with the success probability ParrivalBy using dynamic method, we get the required number of rounds to achieve a certain quality of detection.As # round increases, Pmistake(r) decreaseDetection Time : Multiply the # round when Pmistake(r) ≤ Pmistake, by Tgossip18AnalysisAs # members increases, the detection time increase419AnalysisAs requirement is loosened, the detection time decrease.20AnalysisAs # failed members increases, the detection time increase significantly21AnalysisThe algorithm is resilient to message loss22OutlineMotivationSystem ModelMechanism of the AlgorithmParameters to tune the AlgorithmAnalysisMulti-level GossipingDiscussionConclusion23Multi-level GossipingIn subnet i, with probability 1/ni24DiscussionWe might be able to use other gossip method to improve the performanceA hybrid of pull and push?Sending only new contents?It is consuming a lot of network resource.525ConclusionA failure detection based on a gossip protocol.Accurate with known probabilityResilient against message lossSimple analysis of this algorithmMulti-level GossipingOn Scalable and Efficient Distributed Failure DetectorsI. GuptaT. D. ChandraG. S. Goldszmidt27OutlineMotivation & Previous WorkProblem StatementWorst-case Network LoadRandomized Distributed Failure Detector ProtocolAnalysis and Experimental ResultDiscussionConclusion28Motivation & Previous WorkMost distributed applications rely on failure detector algorithms to avoid impossibility resultThe Heartbeating algorithms are not as efficient and scalable as claimedPrevious analysis model didn’t consider scalability.29Characterization & Approach{Strong/Weak} COMPLETENESSFailure of any member is detected by {all/some} non-faulty membersSTRONG ACCURACYno mistakeCannot achieve bothGuarantees completeness always and accuracy with high probability.30RequirementsCOMPLETENESSSPEEDEvery failure is detected by some non-faulty member within T time unitsACCURACYThe probability that non-faulty member(not yet detected as failed) is detected as faulty by any other non-faulty member should be less than PM(T)631System ModelA large group of n membersEach member knows each otherCrash (Non-Byzantine) FailureMessage loss rate pmlMember Failure Prebability Pfqml= (1 – pml)qf= (1 – pf)32OutlineMotivation & Previous WorkProblem StatementWorst-case Network LoadRandomized Distributed Failure Detector ProtocolAnalysis and Experimental ResultDiscussionConclusion33Worst-case Network LoadSCALEThe worst-case network load L is close to the optimal worst-case network load as possibleEqual expected load per memberDefinitionThe worst-case network load L of a failure detector protocol is the maximum number of messagestransmitted by any run of the protocol within any time interval of length T, divided by T34Any distributed failure detector algorithm imposes a minimal worst-case network load of , forGroup of size nSatisfying COMPLETENESS, SPEED, ACCURACYGiven values of T and PM(T)TpTPMnLml⋅⋅=)log())(log(*Optimal Worst-case Network Load35Optimal Worst-case Network LoadA group member at a random point in time t is not detected as failed yet and stays non-faulty until at least time t+TIt sends m messages for time TIf all messages are dropped, it need to be detected as failed (SPEED)Its probability should be less than PM(T))(TPMpmml≤)log())(log(mlpTPMm ≥36Worst-case Network
View Full Document