MembershipA Gossip-Style Failure Detection ServiceOutlineMotivationSlide 5System ModelCharacterization & ApproachRequirementsSlide 9Mechanism of the AlgorithmSlide 11Slide 12Slide 13Slide 14Parameters to tune the AlgorithmSlide 16AnalysisSlide 18Slide 19Slide 20Slide 21Slide 22Multi-level GossipingDiscussionConclusionOn Scalable and Efficient Distributed Failure DetectorsSlide 27Motivation & Previous WorkSlide 29Slide 30Slide 31Slide 32Worst-case Network LoadOptimal Worst-case Network LoadSlide 35Slide 36Slide 37Randomized Distributed Failure Detector ProtocolSlide 39Slide 40Slide 41Analysis and ParametersExperimental ResultSlide 44Slide 45SWIM: Scalable Weakly-consistent Infection-style Process Group Memberhip ProtocolQuestions:Group Membership ServiceLarge Scale Process Group requires ScalabilitySWIM Actions:Scalable Failure DetectorsMinimal Network Load L*HeartbeatingSWIM: Failure DetectorSWIM vs HeartbeatingSWIM Failure DetectorDisseminationDissemination OptionsInfection-style DisseminationSlide 60Suspicion MechanismSlide 62Slide 63Time-bounded CompletenessExperiments Set-UpSlide 66Slide 67Slide 68Slide 69Slide 70Answers:Critiques/QuestionsMembershipDongyun JinThuy NguyenCS525 PresentationA Gossip-Style Failure Detection ServiceR. v. Renesse Y. MinskyM. Hayden3OutlineMotivationSystem ModelMechanism of the AlgorithmParameters to tune the AlgorithmAnalysisMulti-level GossipingDiscussionConclusion4MotivationWhy do we need Failure Detection?System ManagementReplicationLoad BalancingWhen we detect a Failure in the system,Not using itMoving responsibilityMaking another copy5Motivation26810161722File “reviews.doc”6System ModelFail-stop model (not Byzantine Fault)Minimal assumptions about the networkSome message lostMost messages are delivered within predetermined, reasonable time.Goal : detecting failures at each hostCompletenessAccuracySpeed7Characterization & Approach{Strong/Weak} COMPLETENESSFailure of any member is detected by {all/some} non-faulty membersSTRONG ACCURACYno mistakeCannot achieve bothGuarantees completeness always and accuracy with high probability.8RequirementsCOMPLETENESSSPEEDEvery failure is detected by some non-faulty member within T time unitsACCURACYThe probability that non-faulty member(not yet detected as failed) is detected as faulty by any other non-faulty member should be less than Pmistake9OutlineMotivationSystem ModelMechanism of the AlgorithmParameters to tune the AlgorithmAnalysisMulti-level GossipingDiscussionConclusion10Mechanism of the Algorithm1243AddressHeartbeat CounterTimeGossiping this list to othersAnd when a node receives it, merge this list with its list1 10120 662 10103 623 10098 634 10111 6511Mechanism of the Algorithm11 10120 662 10103 623 10098 634 10111 65243Gossiping this list to othersAnd when a node receives it, merge this list with its list1 10118 642 10110 643 10090 584 10111 651 10120 702 10110 643 10098 704 10111 65Current time : 70 at node 212Mechanism of the AlgorithmIf the heartbeat has not increased for more than Tfail seconds, the member is considered failedAnd after Tcleanup seconds, it will delete the member from the list13Mechanism of the AlgorithmThe reason they don’t delete a host right after Tfail seconds?11 10120 662 10103 623 10098 554 10111 652431 10120 662 10110 643 10098 504 10111 651 10120 662 10110 644 10111 651 10120 662 10110 643 10098 754 10111 65Current time : 75 at node 214OutlineMotivationSystem ModelMechanism of the AlgorithmParameters to tune the AlgorithmAnalysisMulti-level GossipingDiscussionConclusion15Parameters to tune the AlgorithmGiven Pmistake,Tfail TcleanupRate of gossiping, TgossipChoose Tfail so that erroneous failure detection -> PmistakeChoose Tcleanup to be double of TfailChoose Rate of gossiping depending on network bandwidth16Parameters to tune the Algorithmfailuretimet+Tfailt+Tcleanup=t+2*Tfailt17AnalysisSimplified AnalysisIn a round, only one member can gossip to another member with the success probability ParrivalBy using dynamic method, we get the required number of rounds to achieve a certain quality of detection.As # round increases, Pmistake(r) decreaseDetection Time : Multiply the # round when Pmistake(r) ≤ Pmistake , by Tgossip18AnalysisAs # members increases, the detection time increase19AnalysisAs requirement is loosened, the detection time decrease.20AnalysisAs # failed members increases, the detection time increase significantly21AnalysisThe algorithm is resilient to message loss22OutlineMotivationSystem ModelMechanism of the AlgorithmParameters to tune the AlgorithmAnalysisMulti-level GossipingDiscussionConclusion23Multi-level GossipingIn subnet i, with probability 1/ni24DiscussionWe might be able to use other gossip method to improve the performanceA hybrid of pull and push?Sending only new contents?It is consuming a lot of network resource.25ConclusionA failure detection based on a gossip protocol.Accurate with known probabilityResilient against message lossSimple analysis of this algorithmMulti-level GossipingOn Scalable and Efficient Distributed Failure DetectorsI. GuptaT. D. ChandraG. S. Goldszmidt27OutlineMotivation & Previous WorkProblem StatementWorst-case Network LoadRandomized Distributed Failure Detector ProtocolAnalysis and Experimental ResultDiscussionConclusion28Motivation & Previous WorkMost distributed applications rely on failure detector algorithms to avoid impossibility resultThe Heartbeating algorithms are not as efficient and scalable as claimedPrevious analysis model didn’t consider scalability.29Characterization & Approach{Strong/Weak} COMPLETENESSFailure of any member is detected by {all/some} non-faulty membersSTRONG ACCURACYno mistakeCannot achieve bothGuarantees completeness always and accuracy with high probability.30RequirementsCOMPLETENESSSPEEDEvery failure is detected by some non-faulty member within T time unitsACCURACYThe probability that non-faulty member(not yet detected as failed) is detected as faulty by any other non-faulty member should be less than PM(T)31System ModelA large group of n membersEach member knows each otherCrash (Non-Byzantine) FailureMessage loss rate pmlMember Failure
View Full Document