U of I CS 525 - A Gossip-Style Failure Detection Service - D1672504

Home> Schools> University of Illinois> Computer Science (CS) > CS 525> A Gossip-Style Failure Detection Service

DOC PREVIEW

U of I CS 525 - A Gossip-Style Failure Detection Service

School name University of Illinois

Course Cs 525- Advanced Graphics Processor Programming

Pages 12

This preview shows page 1-2-3-4 out of 12 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 12 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 12 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 12 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 12 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 12 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

1MembershipDongyun JinThuy NguyenCS525 PresentationA Gossip-Style Failure Detection ServiceR. v. RenesseY. MinskyM. Hayden3OutlineMotivationSystem ModelMechanism of the AlgorithmParameters to tune the AlgorithmAnalysisMulti-level GossipingDiscussionConclusion4MotivationWhy do we need Failure Detection?System ManagementReplicationLoad BalancingWhen we detect a Failure in the system,Not using itMoving responsibilityMaking another copy5Motivation26810161722File “reviews.doc”6System ModelFail-stop model (not Byzantine Fault)Minimal assumptions about the networkSome message lostMost messages are delivered within predetermined, reasonable time.Goal : detecting failures at each hostCompletenessAccuracySpeed27Characterization & Approach{Strong/Weak} COMPLETENESSFailure of any member is detected by {all/some} non-faulty membersSTRONG ACCURACYno mistakeCannot achieve bothGuarantees completeness always and accuracy with high probability.8RequirementsCOMPLETENESSSPEEDEvery failure is detected by some non-faulty member within T time unitsACCURACYThe probability that non-faulty member(not yet detected as failed) is detected as faulty by any other non-faulty member should be less than Pmistake9OutlineMotivationSystem ModelMechanism of the AlgorithmParameters to tune the AlgorithmAnalysisMulti-level GossipingDiscussionConclusion10Mechanism of the Algorithm1243AddressHeartbeat CounterTimeGossiping this list to othersAnd when a node receives it, merge this list with its list6210103263100983661012016510111411Mechanism of the Algorithm162101032631009836610120165101114243Gossiping this list to othersAnd when a node receives it, merge this list with its list6410118165101114581009036410110270101201651011147010098364101102Current time : 70 at node 212Mechanism of the AlgorithmIf the heartbeat has not increased for more than Tfailseconds, the member is considered failedAnd after Tcleanupseconds, it will delete the member from the list313Mechanism of the AlgorithmThe reason they don’t delete a host right after Tfailseconds?1621010325510098366101201651011142436610120165101114501009836410110266101201651011146410110266101201651011147510098364101102Current time : 75 at node 214OutlineMotivationSystem ModelMechanism of the AlgorithmParameters to tune the AlgorithmAnalysisMulti-level GossipingDiscussionConclusion15Parameters to tune the AlgorithmGiven Pmistake,TfailTcleanupRate of gossiping, TgossipChoose Tfailso that erroneous failure detection -> PmistakeChoose Tcleanupto be double of TfailChoose Rate of gossiping depending on network bandwidth16Parameters to tune the Algorithmfailuretimet+Tfailt+Tcleanup=t+2*Tfailt17AnalysisSimplified AnalysisIn a round, only one member can gossip to another member with the success probability ParrivalBy using dynamic method, we get the required number of rounds to achieve a certain quality of detection.As # round increases, Pmistake(r) decreaseDetection Time : Multiply the # round when Pmistake(r) ≤ Pmistake, by Tgossip18AnalysisAs # members increases, the detection time increase419AnalysisAs requirement is loosened, the detection time decrease.20AnalysisAs # failed members increases, the detection time increase significantly21AnalysisThe algorithm is resilient to message loss22OutlineMotivationSystem ModelMechanism of the AlgorithmParameters to tune the AlgorithmAnalysisMulti-level GossipingDiscussionConclusion23Multi-level GossipingIn subnet i, with probability 1/ni24DiscussionWe might be able to use other gossip method to improve the performanceA hybrid of pull and push?Sending only new contents?It is consuming a lot of network resource.525ConclusionA failure detection based on a gossip protocol.Accurate with known probabilityResilient against message lossSimple analysis of this algorithmMulti-level GossipingOn Scalable and Efficient Distributed Failure DetectorsI. GuptaT. D. ChandraG. S. Goldszmidt27OutlineMotivation & Previous WorkProblem StatementWorst-case Network LoadRandomized Distributed Failure Detector ProtocolAnalysis and Experimental ResultDiscussionConclusion28Motivation & Previous WorkMost distributed applications rely on failure detector algorithms to avoid impossibility resultThe Heartbeating algorithms are not as efficient and scalable as claimedPrevious analysis model didn’t consider scalability.29Characterization & Approach{Strong/Weak} COMPLETENESSFailure of any member is detected by {all/some} non-faulty membersSTRONG ACCURACYno mistakeCannot achieve bothGuarantees completeness always and accuracy with high probability.30RequirementsCOMPLETENESSSPEEDEvery failure is detected by some non-faulty member within T time unitsACCURACYThe probability that non-faulty member(not yet detected as failed) is detected as faulty by any other non-faulty member should be less than PM(T)631System ModelA large group of n membersEach member knows each otherCrash (Non-Byzantine) FailureMessage loss rate pmlMember Failure Prebability Pfqml= (1 – pml)qf= (1 – pf)32OutlineMotivation & Previous WorkProblem StatementWorst-case Network LoadRandomized Distributed Failure Detector ProtocolAnalysis and Experimental ResultDiscussionConclusion33Worst-case Network LoadSCALEThe worst-case network load L is close to the optimal worst-case network load as possibleEqual expected load per memberDefinitionThe worst-case network load L of a failure detector protocol is the maximum number of messagestransmitted by any run of the protocol within any time interval of length T, divided by T34Any distributed failure detector algorithm imposes a minimal worst-case network load of , forGroup of size nSatisfying COMPLETENESS, SPEED, ACCURACYGiven values of T and PM(T)TpTPMnLml⋅⋅=)log())(log(*Optimal Worst-case Network Load35Optimal Worst-case Network LoadA group member at a random point in time t is not detected as failed yet and stays non-faulty until at least time t+TIt sends m messages for time TIf all messages are dropped, it need to be detected as failed (SPEED)Its probability should be less than PM(T))(TPMpmml≤)log())(log(mlpTPMm ≥36Worst-case Network

View Full Document