U of I CS 525 - Advanced Distributed Systems - D2149236

School name University of Illinois

Course CS 525 - Advanced Distributed Systems

Pages 59

This preview shows page 1-2-3-4-27-28-29-30-56-57-58-59 out of 59 pages.

Slide titles:
CS 525 Advanced Distributed Systems Spring 09
Target Settings
Group Membership Service
Two sub-protocols
Large Group: Scalability A Goal
Group Membership Protocol
II. Failure Detector
III. Dissemination
I. pj crashes
II. Distributed Failure Detectors: Properties
Distributed Failure Detectors: Properties
What Real Failure Detectors Prefer
Failure Detector Properties
Centralized Heartbeating
Ring Heartbeating
All-to-All Heartbeating
Gossip-style Heartbeating
FD: Is this the “best” protocol?
Gossip-Style Failure Detection
Pictorially…
Analysis/Discussion
Simulations
Multi-level Gossiping
Failure Detector Properties …
…Are application-defined Requirements
What’s the Best/Optimal we can do?
Heartbeating
SWIM Failure Detector Protocol
SWIM versus Heartbeating
SWIM Failure Detector
Accuracy, Load
Detection Time
Dissemination Options
Infection-style Dissemination
Suspicion Mechanism
Time-bounded Completeness
Results from an Implementation
More discussion points
Questions
Randomized FD - Analysis

CS 525 Advanced Distributed Systems
Spring 09
Indranil Gupta (Indy)
Lecture 5
Failure Detectors and Membership
March 17, 2009

Target Settings
• Process ‘group’-based systems
– Clouds/Datacenters
– Replicated servers
– Distributed databases
• Crash-stop process failures

Group Membership Service
[Diagram: inside application process pi, application queries (e.g., gossip, DHTs) read the group membership list; the membership protocol maintains that list (joins, leaves, failures of members) over unreliable communication.]

Two sub-protocols
[Diagram: application process pi holds the group membership list, maintained by two sub-protocols, Dissemination and Failure Detector, which monitor pj over unreliable communication.]
• Almost-Complete list (focus of this talk)
– Virtual synchrony, Gossip-style, SWIM, …
• Or Partial-random list (other papers)
– SCAMP, T-MAN, Cyclon, …

Large Group: Scalability A Goal
[Diagram: a process group of “members” (1000’s of processes), including “this is us (pi)”, connected by a network with unreliable communication.]

Group Membership Protocol
[Diagram: I. pj crashed. II. Failure Detector: some process (pi) finds out quickly. III. Dissemination. Network with unreliable communication.]
• Crash-stop failures only
• HOW?

II. Failure Detector
[Diagram: pj crashed; some process (pi) finds out quickly via the failure detector, over a network with unreliable communication.]

III. Dissemination
[Diagram: pj crashed; the failure detector lets some process (pi) find out quickly; dissemination spreads the news over a network with unreliable communication. HOW?]

I. pj crashes
• Nothing we can do about it!
• A frequent occurrence
• Common case rather than exception

II. Distributed Failure Detectors: Properties
• Completeness = each failure is detected
• Accuracy = there is no mistaken detection
• Speed
– Time to first detection of a failure
• Scale
– Equal load on each member
– Network message load

Distributed Failure Detectors: Properties
• Completeness and Accuracy: impossible together in lossy networks [Chandra and Toueg]. Otherwise one could solve consensus!
• Speed
– Time to first detection of a failure
• Scale
– Equal load on each member
– Network message load

What Real Failure Detectors Prefer
• Completeness: guaranteed
• Accuracy: partial/probabilistic guarantee
• Speed
– Time to first detection of a failure
• Scale
– Equal load on each member
– Network message load

Failure Detector Properties
• Completeness: guaranteed
• Accuracy: partial/probabilistic guarantee
• Speed
– Time to first detection of a failure: time until some process detects the failure
• Scale
– Equal load on each member
– Network message load
– No bottlenecks/single failure point
• In spite of arbitrary simultaneous process failures

Centralized Heartbeating
[Diagram: each pi sends (pi, Heartbeat Seq. l++) to a central pj. Hotspot.]
• Heartbeats sent periodically
• If heartbeat not received from pi within timeout, mark pi as failed

Ring Heartbeating
[Diagram: pi sends (pi, Heartbeat Seq. l++) to pj along a ring.]
• Unpredictable on simultaneous multiple failures

All-to-All Heartbeating
[Diagram: pi sends (pi, Heartbeat Seq. l++) to every other member pj.]
• Equal load per member

Gossip-style Heartbeating
[Diagram: pi sends an array of Heartbeat Seq. l for a member subset.]
• Good accuracy properties (more soon!)

FD: Is this the “best” protocol?
• Most scalable?
• Most accurate?
• Detects failures quickest?
• Achieves best possible trade-off?
• What is the best possible trade-off?

Gossip-Style Failure Detection
[Diagram: nodes 1, 2, 3, 4. Each node gossips its list to others; when a node receives a list, it merges that list with its own.]

Address  Heartbeat Counter  Time
1        10120              66
2        10103              62
3        10098              63
4        10111              65

Gossip-Style Failure Detection
[Diagram: node 1 gossips its list to node 2, which merges it with its own list.]

List gossiped by node 1:
Address  Heartbeat Counter  Time
1        10120              66
2        10103              62
3        10098              63
4        10111              65

Node 2's list before the merge:
Address  Heartbeat Counter  Time
1        10118              64
2        10110              64
3        10090              58
4        10111              65

Node 2's list after the merge:
Address  Heartbeat Counter  Time
1        10120              70
2        10110              64
3        10098              70
4        10111              65

Current time: 70 at node 2 (asynchronous clocks)

Gossip-Style Failure Detection
• If the heartbeat has not increased for more than Tfail seconds, the member is considered failed
• And after Tcleanup seconds, it will delete the member from the list
• Why?

Gossip-Style Failure Detection
• What if an entry pointing to a failed node is deleted right after Tfail seconds?

List gossiped by node 1 (still contains node 3):
Address  Heartbeat Counter  Time
1        10120              66
2        10103              62
3        10098              55
4        10111              65

Node 2's list before deleting node 3:
Address  Heartbeat Counter  Time
1        10120              66
2        10110              64
3        10098              50
4        10111              65

Node 2's list after deleting node 3:
Address  Heartbeat Counter  Time
1        10120              66
2        10110              64
4        10111              65

Node 2's list after merging the gossiped list (node 3 reappears):
Address  Heartbeat Counter  Time
1        10120              66
2        10110              64
3        10098              75
4        10111              65

Current time: 75 at node 2

Pictorially…
[Timeline: failure at time t; t + Tfail; t + Tcleanup = t + 2*Tfail.]

Analysis/Discussion
• What happens if gossip period Tgossip is decreased?
• A single heartbeat takes O(log(N)) time to propagate
• N heartbeats take:
– O(log(N)) time to propagate if the bandwidth allowed per node is O(N)
– O(N.log(N)) time to propagate if the bandwidth allowed per node is only O(1)
– What about O(k) bandwidth?
• What happens to Pmistake (false positive rate) as Tfail, Tcleanup is increased?
• Tradeoff: False positive rate vs. detection time

Simulations
• As # members increases, the detection time increases
• As requirement is loosened, the detection time decreases
• As # failed members increases, the detection time increases significantly
• The algorithm is resilient to
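
A minimal Python sketch of the merge and timeout rules from the Gossip-Style Failure Detection slides, under stated assumptions. The names Entry, merge, and sweep are illustrative and do not come from the lecture. The sketch assumes each node stamps entries with its own local clock, and that Tfail and Tcleanup are both measured from the last time a heartbeat counter increased. The sample values are the node 2 tables at local time 70.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    heartbeat: int   # highest heartbeat counter seen for this member
    local_time: int  # local clock reading when that counter last increased


def merge(local, received, now):
    """Merge a gossiped {address: heartbeat} list into the local table.

    An entry is refreshed only when the received counter is strictly
    higher; the timestamp is taken from the receiver's clock (assumption).
    """
    for addr, heartbeat in received.items():
        entry = local.get(addr)
        if entry is None or heartbeat > entry.heartbeat:
            local[addr] = Entry(heartbeat, now)


def sweep(local, now, t_fail, t_cleanup):
    """Return members currently considered failed.

    Members silent for more than t_cleanup are deleted from the table;
    members silent for more than t_fail are reported but kept.
    """
    failed = set()
    for addr in list(local):
        silent = now - local[addr].local_time
        if silent > t_cleanup:
            del local[addr]
        elif silent > t_fail:
            failed.add(addr)
    return failed


# Node 2's table before the merge, and the list gossiped by node 1.
node2 = {
    1: Entry(10118, 64),
    2: Entry(10110, 64),
    3: Entry(10090, 58),
    4: Entry(10111, 65),
}
from_node1 = {1: 10120, 2: 10103, 3: 10098, 4: 10111}
merge(node2, from_node1, now=70)
```

In this sketch a failed entry stays in the table until t_cleanup has passed, so a stale gossip message carrying the same counter does not refresh it. That is one reading of why the slides separate Tfail from Tcleanup. A real node would also increment its own counter before each gossip round, which is omitted here.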
