Failure Detection
Ken Birman, Cornell University. CS5410 Fall 2008.

Failure detection… vs. masking
- Failure detection: in some sense, the "weakest" approach
  - Assumes that failures are rare and localized
  - Develops a mechanism to detect faults with low rates of false positives (mistakenly calling a healthy node "faulty")
  - Challenge is to build a sensible "profile" of a faulty node
- Failure masking: "strong"
  - Idea here is to use a group of processes in such a way that, as long as the number of faults is below some threshold, progress can still be made
- Self-stabilization: "strongest"
  - Masks failures and repairs itself even after arbitrary faults

First, must decide what you mean by failure
- A system can fail in many ways
  - Crash (or halting) failure: silent, instant, clean
  - Sick: node is somehow damaged
  - Compromise: hacker takes over with malicious intent
- But that isn't all…

Also need to know what needs to work!
Figure: P2P connectivity issues between a client behind a firewall/NAT and Amazon.com over a slow link. Will live objects work here? Can I connect? Will IPMC work here, or do I need an overlay? Is my performance adequate (throughput, RTT, jitter)? Loss rate tolerable?

Missing data
- Today, distributed systems need to run in very challenging and unpredictable environments
- We don't have a standard way to specify the required performance and "quality of service" expectations
- So each application needs to test the environment in its own, specialized way
- Especially annoying in systems that have multiple setup options and perhaps could work around an issue
  - For example, multicast could be via IPMC or via an overlay

Needed?
- Application comes with a "quality of service contract"
- Presents it to some sort of management service
- That service studies the contract
  - Maps out the state of the network
  - Concludes: yes, I can implement this
  - Configures the application(s) appropriately
- Later: watches, and if conditions evolve, reconfigures the application nodes
- See Rick Schantz's QuO (Quality of Service for Objects) for more details on how this could work

Example
- Live objects within a corporate LAN
- End points need multicast… discover that IPMC is working and is the cheapest option
- Now someone joins from outside the firewall
  - System adapts: uses an overlay that runs IPMC within the LAN but tunnels via TCP to the remote node
- Add a new corporate LAN site that disallows IPMC
  - System adapts again: needs an overlay now…

Example
Figure: TCP tunnels create a WAN overlay; IPMC works within one LAN, while another site must use UDP.

Failure is a state transition
- Something that was working no longer works
  - For example, someone joins a group but IPMC can't reach this new member, so he'll experience 100% loss
- If we think of a working application as having a contract with the system (an implicit one), the contract was "violated" by a change of system state
- All of this is very ad hoc today; mostly we only use timeouts to sense
faults.

Hidden assumptions
- Failure detectors reflect many kinds of assumptions
  - Healthy behavior is assumed to have a simple profile
    - For example, all RPC requests trigger a reply within X ms
  - Typically, minimal "suspicion"
    - If a node sees what seems to be faulty behavior, it reports the problem and others trust it
    - Implicitly, the odds that the report is from a node that was itself faulty are assumed to be very low: if it looks like a fault to anyone, then it probably was a fault…
  - For example (and most commonly): timeouts

Timeouts: pros and cons
Pros:
- Easy to implement
- Already used in TCP
- Many kinds of problems manifest as severe slowdowns (memory leaks, faulty devices…)
- Real failures will usually render a service "silent"
Cons:
- Easily fooled
- Vogels: if your neighbor doesn't collect the mail at 1pm like she usually does, would you assume that she has died?
- Vogels: anyhow, what if a service hangs but low-level pings still work?

A "Vogels scenario" (one of many)
- Network outage causes the client to believe the server has crashed, and the server to believe the client is down
- Now imagine this happening to thousands of nodes all at once… triggering chaos

Vogels argues for sophistication
- Has been burned by situations in which network problems trigger a massive flood of "failure detections"
- Suggests that we should make more use of indirect information, such as:
  - Health of the routers and network infrastructure
  - If the remote O/S is still alive, can check its management information base
  - Could also require a "vote" within some group that all talk to the same service: if a majority agree that the service is faulty, the odds that it is faulty are way higher

Other side of the picture
- Implicit in Vogels' perspective is the view that failure is a real thing, an "event"
- Suppose my application is healthy, but my machine starts to thrash because of some other problem
  - Is my application "alive" or "faulty"?
- In a data center, normally, failure is a cheap thing to handle
- This perspective suggests that Vogels is
  - Right in his worries about the data-center-wide scenario
  - But too conservative in the normal case

Other side of the picture
- Imagine a buggy network application
  - Its low-level windowed acknowledgement layer is working well, and low-level communication is fine
  - But at the higher level, some thread took a lock but now is wedged and will never resume progress
- That application may respond to "are you ok?" with "yes, I'm absolutely fine"… yet it is actually dead!
- Suggests that applications should be more self-checking
- But this
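The "Needed?" and adaptation examples above describe a management service that matches an application's quality-of-service contract against the measured environment and picks a configuration (native IPMC vs. a TCP-tunnel overlay). A minimal sketch of that matching step, with an entirely hypothetical contract schema and probe format (the slides do not define one):

```python
from dataclasses import dataclass

@dataclass
class QosContract:
    # Illustrative fields only; not a real QuO contract schema.
    needs_multicast: bool
    max_loss_rate: float   # tolerable fraction of lost packets
    max_rtt_ms: float

@dataclass
class SiteProbe:
    # One measurement per participating site.
    ipmc_reachable: bool   # did a native IPMC probe get through?
    loss_rate: float
    rtt_ms: float

def choose_transport(contract: QosContract, sites: list[SiteProbe]) -> str:
    """Pick the cheapest configuration that satisfies the contract."""
    if any(p.loss_rate > contract.max_loss_rate or p.rtt_ms > contract.max_rtt_ms
           for p in sites):
        raise RuntimeError("contract cannot be met in this environment")
    if not contract.needs_multicast:
        return "unicast"
    # Native IPMC only if every site supports it; otherwise fall back
    # to tunneling through an overlay, as in the firewall example above.
    if all(p.ipmc_reachable for p in sites):
        return "ipmc"
    return "tcp-overlay"
```

Re-running `choose_transport` whenever probes change gives the "watch and reconfigure" behavior the slides call for: adding one site where IPMC is blocked flips the answer from "ipmc" to "tcp-overlay".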
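The timeout-based detection discussed above ("all RPC requests trigger a reply within X ms") can be sketched in a few lines. The class name, threshold, and injectable clock are assumptions for illustration, not anything from the lecture:

```python
import time

class TimeoutDetector:
    """Suspect a node if it hasn't been heard from within `timeout_s` seconds.

    Sketch of the classic timeout-based detector; note it inherits all the
    weaknesses listed under "cons" (a slow node looks identical to a dead one).
    """
    def __init__(self, timeout_s: float, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock                      # injectable for testing
        self.last_reply: dict[str, float] = {}

    def heard_from(self, node: str) -> None:
        # Call on every reply/heartbeat from `node`.
        self.last_reply[node] = self.clock()

    def suspects(self) -> set[str]:
        # Nodes silent for longer than the timeout.
        now = self.clock()
        return {n for n, t in self.last_reply.items()
                if now - t > self.timeout_s}
```

This is exactly the "minimal suspicion" design the slides criticize: one silent interval and the node is reported, with no second opinion.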
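Vogels' "vote within some group" idea above can be reduced to one decision rule: declare the service faulty only when a strict majority of the monitors that talk to it agree. A sketch (the function and its argument format are invented for illustration):

```python
def majority_suspect(votes: dict[str, bool]) -> bool:
    """Return True only if a strict majority of monitors report the
    service as faulty.

    `votes` maps monitor name -> "I believe the service is down".
    One confused or partitioned monitor cannot trigger a detection
    on its own, which is the point of the voting scheme.
    """
    if not votes:
        return False
    return sum(votes.values()) * 2 > len(votes)
```

Compared to the single-observer timeout, this trades detection latency (waiting to collect votes) for a much lower false-positive rate during localized network problems.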
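The wedged-thread example above is why low-level "are you ok?" pings mislead: the process answers, but the work loop makes no progress. One self-checking pattern is to have the health probe verify that a progress counter advances rather than merely that the process responds. A sketch under assumed names and intervals:

```python
import threading
import time

class SelfCheckingWorker:
    """Worker whose health check verifies *progress*, not mere liveness.

    A wedged work loop (e.g. blocked forever on a lock) would still answer
    a ping; here `healthy()` instead checks that the loop's tick counter
    keeps advancing. Names and the 3x-interval wait are assumptions.
    """
    def __init__(self, interval_s: float = 0.05):
        self.interval_s = interval_s
        self.ticks = 0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def _loop(self):
        while not self._stop.is_set():
            self.ticks += 1                  # stands in for real work
            time.sleep(self.interval_s)

    def healthy(self) -> bool:
        before = self.ticks
        time.sleep(self.interval_s * 3)      # give the loop time to advance
        return self.ticks > before           # a wedged loop shows no progress

    def stop(self):
        self._stop.set()
        self._thread.join()
```

A stopped (or deadlocked) worker fails `healthy()` even though the process itself would still respond to a transport-level ping.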

