CS514: Intermediate Course in Operating Systems
Professor Ken Birman
Vivek Vishnumurthy: TA

Fault tolerance
- We’ve been skirting the issue of fault-tolerant distributed computing
- Fault tolerance motivates us to use gossip protocols and similar mechanisms
  - Although scalability was also a motivation
- But in general, what does it mean for a system to “tolerate” failures?
- Today: focus on failure models

Failure models
- Issues related to failures:
  - How do systems “fail”?
  - Given a category of failures, are there limits to what we can do about it?
- Today we explore this issue through:
  - Real-world studies of failure rates
  - Experience with some big projects that failed
  - Formal models of failure (crash, fail-stop, Byzantine)
  - A famous (but confusing) impossibility result

Who needs failure “models”?
- The problem is that processes can fail in so many ways:
  - Hardware failures are rare, but they happen
  - Software bugs can cause a program to malfunction by crashing, corrupting data, or just failing to “do its job”
  - Intruders
    might inject some form of failure to disrupt or compromise a system
  - A failure detector could malfunction, signaling a failure even though nothing is wrong

Bohrbugs and Heisenbugs
- A categorization due to Bruce Lindsey
- Bohrbugs are dull, boring, debuggable bugs
  - They happen every time you run the program, and are easy to localize and fix using modern development tools
  - If “Purify” won’t find it… try binary search
- Heisenbugs are hard to pin down
  - Often associated with threading or interrupts
  - Frequently a data structure is damaged, but this is only noticed much later
  - Hence hard to reproduce, and so hard to fix
- In mature programs, Heisenbugs dominate

Clean-room development
- The idea: to write code,
  - First, the team develops a good specification and refines it into modules
  - A primary coding group implements them
  - Then the whole group participates in code review
  - Then the primary group develops a comprehensive test suite and runs it
  - Finally, the code passes to a Q/A group that redoes these last stages (code review, testing)
- Later upgrades require the same form of Q/A!

Reality?
- Depends very much on the language
  - With Java and C# we get strong type checking and powerful tools to detect many kinds of mistakes
    - Also clean abstraction boundaries
  - But with C++, C, and Fortran, we lack such tools
- The methodology tends to require good tools

Why do systems fail?
- Many studies of this issue suggest that:
  - Incorrect specifications (e.g.
    the program just doesn’t “work” in the first place)
  - Lingering Heisenbugs, often papered over
  - Administrative errors
  - Unintended side-effects of upgrades and bug fixes
- … are the dominant causes of failures.

What can we do about it?
- Better programming languages, approaches, and tools can help
  - For example, the shift from C to Java and C# has been hugely beneficial
- But we should anticipate that large systems will exhibit problems
  - Failures are a side-effect of using technology to solve complex problems!

Who needs failure “models”?
- The role of a failure model:
  - It lets us reduce fault tolerance to a mathematical question:
    - In model M, can problem P be solved?
    - How costly is it to do so?
    - What are the best solutions?
    - What tradeoffs arise?
  - And it clarifies what we are saying
    - Lacking a model, confusion is common

Categories of failures
- Crash faults, message loss
  - These are common in real systems
  - Crash failures: the process simply stops, and does nothing wrong that would be externally visible before it stops
  - These faults can’t be directly detected

Categories of failures
- Fail-stop failures
  - These require system support
  - The idea is that the process fails by crashing, and the system notifies anyone who was talking to it
  - With fail-stop failures we can overcome message loss by just resending packets, which must be uniquely numbered
  - Easy to work with… but rarely supported

Categories of failures
- Non-malicious Byzantine failures
  - This is the best way to understand many kinds of corruption and buggy behaviors
  - The program can do pretty much anything, including sending corrupted messages
  - But it doesn’t do so with the intention of breaking our protocols
  - Unfortunately, a pretty common mode of failure

Categories of failures
- Malicious, true Byzantine failures
  - The model is of an attacker who has studied the system and wants to break it
  - She can corrupt or replay messages, intercept them at will, compromise programs, and substitute hacked versions
  - This is a worst-case-scenario mindset
    - In practice, it doesn’t actually happen
    - Very costly to defend against; typically used in very
      limited ways (e.g. a key management server)

Recall: Two kinds of models
- We tend to work within two models
- The asynchronous model makes no assumptions about time
  - Processes have no clocks, will wait indefinitely for messages, and could run arbitrarily fast or slow
  - Distributed computing at an “eons” timescale
- The synchronous model assumes a lock-step execution in which processes share a clock

Failures in the asynchronous model
- The network is assumed to be reliable, but processes can fail
  - A failed process “crashes”: it stops doing anything
- Notice that in this model, a failed process is indistinguishable from a delayed process
- In fact, the decision that something has failed takes on an arbitrary flavor
  - “Suppose that at point e in its execution, process p decides to treat q as faulty….”
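The point above, that a crashed process is indistinguishable from a merely slow one, can be pinned down with a small sketch. The class below is illustrative only (the name `HeartbeatDetector` and the `timeout` parameter are assumptions, not from the lecture): it suspects a process faulty when no heartbeat has arrived within a timeout, and in the asynchronous model such a suspicion is the most any detector can offer.

```python
import time


class HeartbeatDetector:
    """Timeout-based failure suspicion (illustrative sketch, not from the lecture).

    In the asynchronous model, a crashed process and a slow process look
    identical to an observer, so the most we can do is *suspect* a process
    that has been silent for longer than `timeout` seconds.
    """

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_heard = {}  # process id -> time its last heartbeat arrived

    def heartbeat(self, pid, now=None):
        """Record a heartbeat from process `pid` (clock injectable for testing)."""
        self.last_heard[pid] = time.monotonic() if now is None else now

    def suspected(self, pid, now=None):
        """True if `pid` was never heard from, or has been silent too long."""
        now = time.monotonic() if now is None else now
        last = self.last_heard.get(pid)
        return last is None or now - last > self.timeout
```

A suspicion can be wrong: a process suspected at one moment may deliver a heartbeat a moment later, which is exactly the arbitrariness the quoted scenario (p deciding to treat q as faulty at point e) is getting at.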
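The fail-stop slide earlier noted that message loss can be overcome by resending uniquely numbered packets. The stop-and-wait sketch below shows the idea under assumed names (`Sender`, `Receiver`, and the loss-simulating `drop_every` knob are all illustrative): the sender retransmits each packet until it is acknowledged, and the receiver uses sequence numbers to deliver every payload exactly once, discarding duplicates.

```python
class Receiver:
    """Delivers each uniquely numbered packet exactly once (illustrative sketch)."""

    def __init__(self):
        self.expected = 0    # next sequence number we will deliver
        self.delivered = []  # payloads delivered, in order

    def on_packet(self, seq, payload):
        """Deliver in-order packets; duplicates are discarded but still acked."""
        if seq == self.expected:
            self.delivered.append(payload)
            self.expected += 1
        return self.expected - 1  # cumulative ack: highest seq delivered


class Sender:
    """Stop-and-wait sender: resend each packet until it is acknowledged."""

    def __init__(self, receiver, drop_every=0):
        self.receiver = receiver
        self.seq = 0
        self.attempts = 0
        self.drop_every = drop_every  # simulate loss: drop every k-th transmission

    def send(self, payload):
        while True:
            self.attempts += 1
            if self.drop_every and self.attempts % self.drop_every == 0:
                continue  # packet lost in transit; resend
            ack = self.receiver.on_packet(self.seq, payload)
            if ack == self.seq:  # acknowledged: move to the next packet
                self.seq += 1
                return
```

With `drop_every=2`, every second transmission is lost, yet after sending "a", "b", "c" the receiver's `delivered` list is exactly `["a", "b", "c"]`: retransmission masks the losses, and the sequence numbers prevent duplicate delivery.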