CS514: Intermediate Course in Operating Systems
Professor Ken Birman
Vivek Vishnumurthy: TA

Outline
- Fault-tolerance concern
- Failure models
- Who needs failure "models"?
- Bohrbugs and Heisenbugs
- Clean-room development
- Reality?
- Why do systems fail?
- What can we do about it?
- Categories of failures
- Models of failure
- Recall: Two kinds of models
- Adding failures in Lamport's model
- What about the synchronous model?
- Neither model is realistic
- Examples of results
- Connection to consistency
- Fischer, Lynch and Paterson
- Core of FLP result
- Bivalent state
- Core of FLP result in words
- But how did they "really" do it?
- Intuition behind this result?
- But what did "impossibility" mean?
- Recap
- Tougher failure models
- Here the situation is much harder
- Byzantine scenario
- A timeline perspective
- What can the traitor do?
- Outcomes?
- What can we do?
- Digital signatures
- With such a system
- Traitor is stymied
- Recent work with the Byzantine model
- Byzantine Quorums
- Split secrets
- How split secrets work
- Byzantine Broadcast (BB)
- Pros and cons to BB
- Take-aways?

Fault-tolerance concern
- Our reason for discussing transactions was to improve fault-tolerance
  - They ensured that a server can restart in a cleaned-up state
- But sometimes we need high availability, and yet we want consistent behavior
- What issues arise if things can fail and we can't wait for a restart, or "abort" what we were doing?

Failure models
- Issues related to failures
  - How do systems "fail"?
  - Given a category of failures, are there limits to what we can do about it?
- Today we explore this issue
  - Real-world studies of failure rates
  - Experience with some big projects that failed
  - Formal models of failure (crash, fail-stop, Byzantine)
  - A famous (but confusing) impossibility result

Who needs failure "models"?
- The problem is that processes can fail in so many ways
  - Hardware failures are rare, but they happen
  - Software bugs can cause a program to malfunction by crashing,
    corrupting data, or just failing to "do its job"
  - Intruders might inject some form of failure to disrupt or compromise a system
  - A failure detector could malfunction, signaling a failure even though nothing is wrong

Bohrbugs and Heisenbugs
- A categorization due to Bruce Lindsay
- Bohrbugs are dull, boring, debuggable bugs
  - They happen every time you run the program and are easy to localize and fix using modern development tools
  - If Purify won't find it… try binary search
- Heisenbugs are hard to pin down
  - Often associated with threading or interrupts
  - Frequently a data structure is damaged, but this is only noticed much later
  - Hence they are hard to reproduce, and so hard to fix
- In mature programs, Heisenbugs dominate

Clean-room development
- The idea is to write code as follows:
  - First, the team develops a good specification and refines it into modules
  - A primary coding group implements them
  - Then the whole group participates in code review
  - Then the primary group develops a comprehensive test suite and runs it
  - Finally, the code passes to a Q/A group that redoes these last stages (code review, testing)
- Later upgrades require the same form of Q/A!

Reality?
- Depends very much on the language
  - With Java and C# we get strong type checking and powerful tools that detect many kinds of mistakes
  - Also clean abstraction boundaries
  - But with C++, C, and Fortran we lack such tools
- The methodology tends to require good tools

Why do systems fail?
- Many studies of this issue suggest that
  - Incorrect specifications (e.g.,
    the program just doesn't "work" in the first place)
  - Lingering Heisenbugs, often papered over
  - Administrative errors
  - Unintended side-effects of upgrades and bug fixes
- … are dominant causes of failures.

What can we do about it?
- Better programming languages, approaches, and tools can help
  - For example, the shift from C to Java and C# has been hugely beneficial
- But we should anticipate that large systems will exhibit problems
- Failures are a side-effect of using technology to solve complex problems!

Who needs failure "models"?
- The role of a failure model
  - It lets us reduce fault-tolerance to a mathematical question
    - In model M, can problem P be solved?
    - How costly is it to do so?
    - What are the best solutions?
    - What tradeoffs arise?
  - And it clarifies what we are saying
    - Lacking a model, confusion is common

Categories of failures
- Crash faults, message loss
  - These are common in real systems
  - Crash failure: the process simply stops, and does nothing wrong that would be externally visible before it stops
  - These faults can't be directly detected

Categories of failures
- Fail-stop failures
  - These require system support
  - The idea is that the process fails by crashing, and the system notifies anyone who was talking to it
  - With fail-stop failures we can overcome message loss by just resending packets, which must be uniquely numbered
  - Easy to work with… but rarely supported

Categories of failures
- Non-malicious Byzantine failures
  - This is the best way to understand many kinds of corruption and buggy behaviors
  - The program can do pretty much anything, including sending corrupted messages
  - But it doesn't do so with the intention of breaking our protocols
  - Unfortunately, a pretty common mode of failure

Categories of failure
- Malicious, true Byzantine failures
  - The model is of an attacker who has studied the system and wants to break it
  - She can corrupt or replay messages, intercept them at will, compromise programs, and substitute hacked versions
  - This is a worst-case-scenario mindset
    - In practice, it doesn't actually happen
    - Very costly to defend against; typically used in very
      limited ways (e.g., a key management server)

Models of failure
- The question here concerns how failures appear in the formal models used when proving things about protocols
- Think back to Lamport's happens-before relationship
  - The model already has processes, messages, and a temporal ordering
  - It assumes messages are reliably delivered

Recall: Two kinds of models
- We tend to work within two models
- The asynchronous model makes no assumptions about time
  - Lamport's model is a good fit
  - Processes have no clocks, will wait indefinitely for messages, and could run arbitrarily fast or slow
  - Distributed computing at an "eons" timescale
- The synchronous model assumes a lock-step execution in which processes share a clock

Adding failures in Lamport's model
- Also called the asynchronous model
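The happens-before relationship the last slides lean on can be made concrete with Lamport's logical clocks. The sketch below is illustrative, not from the lecture; the class name `LamportClock` and the two-process usage are invented for the example.

```python
# Illustrative sketch of Lamport logical clocks, which realize the
# happens-before ordering discussed in the slides. Names are invented
# for this example.

class LamportClock:
    """A per-process counter that respects the happens-before relation."""

    def __init__(self):
        self.time = 0

    def local_event(self):
        # Rule 1: each local event advances the clock.
        self.time += 1
        return self.time

    def send(self):
        # A send is a local event; its timestamp travels with the message.
        self.time += 1
        return self.time

    def receive(self, msg_time):
        # Rule 2: on receipt, jump past the sender's timestamp, so the
        # receive event is ordered after the send event.
        self.time = max(self.time, msg_time) + 1
        return self.time


# Usage: p sends to q; q's receive is timestamped after p's send.
p, q = LamportClock(), LamportClock()
t_send = p.send()           # p's clock advances 0 -> 1
t_recv = q.receive(t_send)  # q's clock becomes max(0, 1) + 1 = 2
assert t_send < t_recv      # send happens-before receive
```

Note what the model does *not* give us, which is why the slides call it an "eons" timescale: the clock orders events, but nothing bounds how long a process may wait for a message.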
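The fail-stop slide notes that message loss can be masked by resending uniquely numbered packets. Below is a minimal stop-and-wait sketch of that idea; the `Receiver`, `LossyChannel`, and `send_reliably` names, and the "drop the first copy of every packet" channel, are assumptions made for illustration.

```python
# Minimal sketch of masking message loss with uniquely numbered,
# retransmitted packets, per the fail-stop slide. The lossy-channel
# model (first copy of each packet is dropped) is invented for the demo.

class Receiver:
    def __init__(self):
        self.delivered = []
        self.expected = 0   # next sequence number we want

    def on_packet(self, seq, payload):
        # Sequence numbers let us discard duplicate retransmissions.
        if seq == self.expected:
            self.delivered.append(payload)
            self.expected += 1
        return self.expected  # cumulative acknowledgment


class LossyChannel:
    """Drops the first copy of every packet; later copies get through."""

    def __init__(self, receiver):
        self.receiver = receiver
        self.seen = set()

    def transmit(self, seq, payload):
        if seq not in self.seen:
            self.seen.add(seq)  # first attempt is lost in transit
            return None
        return self.receiver.on_packet(seq, payload)


def send_reliably(messages, channel):
    """Retransmit each uniquely numbered packet until it is acknowledged."""
    for seq, payload in enumerate(messages):
        ack = None
        while ack is None or ack <= seq:
            ack = channel.transmit(seq, payload)


r = Receiver()
send_reliably(["a", "b", "c"], LossyChannel(r))
assert r.delivered == ["a", "b", "c"]
```

This only works because fail-stop guarantees the peer either responds or is reported as crashed; with plain crash faults, the sender cannot tell a slow receiver from a dead one, which is exactly the gap the failure models above are meant to expose.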