CS514: Intermediate Course in Operating Systems
Professor Ken Birman
Vivek Vishnumurthy: TA

Fault tolerance
- We’ve been skirting the issue of fault-tolerant distributed computing
- Fault tolerance motivates us to use gossip protocols and similar mechanisms
  - Although scalability was also a motivation
- But in general, what does it mean for a system to “tolerate” failures?
- Today: focus on failure models

Failure models
- Issues related to failures:
  - How do systems “fail”?
  - Given a category of failures, are there limits to what we can do about it?
- Today we explore this issue through:
  - Real-world studies of failure rates
  - Experience with some big projects that failed
  - Formal models of failure (crash, fail-stop, Byzantine)
  - A famous (but confusing) impossibility result

Who needs failure “models”?
- The problem is that processes can fail in so many ways:
  - Hardware failures are rare, but they happen
  - Software bugs can cause a program to malfunction by crashing, corrupting data, or just failing to “do its job”
  - Intruders
    might inject some form of failure to disrupt or compromise a system
  - A failure detector could malfunction, signaling a failure even though nothing is wrong

Bohrbugs and Heisenbugs
- A categorization due to Bruce Lindsey
- Bohrbugs are dull, boring, debuggable bugs
  - They happen every time you run the program, and are easy to localize and fix using modern development tools
  - If “Purify” won’t find it… try binary search
- Heisenbugs are hard to pin down
  - Often associated with threading or interrupts
  - Frequently a data structure is damaged, but this is only noticed much later
  - Hence hard to reproduce, and so hard to fix
- In mature programs, Heisenbugs dominate

Clean-room development
- The idea: to write code,
  - First, the team develops a good specification and refines it into modules
  - A primary coding group implements them
  - Then the whole group participates in code review
  - Then the primary group develops a comprehensive test suite and runs it
  - Finally, the code passes to a Q/A group that redoes these last stages (code review, testing)
- Later upgrades require the same form of Q/A!

Reality?
- Depends very much on the language
  - With Java and C# we get strong type checking and powerful tools to detect many kinds of mistakes
    - Also clean abstraction boundaries
  - But with C++, C, and Fortran, we lack such tools
- The methodology tends to require good tools

Why do systems fail?
- Many studies of this issue suggest that:
  - Incorrect specifications (e.g.
    the program just doesn’t “work” in the first place)
  - Lingering Heisenbugs, often papered over
  - Administrative errors
  - Unintended side-effects of upgrades and bug fixes
- … are the dominant causes of failures.

What can we do about it?
- Better programming languages, approaches, and tools can help
  - For example, the shift from C to Java and C# has been hugely beneficial
- But we should anticipate that large systems will exhibit problems
  - Failures are a side-effect of using technology to solve complex problems!

Who needs failure “models”?
- The role of a failure model:
  - It lets us reduce fault tolerance to a mathematical question:
    - In model M, can problem P be solved?
    - How costly is it to do so?
    - What are the best solutions?
    - What tradeoffs arise?
  - And it clarifies what we are saying
    - Lacking a model, confusion is common

Categories of failures
- Crash faults, message loss
  - These are common in real systems
  - Crash failures: the process simply stops, and does nothing wrong that would be externally visible before it stops
  - These faults can’t be directly detected

Categories of failures
- Fail-stop failures
  - These require system support
  - The idea is that the process fails by crashing, and the system notifies anyone who was talking to it
  - With fail-stop failures we can overcome message loss by just resending packets, which must be uniquely numbered
  - Easy to work with… but rarely supported

Categories of failures
- Non-malicious Byzantine failures
  - This is the best way to understand many kinds of corruption and buggy behaviors
  - The program can do pretty much anything, including sending corrupted messages
  - But it doesn’t do so with the intention of breaking our protocols
  - Unfortunately, a pretty common mode of failure

Categories of failures
- Malicious, true Byzantine failures
  - The model is of an attacker who has studied the system and wants to break it
  - She can corrupt or replay messages, intercept them at will, compromise programs, and substitute hacked versions
  - This is a worst-case-scenario mindset
    - In practice, it doesn’t actually happen
    - Very costly to defend against; typically used in very
      limited ways (e.g. a key management server)

Recall: Two kinds of models
- We tend to work within two models
- The asynchronous model makes no assumptions about time
  - Processes have no clocks, will wait indefinitely for messages, and could run arbitrarily fast or slow
  - Distributed computing at an “eons” timescale
- The synchronous model assumes a lock-step execution in which processes share a clock

Failures in the asynchronous model
- The network is assumed to be reliable, but processes can fail
  - A failed process “crashes”: it stops doing anything
- Notice that in this model, a failed process is indistinguishable from a delayed process
- In fact, the decision that something has failed takes on an arbitrary flavor
  - “Suppose that at point e in its execution, process p decides to treat q as faulty….”
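The point above, that a crashed process is indistinguishable from a merely slow one, can be pinned down with a small sketch. The class below is illustrative only (the name `HeartbeatDetector` and the `timeout` parameter are assumptions, not from the lecture): it suspects a process faulty when no heartbeat has arrived within a timeout, and in the asynchronous model such a suspicion is the most any detector can offer.

```python
import time


class HeartbeatDetector:
    """Timeout-based failure suspicion (illustrative sketch, not from the lecture).

    In the asynchronous model, a crashed process and a slow process look
    identical to an observer, so the most we can do is *suspect* a process
    that has been silent for longer than `timeout` seconds.
    """

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_heard = {}  # process id -> time its last heartbeat arrived

    def heartbeat(self, pid, now=None):
        """Record a heartbeat from process `pid` (clock injectable for testing)."""
        self.last_heard[pid] = time.monotonic() if now is None else now

    def suspected(self, pid, now=None):
        """True if `pid` was never heard from, or has been silent too long."""
        now = time.monotonic() if now is None else now
        last = self.last_heard.get(pid)
        return last is None or now - last > self.timeout
```

A suspicion can be wrong: a process suspected at one moment may deliver a heartbeat a moment later, which is exactly the arbitrariness the quoted scenario (p deciding to treat q as faulty at point e) is getting at.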
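The fail-stop slide earlier noted that message loss can be overcome by resending uniquely numbered packets. The stop-and-wait sketch below shows the idea under assumed names (`Sender`, `Receiver`, and the loss-simulating `drop_every` knob are all illustrative): the sender retransmits each packet until it is acknowledged, and the receiver uses sequence numbers to deliver every payload exactly once, discarding duplicates.

```python
class Receiver:
    """Delivers each uniquely numbered packet exactly once (illustrative sketch)."""

    def __init__(self):
        self.expected = 0    # next sequence number we will deliver
        self.delivered = []  # payloads delivered, in order

    def on_packet(self, seq, payload):
        """Deliver in-order packets; duplicates are discarded but still acked."""
        if seq == self.expected:
            self.delivered.append(payload)
            self.expected += 1
        return self.expected - 1  # cumulative ack: highest seq delivered


class Sender:
    """Stop-and-wait sender: resend each packet until it is acknowledged."""

    def __init__(self, receiver, drop_every=0):
        self.receiver = receiver
        self.seq = 0
        self.attempts = 0
        self.drop_every = drop_every  # simulate loss: drop every k-th transmission

    def send(self, payload):
        while True:
            self.attempts += 1
            if self.drop_every and self.attempts % self.drop_every == 0:
                continue  # packet lost in transit; resend
            ack = self.receiver.on_packet(self.seq, payload)
            if ack == self.seq:  # acknowledged: move to the next packet
                self.seq += 1
                return
```

With `drop_every=2`, every second transmission is lost, yet after sending "a", "b", "c" the receiver's `delivered` list is exactly `["a", "b", "c"]`: retransmission masks the losses, and the sequence numbers prevent duplicate delivery.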