CS514: Intermediate Course in Operating Systems
Professor Ken Birman
Vivek Vishnumurthy: TA

Outline
- Fault-tolerance concern
- Failure models
- Who needs failure "models"?
- Bohrbugs and Heisenbugs
- Clean-room development
- Reality?
- Why do systems fail?
- What can we do about it?
- Categories of failures
- Models of failure
- Recall: Two kinds of models
- Adding failures in Lamport's model
- What about the synchronous model?
- Neither model is realistic
- Examples of results
- Connection to consistency
- Fischer, Lynch and Paterson
- Core of FLP result
- Bivalent state
- Core of FLP result in words
- But how did they "really" do it?
- Intuition behind this result?
- But what did "impossibility" mean?
- Recap
- Tougher failure models
- Here the situation is much harder
- Byzantine scenario
- A timeline perspective
- What can the traitor do?
- Outcomes?
- What can we do?
- Digital signatures
- With such a system
- Traitor is stymied
- Recent work with the Byzantine model
- Byzantine Quorums
- Split secrets
- How split secrets work
- Byzantine Broadcast (BB)
- Pros and cons to BB
- Take-aways?

Fault-tolerance concern
- Our reason for discussing transactions was to improve fault-tolerance
  - They ensured that a server can restart in a cleaned-up state
- But sometimes we need high availability, and yet we want consistent behavior
- What issues arise if things can fail and we can't wait for a restart, or "abort" what we were doing?

Failure models
- Issues related to failures
  - How do systems "fail"?
  - Given a category of failures, are there limits to what we can do about it?
- Today we explore this issue
  - Real-world studies of failure rates
  - Experience with some big projects that failed
  - Formal models of failure (crash, fail-stop, Byzantine)
  - A famous (but confusing) impossibility result

Who needs failure "models"?
- The problem is that processes can fail in so many ways
  - Hardware failures are rare, but they happen
  - Software bugs can cause a program to malfunction by crashing,
    corrupting data, or just failing to "do its job"
  - Intruders might inject some form of failure to disrupt or compromise a system
  - A failure detector could malfunction, signaling a failure even though nothing is wrong

Bohrbugs and Heisenbugs
- A categorization due to Bruce Lindsay
- Bohrbugs are dull, boring, debuggable bugs
  - They happen every time you run the program and are easy to localize and fix using modern development tools
  - If Purify won't find it… try binary search
- Heisenbugs are hard to pin down
  - Often associated with threading or interrupts
  - Frequently a data structure is damaged, but this is only noticed much later
  - Hence they are hard to reproduce, and so hard to fix
- In mature programs, Heisenbugs dominate

Clean-room development
- The idea is to write code as follows:
  - First, the team develops a good specification and refines it into modules
  - A primary coding group implements them
  - Then the whole group participates in code review
  - Then the primary group develops a comprehensive test suite and runs it
  - Finally, the code passes to a Q/A group that redoes these last stages (code review, testing)
- Later upgrades require the same form of Q/A!

Reality?
- Depends very much on the language
  - With Java and C# we get strong type checking and powerful tools that detect many kinds of mistakes
  - Also clean abstraction boundaries
  - But with C++, C, and Fortran we lack such tools
- The methodology tends to require good tools

Why do systems fail?
- Many studies of this issue suggest that
  - Incorrect specifications (e.g.,
    the program just doesn't "work" in the first place)
  - Lingering Heisenbugs, often papered over
  - Administrative errors
  - Unintended side-effects of upgrades and bug fixes
- … are dominant causes of failures.

What can we do about it?
- Better programming languages, approaches, and tools can help
  - For example, the shift from C to Java and C# has been hugely beneficial
- But we should anticipate that large systems will exhibit problems
- Failures are a side-effect of using technology to solve complex problems!

Who needs failure "models"?
- The role of a failure model
  - It lets us reduce fault-tolerance to a mathematical question
    - In model M, can problem P be solved?
    - How costly is it to do so?
    - What are the best solutions?
    - What tradeoffs arise?
  - And it clarifies what we are saying
    - Lacking a model, confusion is common

Categories of failures
- Crash faults, message loss
  - These are common in real systems
  - Crash failure: the process simply stops, and does nothing wrong that would be externally visible before it stops
  - These faults can't be directly detected

Categories of failures
- Fail-stop failures
  - These require system support
  - The idea is that the process fails by crashing, and the system notifies anyone who was talking to it
  - With fail-stop failures we can overcome message loss by just resending packets, which must be uniquely numbered
  - Easy to work with… but rarely supported

Categories of failures
- Non-malicious Byzantine failures
  - This is the best way to understand many kinds of corruption and buggy behaviors
  - The program can do pretty much anything, including sending corrupted messages
  - But it doesn't do so with the intention of breaking our protocols
  - Unfortunately, a pretty common mode of failure

Categories of failure
- Malicious, true Byzantine failures
  - The model is of an attacker who has studied the system and wants to break it
  - She can corrupt or replay messages, intercept them at will, compromise programs, and substitute hacked versions
  - This is a worst-case-scenario mindset
    - In practice, it doesn't actually happen
    - Very costly to defend against; typically used in very
      limited ways (e.g., a key management server)

Models of failure
- The question here concerns how failures appear in the formal models used when proving things about protocols
- Think back to Lamport's happens-before relationship
  - The model already has processes, messages, and a temporal ordering
  - It assumes messages are reliably delivered

Recall: Two kinds of models
- We tend to work within two models
- The asynchronous model makes no assumptions about time
  - Lamport's model is a good fit
  - Processes have no clocks, will wait indefinitely for messages, and could run arbitrarily fast or slow
  - Distributed computing at an "eons" timescale
- The synchronous model assumes a lock-step execution in which processes share a clock

Adding failures in Lamport's model
- Also called the asynchronous model
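The happens-before relationship the last slides lean on can be made concrete with Lamport's logical clocks. The sketch below is illustrative, not from the lecture; the class name `LamportClock` and the two-process usage are invented for the example.

```python
# Illustrative sketch of Lamport logical clocks, which realize the
# happens-before ordering discussed in the slides. Names are invented
# for this example.

class LamportClock:
    """A per-process counter that respects the happens-before relation."""

    def __init__(self):
        self.time = 0

    def local_event(self):
        # Rule 1: each local event advances the clock.
        self.time += 1
        return self.time

    def send(self):
        # A send is a local event; its timestamp travels with the message.
        self.time += 1
        return self.time

    def receive(self, msg_time):
        # Rule 2: on receipt, jump past the sender's timestamp, so the
        # receive event is ordered after the send event.
        self.time = max(self.time, msg_time) + 1
        return self.time


# Usage: p sends to q; q's receive is timestamped after p's send.
p, q = LamportClock(), LamportClock()
t_send = p.send()           # p's clock advances 0 -> 1
t_recv = q.receive(t_send)  # q's clock becomes max(0, 1) + 1 = 2
assert t_send < t_recv      # send happens-before receive
```

Note what the model does *not* give us, which is why the slides call it an "eons" timescale: the clock orders events, but nothing bounds how long a process may wait for a message.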
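The fail-stop slide notes that message loss can be masked by resending uniquely numbered packets. Below is a minimal stop-and-wait sketch of that idea; the `Receiver`, `LossyChannel`, and `send_reliably` names, and the "drop the first copy of every packet" channel, are assumptions made for illustration.

```python
# Minimal sketch of masking message loss with uniquely numbered,
# retransmitted packets, per the fail-stop slide. The lossy-channel
# model (first copy of each packet is dropped) is invented for the demo.

class Receiver:
    def __init__(self):
        self.delivered = []
        self.expected = 0   # next sequence number we want

    def on_packet(self, seq, payload):
        # Sequence numbers let us discard duplicate retransmissions.
        if seq == self.expected:
            self.delivered.append(payload)
            self.expected += 1
        return self.expected  # cumulative acknowledgment


class LossyChannel:
    """Drops the first copy of every packet; later copies get through."""

    def __init__(self, receiver):
        self.receiver = receiver
        self.seen = set()

    def transmit(self, seq, payload):
        if seq not in self.seen:
            self.seen.add(seq)  # first attempt is lost in transit
            return None
        return self.receiver.on_packet(seq, payload)


def send_reliably(messages, channel):
    """Retransmit each uniquely numbered packet until it is acknowledged."""
    for seq, payload in enumerate(messages):
        ack = None
        while ack is None or ack <= seq:
            ack = channel.transmit(seq, payload)


r = Receiver()
send_reliably(["a", "b", "c"], LossyChannel(r))
assert r.delivered == ["a", "b", "c"]
```

This only works because fail-stop guarantees the peer either responds or is reported as crashed; with plain crash faults, the sender cannot tell a slow receiver from a dead one, which is exactly the gap the failure models above are meant to expose.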