CORNELL CS 514 - CS 514 Lecture Notes

Slide titles:
CS514: Intermediate Course in Operating Systems
How and Why Systems Fail
Failure
Systems can be built in many ways
Clean-room concept
But systems still fail!
Bohrbugs and Heisenbugs
Programmer facing bugs
Lifecycle of Bohrbug
Lifecycle of a Heisenbug
How programmers fix a Bohrbug
How they fix Heisenbugs
Bug-free software is uncommon
Bugs in a typical distributed system
Tools can help
Leslie Lamport
Component Architectures Make it Worse
Reminder: Networks versus Distributed Systems
Component Systems?
Thought question
Example
Ariane Rocket
Insights?
Detecting failure
Example: pinging with timeouts
Consistent failure detection
Component failure detection
Vogels: the Failure Investigator
Further options: “Hot” button
Friedman’s approach
Broad picture?
Suggested solution?
Suggested Solution
Replication: the key technology
Basic issues with the approach
Client-Server issues
Client-server issues
Other approaches to consider
Logging and checkpoints
Byzantine approach
Hard practical problem
Coming next? (Oct. 12)

CS514: Intermediate Course in Operating Systems
Professor Ken Birman
Ben Atkin: TA
Lecture 15, Oct. 17

How and Why Systems Fail
• We’ve talked about
  – Transactional reliability
  – Replication for high availability
• But does this give us “fault-tolerant solutions?”
• How and why do real systems fail?
• Do real systems offer the hooks we’ll need to intervene?

Failure
• Failure is just one of the aspects of reliability, but it is clearly an important one
• To make a system fault-tolerant we need to understand how to detect failures and plan an appropriate response if a failure occurs
• This lecture focuses on how systems fail, how they can be “hardened”, and what still fails after doing so

Systems can be built in many ways
• Reliability is not always a major goal when development first starts
• Most systems evolve over time, through incremental changes with some rewriting
• Most reliable systems are entirely rewritten using clean-room techniques after they reach a mature stage of development

Clean-room concept
• Based on goal of using “best available” practice
• Requires good specifications
• Design reviews in teams
• Actual software also reviewed for correctness
• Extensive stress testing and code coverage testing, using tools like “Purify”
• Use of formal proof tools where practical

But systems still fail!
• Gray studied failures in Tandem systems
• Hardware was fault-tolerant and rarely caused failures
• Software bugs, environmental factors, human factors (user error), and incorrect specification were all major sources of failure

Bohrbugs and Heisenbugs
• Classification proposed by Bruce Lindsey
• Bohrbug: like the Bohr model of the nucleus: solid, easily reproduced, can track it down and fix it
• Heisenbug: like the Heisenberg nucleus: a diffuse cloud, very hard to pin down and hence fix
• Anita Borr and others have studied life-cycle bugs in complex software using this classification

Programmer facing bugs
• Heisenbug is fuzzy, hard to find/fix
• Bohrbug is solid, easy to recognize and fix?

Lifecycle of Bohrbug
• Usually introduced in some form of code change or in original design
• Often detected during thorough testing
• Once seen, easily fixed
• Remain a problem over the life-cycle of the software because of the need to extend the system or to correct other bugs
• Same input will reliably trigger the bug!

Lifecycle of Bohrbug
A Bohrbug is boring.

Lifecycle of a Heisenbug
• These are often side-effects of some other problem
• Example: bug corrupts a data structure or misuses a pointer. Damage is not noticed right away, but causes a crash much later when the structure is referenced
• Attempting to detect the bug may shift memory layout enough to change its symptoms!

How programmers fix a Bohrbug
• They develop a test scenario that triggers it
• Use a form of binary search to narrow in on it
• Pin down the bug and understand precisely what is wrong
• Correct the algorithm or the coding error
• Retest extensively to confirm that the bug is fixed

How they fix Heisenbugs
• They fix the symptom: periodically scan the structure that is usually corrupted and clean it up
• They add self-checking code (which may itself be a source of bugs)
• They develop theories of what is wrong and fix the theoretical problem, but lack a test to confirm that this eliminated the bug
• These bugs are extremely sensitive to event orders

Bug-free software is uncommon
• Heavily used software may become extremely reliable over its life (the C compiler rarely crashes, UNIX is pretty reliable by now)
• Large, complex systems depend upon so many components, many of them complex, that bug freedom is an unachievable goal
• Instead, adopt the view that bugs will happen and we should try to plan for them

Bugs in a typical distributed system
• Usual pattern: some component crashes or becomes partitioned away
• Other system components that depend on it freeze or crash too
• Chains of dependencies gradually cause more and more of the overall system to fail or freeze

Tools can help
• Everyone should use tools like “purify” (detects stray pointers, uninitialized variables and memory leaks)
• But these tools don’t help at the level of a distributed system
• Benefit of a model, like transactions or virtual synchrony, is that the model simplifies the developer’s task

Leslie Lamport
“A distributed system is one in which the failure of a machine you have never heard of can cause your own machine to become unusable”
• Issue is dependency on critical components
• Notion is that state and “health” of system at site A is linked to state and health at site B

Component Architectures Make it Worse
• Modern systems are structured using object-oriented component interfaces:
  – CORBA, COM (or DCOM), Jini
  – XML
• In these systems, we create a web of dependencies between components
• Any faulty component could cripple the system!

Reminder: Networks versus Distributed Systems
• Network focus is on connectivity but components are logically independent: program fetches a file and operates on it, but server is stateless and forgets the interaction
  – Less sophisticated but more robust?
• Distributed systems focus is on joint behavior of a set of logically related components. Can talk about “the system” as an entity.
  – But needs fancier failure handling!

Component Systems?
• These are distributed in the sense of our definition
  – Often, they share state between components
  – If a component fails,
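
The dependency-chain pattern described in the slides above (one component crashes, and everything that depends on it freezes or crashes too) can be illustrated with a small sketch. This is not from the lecture: the function name affected_by, the dependency map, and every component name in it are hypothetical, and the sketch assumes that a component stops working as soon as anything it depends on stops working.

```python
from collections import deque


def affected_by(failed, depends_on):
    """Return every component that transitively depends on `failed`.

    `depends_on` maps a component name to the list of components it
    needs in order to work (a hypothetical, simplified failure model).
    """
    # Invert the map: for each component, who depends on it directly?
    dependents = {}
    for component, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(component)

    # Breadth-first walk outward from the failed component.
    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for component in dependents.get(current, ()):
            if component != failed and component not in affected:
                affected.add(component)
                queue.append(component)
    return affected


# Hypothetical system: names are illustrative only.
depends_on = {
    "web_frontend": ["order_service"],
    "order_service": ["inventory_db", "name_server"],
    "billing": ["name_server"],
    "inventory_db": [],
    "name_server": [],
    "logger": [],
}

print(sorted(affected_by("name_server", depends_on)))
# prints ['billing', 'order_service', 'web_frontend']
```

In this made-up system, web_frontend never talks to name_server directly, yet it is still taken down through order_service, which is the situation the Lamport quote describes.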

