Unformatted text preview:

Fault tolerance CS514 Intermediate Course in Operating Systems We ve been skirting the issue of faulttolerant distributed computing Professor Ken Birman Vivek Vishnumurthy TA Failure models Issues related to failures Who needs failure models How do systems fail Given a category of failures are there limits to what can we do about it The problem is that processes can fail in so many ways Real world studies of failure rates Experience with some big projects that failed Formal models of failure crash fail stop Byzantine A famous but confusing impossibility result Bohrbugs and Heisenbugs A categorization due to Bruce Lindsey Bohrbugs are dull boring debuggable bugs They happen every time you run the program and are easy to localize and fix using modern development tools If purify won t find it try binary search Heisenbugs are hard to pin down Often associated with threading or interrupts Frequently a data structure is damaged but this is only noticed much later Hence hard to reproduce and so hard to fix In mature programs Heisenbugs dominate Although scalability was also a motivation But in general what does it mean for a system to tolerate failures Today focus on failure models Today explore this issue Fault tolerance motivates us to use gossip protocols and similar mechanisms Hardware failures are rare but they happen Software bugs can cause a program to malfunction by crashing corrupting data or just failing to do its job Intruders might inject some form of failure to disrupt or compromise a system A failure detector could malfunction signaling a failure even though nothing is wrong Clean room development Idea is that to write code First the team develops a good specification and refines it to modules A primary coding group implements them Then the whole group participates in code review Then the primary group develops a comprehensive test suite and runs it Finally passes off to a Q A group that redoes these last stages code review testing Later upgrades require same form of Q A 1 Reality Depends very much on the language Why do systems fail With Java and C we get strong type checking and powerful tools to detect many kinds of mistakes Also clean abstraction boundaries But with C and C and Fortran we lack such tools The methodology tends to require good tools What can we do about it Better programming languages approaches and tools can help Who needs failure models Role of a failure model These are common in real systems Crash failures process simply stops and does nothing wrong that would be externally visible before it stops These faults can t be directly detected Lets us reduce fault tolerance to a mathematical question But we should anticipate that large systems will exhibit problems Failures are a side effect of using technology to solve complex problems Crash faults message loss Incorrect specifications e g the program just doesn t work in the first place Lingering Heisenbugs often papered over Administrative errors Unintended side effects of upgrades and bug fixes are dominant causes of failures For example shift from C to Java and C has been hugely beneficial Categories of failures Many studies of this issue suggest that In model M can problem P be solved How costly is it to do so What are the best solutions What tradeoffs arise And clarifies what we are saying Lacking a model confusion is common Categories of failures Fail stop failures These require system support Idea is that the process fails by crashing and the system notifies anyone who was talking to it With fail stop failures we can overcome message loss by just resending packets which must be uniquely numbered Easy to work with but rarely supported 2 Categories of failures Non malicious Byzantine failures This is the best way to understand many kinds of corruption and buggy behaviors Program can do pretty much anything including sending corrupted messages But it doesn t do so with the intention of screwing up our protocols We tend to work within two models Asynchronous model makes no assumptions about time But communication is usually assumed to be reliable any message sent at time t is delivered by time t Algorithms are often structured into rounds each lasting some fixed amount of time giving time for each process to communicate with every other process In this model a crash failure is easily detected Model is of an attacker who has studied the system and wants to break it She can corrupt or replay messages intercept them at will compromise programs and substitute hacked versions This is a worst case scenario mindset In practice doesn t actually happen Very costly to defend against typically used in very limited ways e g key mgt server Failures in the asynchronous model Network is assumed to be reliable But processes can fail Synchronous model assumes a lock step execution in which processes share a clock Here we also have processes and messages Malicious true Byzantine failures Processes have no clocks will wait indefinitely for messages could run arbitrarily fast slow Distributed computing at an eons timescale What about the synchronous model Unfortunately a pretty common mode of failure Recall Two kinds of models Categories of failure A failed process crashes it stops doing anything Notice that in this model a failed process is indistinguishable from a delayed process In fact the decision that something has failed takes on an arbitrary flavor Suppose that at point e in its execution process p decides to treat q as faulty Neither model is realistic Value of the asynchronous model is that it is so stripped down and simple If we can do something well in this model we can do at least as well in the real world So we ll want best solutions Value of the synchronous model is that it adds a lot of unrealistic mechanism If we can t solve a problem with all this help we probably can t solve it in a more realistic setting So seek impossibility results 3 Examples of results It is common to look at problems like agreeing on an ordering Can we implement a fault tolerant agreement protocol A surprising result Bivalent state S denotes bivalent state S0 denotes a decision 0 state S1 denotes a decision 1 state Many notions of consistency reduce to agreement on the events that occurred and their order Could imagine that our bit represents Whether or not a particular event took place Whether event A is the next event Thus fault tolerant consensus is deeply related to fault tolerant consistency Core of FLP result And this is


View Full Document

CORNELL CS 514 - Lecture Slides

Documents in this Course
LECTURE

LECTURE

29 pages

LECTURE

LECTURE

28 pages

Load more
Download Lecture Slides
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Lecture Slides and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Lecture Slides 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?