MSU CSE 870 - Software Fault-Tolerance

Slide titles:
Software Fault-Tolerance
Motivating Questions
Outline
Fault
Relation Between Faults, Errors, and Failures
Fault Types
Fault Types - Continued
Nature of Faults
Program Observation of Faults
Fault-Tolerance
Fault-Tolerance - Continued
Slide 12
Software Fault-Tolerance Complexity
Fault-Tolerance – A Cross-Cutting Concern
PowerPoint Presentation
Design Approaches
Recovery Blocks
Recovery Blocks Syntax
Recovery Blocks - Example
Recovery Blocks - Alternates
Providing Reset in Recovery Blocks
Recovery Blocks and Interacting Processes
Recovery Blocks and Interacting Processes - Continued
Slide 24
Checkpointing and Recovery
Uncoordinated Checkpointing
Coordinated Checkpointing
Communication-Induced Checkpointing
Implementation of Checkpointing-Recovery
Log-Based Recovery
Slide 31
State Machine Approach
Replicated Server
Replica Management Protocol
Summary
Summary - Continued

Slide 1: Software Fault-Tolerance
Ali Ebnenasir
Computer Science and Engineering Department
Michigan State University
U.S.A.

Slide 2: Motivating Questions
1. What are faults?
2. What is fault-tolerance?
3. What is the difference between software fault-tolerance and hardware fault-tolerance?
4. Why do we need to give special consideration to software fault-tolerance?
5. Who should care about it? (Analyst/Designer/Programmer?)
6. How do we ensure that a system tolerates faults?
After this lecture, you should have a clear idea of how to address the above questions.

Slide 3: Outline
• Basic concepts
  – Faults, errors, failures
  – Types and nature of faults
• Challenges in software fault-tolerance
• Fault-tolerance mechanisms
  – Recovery blocks
  – Checkpointing & recovery
    • Non-Transparent & Transparent Approaches
  – State machine approach
• A fundamental theory of fault-tolerance
• Component-based design of fault-tolerance
• Verification and synthesis of fault-tolerance

Slide 4: Fault
• An event in the physical domain of a system
  – Component failure in hardware systems
  – Divide by zero
  – A wire is stuck at a fixed voltage
  – A process restarts
  – A message is lost in the communication channel
  – A process occasionally misses a message in communicating with others
  – A process behaves arbitrarily
  – An input sensor is corrupted
  – Load surges in the network
How about design inadequacies? (s/w, h/w)

Slide 5: Relation Between Faults, Errors, and Failures
• A fault causes an internal error state in the information domain
  – E.g., a process restarts and resets the value of all variables to zero
• Error states cause the observable system behaviors to go astray (failed behaviors)
• A failure is a deviation from specified/desired behavior
  • Depends on the specification

Slide 6: Fault Types
• Crash: a component crashes in an undetectable fashion
  – E.g., a node crashes in a network without being detected by other nodes
• Fail-stop: a component fails in a detectable fashion
• Omission: a component does not perform a particular action
  – E.g., the receiver of a message does not reply with an ACK
• Timing: a component does not perform a particular action at the right time
  – E.g., the receiver of a message does not reply within a specific amount of time

Slide 7: Fault Types - Continued
• Performance: a component does not provide the required performance
  – E.g., congestion in communication channels
• Assertive: the communicated data is wrong (syntactically/semantically)
• Byzantine: a component behaves arbitrarily
  – E.g., a sensor arbitrarily changes its sampled data

Slide 8: Nature of Faults
• Permanent: faults corrupt a component permanently
  – E.g., crash
• Transient: faults corrupt a component momentarily; i.e., they appear once and then disappear
  – E.g., electrical surge, spurious interrupt, illegal opcode
• Intermittent: faults corrupt a component sporadically; i.e., they appear for a short time and disappear spontaneously
  – E.g., loose contact on a connector

Slide 9: Program Observation of Faults
• The ability of a program to observe faults
  – Detectable
    • E.g., fail-stop
  – Undetectable
    • E.g., transient faults
• Undetectable faults are hard to mask; mostly handled by self-stabilization

Slide 10: Fault-Tolerance
• Providing a desired level of functionality in the presence of faults
  – E.g., MC6800 provides a recovery mechanism when executing an illegal opcode
  – A distributed file system works despite the failure of a node
  – A nuclear reactor shuts down safely when something bad happens
• How do we define the “desired level of functionality”?
• Can programs tolerate all faults?
We have to define our expectation of a system in the presence of faults.

Slide 11: Fault-Tolerance - Continued
• Fault-tolerance is defined w.r.t. the system specification
• Example:
  – In the case of a power outage in a hospital, the emergency power will be activated to power safety-critical medical devices; however, no TV will be powered on
• Often a weaker form of the specification is satisfied in the presence of faults

Slide 12: Software Fault-Tolerance
• What is the difference between s/w and h/w fault-tolerance?
• Hardware faults often occur due to component failure
• Fault-tolerance can be achieved by replacing a component or having a stand-by spare
• Correct design is achievable for hardware systems
• Modular reasoning in hardware design

Slide 13: Software Fault-Tolerance Complexity
• Why is software fault-tolerance more complicated?
• The complexity of h/w systems is far less than that of s/w systems
  – The total number of states
  – Combination of components
• Software systems could easily have hundreds of millions of interacting computational components
• Combinatorial nature of software systems
  – Achieving correct design is difficult in software systems
  – Fault detection is much more difficult
  – Design inadequacy; i.e., design correctness is hard to achieve

Slide 14: Fault-Tolerance – A Cross-Cutting Concern
[Figure: a Program decomposed into Module 1 through Module n; Module 1 is further decomposed into Module 11 through Module 1i, and Module n into Module n1 through Module nj]
• Fault-tolerance should be provided at all levels
• Fault-tolerance should be added to the components in such a way that the entire program is fault-tolerant

Slide 15: Software Fault-Tolerance Mechanisms

Slide 16: Design Approaches
• Recovery blocks [Randell 75]
  – Wrap program with blocks of code for recovery
• Checkpointing and recovery [StromYemini 85]
  – In the absence of faults, save the state of the computations
  – In the presence of faults, restore the state of the system to a legitimate saved state
• State machine approach (Replication) [Schneider 90]
  – Server-client model
  – Servers as state machines
  – Replicate servers
[Randell 75] B. Randell, System Structure for Software Fault-tolerance, IEEE TSE, pages 220-232, 1975.
[StromYemini 85] R. E. Strom and S. Yemini, Optimistic recovery in distributed
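
The recovery-blocks approach named under Design Approaches can be illustrated with a minimal sketch. It assumes the commonly described structure: save the state on entry, try a primary alternate, check the result with an acceptance test, and reset the state before trying the next alternate. The names `recovery_block`, `fast_sort`, `simple_sort`, and `sorted_with_same_length` are hypothetical and do not come from the slides.

```python
import copy


class RecoveryBlockError(Exception):
    """Raised when every alternate fails the acceptance test."""


def recovery_block(state, acceptance_test, alternates):
    """Run alternates in order until one passes the acceptance test.

    state: a mutable dict, a hypothetical stand-in for program state.
    acceptance_test: function(state) -> bool.
    alternates: list of functions that mutate the state in place.
    """
    checkpoint = copy.deepcopy(state)  # state saved on entry to the block
    for alternate in alternates:
        try:
            alternate(state)
            if acceptance_test(state):
                return state
        except Exception:
            pass  # an exception counts as a failed alternate
        # Reset: discard the effects of the failed alternate.
        state.clear()
        state.update(copy.deepcopy(checkpoint))
    raise RecoveryBlockError("all alternates failed")


def fast_sort(state):
    # Hypothetical faulty primary: sorts but drops the last element.
    state["data"] = sorted(state["data"])[:-1]


def simple_sort(state):
    # Hypothetical simpler alternate.
    state["data"] = sorted(state["data"])


def sorted_with_same_length(expected_length):
    # Builds an acceptance test: right length and non-decreasing order.
    def test(state):
        data = state["data"]
        in_order = all(a <= b for a, b in zip(data, data[1:]))
        return len(data) == expected_length and in_order
    return test


state = {"data": [3, 1, 2]}
result = recovery_block(state, sorted_with_same_length(3), [fast_sort, simple_sort])
print(result["data"])  # [1, 2, 3]
```

Here the primary fails the acceptance test, the state is reset, and the second alternate succeeds. The acceptance test is deliberately cheap and partial; it does not prove the result correct.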
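
The checkpointing-and-recovery idea (save the state while no fault occurs, restore a saved state after a fault) can be sketched as follows. This is only an illustration for a single process, under the assumptions that inputs can be replayed and that an in-memory copy stands in for stable storage; `CheckpointedProcess`, `run`, and `fault_at` are hypothetical names. The injected fault mirrors the example of a process that restarts and resets all variables to zero.

```python
import copy


class CheckpointedProcess:
    """Hypothetical process that sums its inputs and checkpoints periodically."""

    def __init__(self, checkpoint_every=3):
        self.checkpoint_every = checkpoint_every
        self.state = {"processed": 0, "total": 0}  # volatile state
        self.stable = copy.deepcopy(self.state)    # stand-in for stable storage
        self.steps_executed = 0                    # bookkeeping for the demo only

    def step(self, value):
        self.state["total"] += value
        self.state["processed"] += 1
        self.steps_executed += 1
        # In the absence of faults, periodically save the state.
        if self.state["processed"] % self.checkpoint_every == 0:
            self.stable = copy.deepcopy(self.state)

    def crash_and_recover(self):
        # Fault: the process restarts and all variables are reset to zero.
        self.state = {"processed": 0, "total": 0}
        # Recovery: roll back to the last saved state.
        self.state = copy.deepcopy(self.stable)


def run(inputs, fault_at=None):
    """Process all inputs; optionally inject one fault before input fault_at."""
    proc = CheckpointedProcess()
    while proc.state["processed"] < len(inputs):
        i = proc.state["processed"]
        if i == fault_at:
            fault_at = None  # the fault strikes only once
            proc.crash_and_recover()
            continue
        proc.step(inputs[i])
    return proc.state["total"], proc.steps_executed


print(run([1, 2, 3, 4, 5, 6, 7]))              # (28, 7)
print(run([1, 2, 3, 4, 5, 6, 7], fault_at=5))  # (28, 9)
```

With the fault, the process resumes from the checkpoint taken after three inputs, so only two inputs are reprocessed rather than all five.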
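
The state machine approach (servers as state machines, replicated) can be sketched in the same spirit. The sketch assumes deterministic replicas that all receive the same commands in the same order, plus a majority vote over their outputs; a real system needs an agreement and ordering protocol, which is omitted here. `KeyValueServer`, `FaultyServer`, and `replicated_call` are hypothetical names.

```python
from collections import Counter


class KeyValueServer:
    """Deterministic state machine: same commands in the same order give the same outputs."""

    def __init__(self):
        self.store = {}

    def apply(self, command):
        op, key, *rest = command
        if op == "put":
            self.store[key] = rest[0]
            return "ok"
        return self.store.get(key)  # any other op is treated as "get"


class FaultyServer(KeyValueServer):
    """Hypothetical replica that always replies with wrong data."""

    def apply(self, command):
        super().apply(command)
        return "garbage"


def replicated_call(replicas, command):
    """Send the command to every replica and return the majority output."""
    outputs = [replica.apply(command) for replica in replicas]
    value, votes = Counter(outputs).most_common(1)[0]
    if votes <= len(replicas) // 2:
        raise RuntimeError("no majority among replicas")
    return value


replicas = [KeyValueServer(), FaultyServer(), KeyValueServer()]
replicated_call(replicas, ("put", "x", 42))
print(replicated_call(replicas, ("get", "x")))  # 42
```

Three replicas mask one arbitrarily wrong reply because the two correct replicas still form a majority.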

