MSU CSE 870 - Fault Tolerant Computing

Fault Tolerant Computing
Acknowledgements
Objectives
Reliability and Recovery
Dependability
Terms
Terms (cont’d)
Slide 8
Design Decisions
Safety vs Reliability
Types of Faults
Types of Faults (cont’d)
Examples
Hierarchical Failure Masking
Group Failure Masking
Group Failure Masking (cont’d)
Some Formalism
Computation
Specification
Examples of specifications
Examples of specifications (cont’d)
Slide 22
Safety and Liveness
Liveness Specification
Faults
Faults (cont’d)
Representation of Faults
Representation of Faults (cont’d)
Representation of Permanent Faults
Slide 30
Goal of Fault-tolerance Design
Slide 32
Overall Design
Overall Design (cont’d)

Fault Tolerant Computing

Acknowledgements
• The following lectures are based on materials from the following sources:
  – S. Kulkarni
  – J. Rushby
  – J. Knight

Objectives
• Exposure to the area of critical systems
• What it means to have a fault-tolerant system
• Specification techniques for representing critical properties
• How to design fault tolerance into a system

Reliability and Recovery
• Reliability
  – Probability that a system will not fail at time t if it was operating properly at time 0.
• Recovery
  – Process of restoring consistency after a failure

Dependability
• Dependability:
  – How much one may rely on the quality of services delivered
  – Quality of service depends on:
    • Correctness
    • Continuity of service

Terms
• Failure: malfunction
• Fault: condition that might lead to failure
• Error: an incorrect response, which indicates that a fault is present
• Faults may be:
  – permanent
  – intermittent
  – transient

Terms (cont’d)
• Graceful degradation
  – system is operational, but degraded, after faults
• Fail-safe
  – system execution is safe after the fault
• Stabilizing
  – system recovers to a consistent state after the fault
• Masking
  – the user of the system does not see any unintended behavior due to faults

Terms (cont’d)
• Mean Time to Failure (MTTF)
  – expected value of system failure time
• Mean Time to Repair (MTTR)
  – expected value of system repair time
• Mean Time Between Failures (MTBF)
  – expected time between successive failures
  – MTBF = MTTF + MTTR
• Fault tolerance
  – ability to continue operation after occurrence of faults

Design Decisions
• Fault detection
• Fault confinement
• Fault diagnosis
• Repair and/or reconfigure
• Redundancy
  – Hardware: extra hardware
  – Information: redundancy bits
  – Software: diagnosis software, extra software
  – Temporal: re-execute software to recover from intermittent faults

Safety vs Reliability
• Reliability: concerns occurrence of failures
  – System failures defined in terms of system services
• Safety: concerns occurrence of accidents
  – Unplanned events that result in death, injury, illness, damage, loss of property, or environmental harm
  – Defined in terms of external consequences

Types of Faults
• Omission failure
  – server omits to respond to an input (fail-silent failure)
• Timing failure
  – response is functionally correct but untimely
  – can be an early timing failure or a late timing failure (performance failure)
• Response failure
  – incorrect response
  – output value incorrect (value failure)
  – state transition incorrect (state transition failure)

Types of Faults (cont’d)
• Crash failure
  – after a first omission, a server omits to produce output until it restarts
• Amnesia crash
  – server restarts in a predefined initial state that does not depend on the inputs seen before the crash
• Partial amnesia crash
  – some part of the state is the same as before the crash; the rest is in a predefined initial state
• Pause crash
  – server restarts in the state it had before the crash
• Halting crash
  – crashed server never restarts

Examples
• OS crash followed by a reboot in the initial state
• Database server crash followed by recovery of a database state that reflects all transactions before the crash
• Communication server occasionally loses messages but does not delay messages (omission failure)
• Excessive message transmission or message processing delay (communication performance failure)
• Alteration of a message due to random noise during transmission (response failure)

Hierarchical Failure Masking
• A failure of a certain type at a lower level can propagate as a different kind of failure at a higher level of abstraction.
• A value error at the physical layer (e.g., 2 bits corrupted) propagates as an omission error at the data link layer

Group Failure Masking
• To ensure a service remains available to clients despite server failure, one can implement a group of redundant, physically independent servers.
• The group masks the failure of a member.
• Hierarchical masking requires users to implement resource failure-masking attempts as exception handling code.
• In group masking, individual member failures are entirely hidden from users by group management mechanisms.

Group Failure Masking (cont’d)
• Group output is a function of outputs of individual group members.
  – fastest member
  – distinguished member
  – result of majority vote
• A server able to mask any k concurrent member failures is termed k-fault tolerant
  – e.g., a primary/standby group of k servers, with members ranked as primary, 1st backup, 2nd backup, ..., can mask k-1 failures.

Some Formalism
Programs
• A program consists of:
  – a finite set of variables
  – a finite set of actions of the form guard → statement, where
    • guard is a boolean expression over program variables, and
    • statement updates program variables
• Modifications
  – guards may contain receive from channels
  – statements may contain sends/receives

Computation
• A program computation is a "fair" sequence of steps, where in each step an action whose guard is true has its statement executed
  – In one step, multiple guards may be true.
  – If the guard of some action is true continuously, then that action is eventually chosen for execution.
Notes
• A program computation is a sequence of states

Specification
• A specification is a set of sequences of states.
• What does it mean for a program p to satisfy a specification sp from a set of states S?
  – every computation of p that starts from a state in S is in sp.

Examples of specifications
• Let S be a predicate.
  – Invariant
    • Invariant(S) = {seq : S is true in each state of seq}
    • A sequence seq is in Invariant(S) iff S is true in each state of seq.
  – Closure
    • Closed(S) = {seq : ∀i : i ≥ 0 : ‘S is true in the ith state of seq’ ⇒ ‘S is true in the (i+1)th state of
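
The invariant and closure specifications can be made concrete by checking them on finite state sequences. The following is only a sketch: specifications are sets of possibly infinite sequences, so a finite check covers a prefix at best. The names `in_invariant`, `in_closed`, and `satisfies` are hypothetical and do not come from the slides; closure is read in the usual way, as "once S holds, it holds in the next state as well".

```python
from typing import Callable, Iterable, Sequence, TypeVar

State = TypeVar("State")


def in_invariant(seq: Sequence[State], pred: Callable[[State], bool]) -> bool:
    """seq is in Invariant(pred) iff pred is true in each state of seq."""
    return all(pred(s) for s in seq)


def in_closed(seq: Sequence[State], pred: Callable[[State], bool]) -> bool:
    """seq is in Closed(pred) iff, for every i >= 0,
    pred(seq[i]) implies pred(seq[i + 1])."""
    return all((not pred(a)) or pred(b) for a, b in zip(seq, seq[1:]))


def satisfies(
    computations: Iterable[Sequence[State]],
    in_start_set: Callable[[State], bool],
    spec: Callable[[Sequence[State]], bool],
) -> bool:
    """A program satisfies spec from S iff every computation that
    starts from a state in S is in spec. The computations are
    supplied explicitly here (an assumption of this sketch)."""
    return all(spec(c) for c in computations if c and in_start_set(c[0]))


def is_even(x: int) -> bool:
    return x % 2 == 0
```

For example, `in_closed([1, 3, 2, 4], is_even)` holds because the predicate stays true once it becomes true, while the same sequence is not in `Invariant(is_even)` because its first state is odd.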

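The program model from the "Some Formalism" and "Computation" slides (a finite set of variables plus guarded actions, executed one enabled action per step) can be simulated in a few lines. This is a minimal sketch under stated assumptions: the function `run`, the dictionary representation of state, and the round-robin scheduler are all illustrative choices, not part of the lecture material. Round-robin is used only because it is one simple way to obtain fairness: an action whose guard stays true is executed within at most one full scan of the action list.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, int]
Guard = Callable[[State], bool]
Statement = Callable[[State], State]
Action = Tuple[Guard, Statement]


def run(state: State, actions: List[Action], max_steps: int = 100) -> List[State]:
    """Execute a guarded-command program and return the sequence of states.

    In each step, one action whose guard is true has its statement
    executed. The scan resumes just after the last executed action
    (round-robin), so a continuously enabled action is eventually chosen.
    """
    trace = [dict(state)]
    start = 0
    n = len(actions)
    for _ in range(max_steps):
        for offset in range(n):
            idx = (start + offset) % n
            guard, statement = actions[idx]
            if guard(state):
                state = statement(state)
                trace.append(dict(state))
                start = idx + 1
                break
        else:
            # No guard is true: no further step is possible.
            break
    return trace


# Hypothetical example program with variables x and y:
#   x < 3  ->  x := x + 1
#   y < x  ->  y := y + 1
example_actions: List[Action] = [
    (lambda s: s["x"] < 3, lambda s: {**s, "x": s["x"] + 1}),
    (lambda s: s["y"] < s["x"], lambda s: {**s, "y": s["y"] + 1}),
]
example_trace = run({"x": 0, "y": 0}, example_actions)
```

The returned trace is a sequence of states, which matches the note on the "Computation" slide that a program computation is a sequence of states.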

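The "Group Failure Masking (cont’d)" slide lists three ways to derive the group output from member outputs. Two of them are sketched here, assuming an omitted response is represented as `None`; the function names are hypothetical. The majority rule relies on a standard observation that is not stated in the slides: with 2k+1 members, at most k incorrect or missing outputs still leave k+1 matching correct ones.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(outputs: Sequence[Optional[Hashable]]) -> Optional[Hashable]:
    """Return the value produced by a strict majority of all group
    members, or None if no value reaches a majority."""
    counts = Counter(o for o in outputs if o is not None)
    if not counts:
        return None
    value, votes = counts.most_common(1)[0]
    return value if votes > len(outputs) // 2 else None


def distinguished_member(outputs: Sequence[Optional[Hashable]]) -> Optional[Hashable]:
    """Primary/standby rule: outputs are ranked primary, 1st backup,
    2nd backup, ...; the highest-ranked member that responded wins.
    With k members this masks up to k-1 omissions."""
    for output in outputs:
        if output is not None:
            return output
    return None
```

Majority voting can mask incorrect values as well as omissions, while the primary/standby rule as written only masks members that fail to respond.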