Duke ECE 254 - Concepts in Fault Tolerance - D2162233


Course: ECE 254 - Fault-Tolerant and Testable Computing Systems

Pages 14


ECE 254 / CPS 225
Fault Tolerant and Testable Computing Systems
Concepts in Fault Tolerance
Copyright 2004 Daniel J. Sorin
Duke University

Outline
• Basic Concepts
  – Forward Error Recovery
  – Backward Error Recovery
  – Redundancy in its Many Forms
• Physical Redundancy
• Error Detecting/Correcting Codes
• Temporal Redundancy Techniques
• Backward Error Recovery Techniques

Recovering from Errors
• Two basic approaches
  – Forward Error Recovery (FER)
  – Backward Error Recovery (BER)
• FER: continue to go forward in presence of errors
  – Use redundancy to mask effects of errors
  – E.g., have a co-pilot that can seamlessly take over airplane
• BER: go backward to recover from errors
  – Use redundancy to enable recovery to saved good state of system
  – E.g., go back to old saved version of file that you corrupted

Forward Error Recovery
• Canonical example: triple modular redundancy (TMR)
  – Majority voter chooses correct output
  – Masks error in any one of the three modules
[Figure: three modules feeding a majority voter]

Backward Error Recovery
• Canonical examples
  – Periodic checkpoint/recovery
  – Logging of changes to system state
• BER designs tend to be more complicated
  – We’ll spend more time on them later in this part of the course

Very Rough Comparison: FER vs. BER

Feature                  FER                BER
Fault-free performance   Some degradation   Little degradation
Performance if faults    No slowdown        Slow recovery
Hardware cost            Higher             Lower
Design complexity        Lower              Higher

Performance of FER vs. BER
[Figure: performance plotted against fault rate, with one curve for FER and one for BER]
Warning: do not take this graph too seriously. The relative heights of the curves and their shapes are gross estimates that do not correspond to any particular system.
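The majority voting described in the Forward Error Recovery slide can be sketched in a few lines. This is only an illustration in software, not the hardware voter from the course; the names `majority_vote`, `tmr`, `good`, and `faulty` are hypothetical, and the sketch assumes module outputs are hashable values that can be compared for equality.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of modules.

    Raises ValueError when no value has a strict majority, which is the
    case the voter cannot mask (e.g. two of three modules are wrong in
    different ways).
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority: too many modules disagree")
    return value


def tmr(module_a, module_b, module_c, x):
    """Run three replicas on the same input and vote on their outputs."""
    return majority_vote([module_a(x), module_b(x), module_c(x)])


def good(x):
    return x * x


def faulty(x):
    # Hypothetical replica that produces an erroneous result.
    return x * x + 1


print(tmr(good, faulty, good, 7))  # 49: the single error is masked
```

With one erroneous replica the caller sees only the correct result and no recovery step runs, which is the "mask the error and keep going forward" behavior the slide describes.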
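The periodic checkpoint/recovery idea from the Backward Error Recovery slide can be sketched as follows. This is a minimal in-memory illustration under stated assumptions: the class name `CheckpointedState` and its methods are hypothetical, state is an ordinary Python object, and a deep copy stands in for whatever storage a real system would use for its saved good state.

```python
import copy


class CheckpointedState:
    """Working state plus the most recent checkpoint assumed to be good."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        # Save a copy of the current state as the new recovery point.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Discard the current state and go back to the saved copy.
        self.state = copy.deepcopy(self._checkpoint)


doc = CheckpointedState({"text": "draft 1"})
doc.state["text"] = "draft 2"
doc.checkpoint()                     # "draft 2" is now the saved good state
doc.state["text"] = "@@corrupted@@"  # an error corrupts the working copy
doc.rollback()                       # go backward to recover
print(doc.state["text"])             # draft 2
```

Any work done between the last checkpoint and the error is lost on rollback, which is one reason the comparison slide lists "slow recovery" under BER.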
System Design Space
[Figure: design space with regions labeled Backward Error Recovery, Forward Error Recovery, and PCs and laptops]
Systems tend to get only 2 out of 3 features

Types of Redundancy
• Physical (spatial)
• Temporal
• Design
• Information

Physical (Spatial) Redundancy
• Physically replicate a module
  – Most obvious approach
• Design issues
  – How many replicas are needed?
    » For error detection?
    » For error correction?
  – How are errors detected/corrected?
  – Is the redundancy “active” or “passive”?
• Canonical example: triple modular redundancy (TMR)
  – 3 replicas
  – Errors corrected by majority voter
  – Redundancy is passive (no special action taken if error detected)

Temporal Redundancy
• Replicate the actions on a module using the same module, but at a different time
• Effective for tolerating transient faults
  – Can this help at all for hard faults?
• We’re going to spend more time on this later!

Design Redundancy
• Use multiple different designs to guard against a fault in any of them
• Disadvantage: very costly to have multiple different designs
  – Also: doesn’t tolerate fault in design specification; only tolerates fault in implementation of that specification
• Examples
  – “N-version programming”: have N design teams develop N different versions of a piece of software
• We’ll come back to this later in the course, mostly in the context of software fault tolerance
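The temporal redundancy idea above (repeat the same action on the same module at a different time) can be sketched as re-execution with comparison. This is an assumed, simplified scheme rather than a technique taken from the slides: `run_with_reexecution` and `FlakyAdder` are hypothetical names, and the transient fault is simulated deterministically as a single bit flip on one chosen call.

```python
def run_with_reexecution(operation, x, max_attempts=3):
    """Run operation(x) twice per attempt and accept only agreeing results.

    Returns (result, attempt_number). Raises RuntimeError if the two runs
    never agree within max_attempts.
    """
    for attempt in range(1, max_attempts + 1):
        first = operation(x)
        second = operation(x)
        if first == second:
            return first, attempt
    raise RuntimeError("results never agreed; fault may not be transient")


class FlakyAdder:
    """Hypothetical module that suffers one transient fault on a chosen call."""

    def __init__(self, faulty_call):
        self.calls = 0
        self.faulty_call = faulty_call

    def __call__(self, x):
        self.calls += 1
        result = x + 1
        if self.calls == self.faulty_call:
            result ^= 0x4  # simulate a transient single-bit flip
        return result


print(run_with_reexecution(FlakyAdder(faulty_call=1), 10))  # (11, 2)
```

The first attempt sees a mismatch (15 versus 11) and is discarded; the second attempt agrees on 11. In this naive form, a module with a permanent fault that returns the same wrong value on every run would pass the comparison, which is worth keeping in mind when considering the slide's question about hard faults.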
Information Redundancy
• For a given k-bit piece of information, add r check bits to it that make it possible to detect/correct errors in the original k-bit information
• Example: parity bit
  – By adding a single bit to a word of information, we can detect any single-bit error in it
• Example: checksum
  – Sender computes a number (checksum) that nearly uniquely identifies long stream of information that it sends, and it sends checksum along with information
  – Receiver uses same algorithm to compute checksum on information it receives (not including checksum). If receiver’s checksum matches sender’s checksum, then information is very likely error-free
• We’ll talk much more about this soon

Outline
• Basic Concepts
• Physical Redundancy
• Error Detecting/Correcting Codes
• Re-Execution Techniques
• Backward Error Recovery Techniques

Physical Redundancy: TMR
• Strengths
  – Tolerates an error in any single module
  – Tolerates soft and hard errors
  – Simple design
  – Negligible performance penalty, even when faults occur
• Weaknesses
  – Can’t tolerate multiple faults
    » Can’t tolerate any faults after a latent hard fault
  – Expensive hardware (3x cost)
  – Single point of failure at voter
  – Can’t tolerate errors due to design faults … why not?

Physical Redundancy: NMR
• N-modular redundancy (N is an odd integer)
  – Why is N odd?
• Can tolerate more errors than TMR
  – Tolerates up to (N-1)/2 errors
• Cost = N * cost of module
• Still has single point of failure at voter!
  – But voter is simple and can be designed to be very robust
• One solution to single voter problem
  – “Restoring organ” = TMR with triplicated voter
  – How does this help?

Physical Redundancy: Active vs. Passive
• NMR is passive, since the system doesn’t really do anything different when an error is detected
• With active redundancy, the system detects the error, locates/diagnoses it, and reconfigures to tolerate it
• Standby sparing has at least two redundant modules
  – Operational module has internal error detection mechanism
  – If a hard error is detected, the system reconfigures to use a spare
    » “Cold standby”: standby was inactive and must be warmed up
    » “Hot standby”: standby was active and is in correct state

More Active Redundancy
• Pair-and-spare
  – Like standby sparing, except each module is a pair
  – This pair compares outputs to detect errors
  – If error detected, a spare module (i.e., pair) is configured in
[Figure: pairs of modules, each pair with a comparator, connected through a switch]

Hybrid Physical Redundancy
• Combine passive and active redundancy
• Example: NMR with spares
  – Let’s say we have 5
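Returning to the Information Redundancy slide, the parity bit and checksum examples can be sketched as below. The slides do not say which parity convention or checksum algorithm is meant, so this sketch assumes even parity and a toy additive checksum (sum of bytes modulo 256); the function names are hypothetical.

```python
def parity_bit(word):
    """Even parity: the check bit makes the total number of 1 bits even."""
    return bin(word).count("1") % 2


def parity_ok(word, check_bit):
    """Recompute parity on the received word and compare with the check bit."""
    return parity_bit(word) == check_bit


def checksum(data):
    """Toy additive checksum (assumed here, not taken from the slides)."""
    return sum(data) % 256


word = 0b1011_0010
check = parity_bit(word)

one_flip = word ^ 0b0000_1000   # single-bit error
two_flips = word ^ 0b0000_1001  # double-bit error

print(parity_ok(one_flip, check))   # False: single-bit error detected
print(parity_ok(two_flips, check))  # True: double-bit error goes unnoticed

sent = b"fault"
print(checksum(sent) == checksum(b"gault"))  # False: corruption detected
print(checksum(b"ab") == checksum(b"ba"))    # True: reordering is missed
```

The last line of each pair shows why a matching check only makes an error unlikely rather than impossible: a single parity bit misses any even number of flipped bits, and this simple checksum misses any change that leaves the byte sum unchanged.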
