© 2007 A.W. Krings, CS449/549 Fault-Tolerant Systems, Sequence 3

Redundancy
Hardware redundancy
  – add extra hardware for detecting or tolerating faults
Software redundancy
  – add extra software for detecting and possibly tolerating faults
Information redundancy
  – extra information, i.e. codes
Time redundancy
  – extra time for performing tasks for fault tolerance

Fault Tolerance
Error Detection
Damage Confinement
Error Recovery
Fault Treatment

Error Detection
ideal check
  – determined solely from the specification
  – complete, correct
  – check should be independent of the system
    » otherwise the check fails if the system crashes
acceptable check
  – cost
  – reasonable check, e.g. monitor rate of change
diagnostics
  – performed “by system on system components”
  – e.g. power-up diagnostics

Damage Confinement
an error might propagate and spread
identify state boundaries beyond which no information exchange has occurred
dynamically => hard
statically => e.g. firewall

Error Recovery
backward recovery
  – state is restored to an earlier state
    » requires checkpoints
  – most frequently used
  – recovery overhead
forward recovery
  – try to make state error-free
  – need accurate assessment of damage
  – highly application-dependent

Fault Treatment
if transient fault: restart system, go to error-free state
system repair
  – on-line, no manual intervention (automatic)
  – dynamic system reconfiguration
  – spare (hot or cold)
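
Backward recovery, as listed under Error Recovery, can be illustrated with a minimal sketch. It assumes the state is a plain Python dict and that an acceptance check is available for error detection. The names `CheckpointedState` and `run_step` are hypothetical and do not come from the course material.

```python
import copy


class CheckpointedState:
    """Hypothetical helper: keeps one checkpoint of a mutable state dict."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        # Save a known-good copy of the state; copying is the recovery overhead.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Backward recovery: restore the state saved at the last checkpoint.
        self.state = copy.deepcopy(self._checkpoint)


def run_step(cs, step, check):
    """Apply `step` to the state; roll back if the acceptance `check` fails."""
    step(cs.state)
    if check(cs.state):
        cs.checkpoint()   # state passed the check, so make it the new checkpoint
        return True
    cs.rollback()         # error detected, so return to the earlier state
    return False
```

A step that leaves the state in an unacceptable condition is undone without any assessment of the damage, which is one reason backward recovery is the more commonly used of the two approaches.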

Fault Coverage
measure of a system’s ability to perform:
  – fault detection
  – fault location
  – fault containment
  – (and/or fault recovery)
C = P(fault recovery | fault existence)
Note:
  – recovery implies that the system as a whole is operational
  – this does not imply that a “repair” occurred
  – e.g. a duplex system with a benign fault can recover and continue operation on one non-faulty processor

Hardware Redundancy
Passive (static)
  – uses fault masking to hide the occurrence of a fault
  – no action from the system is required
  – e.g. voting
Active (dynamic)
  – uses comparison for detection and/or diagnosis
  – remove faulty hardware from system => reconfiguration
Hybrid
  – combine both approaches
  – masking until diagnosis is complete
  – expensive, but better for achieving higher reliability

Passive Hardware Redundancy
N-Modular Redundancy (NMR)
  – N independent modules replicate the same function
    » parallelism
  – results are voted on
  – requirements: N >= 3
TMR (Triple Modular Redundancy)
Voter (V):
  • is a single point of failure
  • could be very simple,
  • but who guards the guard?

Who guards the guards?
Replicate voters (V, V, V)
Restoring organ: so called since it produces 3 correct outputs even if one input is faulty.
eliminates the single point of failure

Who guards the guards?
Multistage TMR with replicated voters
[Figure: multistage TMR, six voters (V)]

Voting
if inputs are independent, the NMR can mask up to ⌊(N−1)/2⌋ faults
e.g. 1-bit majority voter (3 AND gates ORed)
[Figure: inputs I1, I2, I3; three AND gates (&) ORed (+) into output Z]
Z = 1 if 2 of 3 inputs are 1
Z = 0 if 2 of 3 inputs are 0
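
The 1-bit majority voter (three AND gates ORed) can be written out directly. The sketch below also generalizes it to N module outputs, assuming outputs are hashable values compared for exact equality. The function names are illustrative, and `max_masked_faults` encodes the standard ⌊(N−1)/2⌋ bound for majority voting.

```python
from collections import Counter


def majority_bit(i1, i2, i3):
    """1-bit majority voter: three 2-input AND gates whose outputs are ORed."""
    return (i1 & i2) | (i1 & i3) | (i2 & i3)


def nmr_vote(outputs):
    """Return the value produced by a strict majority of modules, else None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


def max_masked_faults(n):
    """Number of faulty modules an N-module majority vote can mask."""
    return (n - 1) // 2
```

For example, `majority_bit(1, 1, 0)` yields 1 and `nmr_vote([7, 7, 9])` yields 7, so a single faulty input is masked in both cases.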

Flux Summing
Inherent property of a closed-loop control system
If one module becomes faulty, the remaining modules compensate automatically.

Active Hardware Redundancy
Duplicate and Compare
  – can only detect, but NOT diagnose
    » i.e. fault detection, no fault tolerance
  – may order shutdown
  – comparator is a single point of failure
    » simple implementation: 2-input XOR for single-bit compare
[Figure labels: In, M1, M2, C, Out, Agree]

Active Hardware Redundancy
[Figure: Johnson 1989]

Active Hardware Redundancy
Standby sparing
  – only one module is driving outputs
  – other modules are
    » idle => hot spares
    » shut down => cold spares
  – error detection => switch to a new module
  – hot spares
    » no power-up delays
    » power consumption
  – cold spares
    » opposite of hot spares

[Figure: Johnson 1989]

Active Hardware Redundancy
Pair and Spare
  – duplication combined with compare & spare
  – 2 modules are always on-line
  – 2-of-N switch
  – pairs are often combined

[Figure: Johnson 1989]

Hybrid Hardware Redundancy
NMR with spares
  – N active + S spare modules (off-line)
  – voting and comparison
  – replace erroneous module from spare pool
  – maintains N constant
  – uses N-of-(N+S) switch
example: 2 faults at 2 different times
  – hybrid solution => N = 4 (total modules, e.g. TMR plus 1 spare)
  – passive solution => N = 5

[Figure: Johnson 1989]
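
The NMR-with-spares scheme can be sketched as a small simulation. It assumes modules are plain callables, that a fault shows up as a wrong output value, and that the voter itself is fault-free. The class name `NMRWithSpares` and its methods are hypothetical.

```python
from collections import Counter


class NMRWithSpares:
    """Hypothetical model: N active modules, S off-line spares, vote and compare."""

    def __init__(self, active, spares):
        self.active = list(active)   # modules currently feeding the voter
        self.spares = list(spares)   # off-line modules waiting in the spare pool

    def step(self, x):
        outputs = [m(x) for m in self.active]
        value, count = Counter(outputs).most_common(1)[0]
        if count <= len(outputs) // 2:
            raise RuntimeError("no majority: fault not masked")
        # Comparison: a module that disagrees with the voted value is swapped
        # for a spare, which keeps the number of active modules constant.
        for i, out in enumerate(outputs):
            if out != value and self.spares:
                self.active[i] = self.spares.pop(0)
        return value
```

With three active modules and one spare, the first fault is masked by the vote and the faulty module is replaced. A second, later fault is then still masked by the three active modules. That matches the slide's example of four modules in the hybrid case versus five in the purely passive case.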

Hybrid Hardware Redundancy
Self-purging NMR (Joh89 Fig 3.17)
  – all modules are active
  – exclude modules on error detection
    » vote & compare
  – N will decrease with faults

[Figure: Johnson 1989]

Hybrid Hardware Redundancy
Triple-Duplex (Johnson 1989 Fig. 3.26, page 80)
  – redundant self-checking
  – each node is really 2 modules + comparator
  – self-disable in event of error
  – “simulate” benign behavior
  – triple-triplex used in Boeing 777 primary flight
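
The triple-duplex idea (each node is two modules plus a comparator, and a node disables itself on mismatch) can be sketched as follows. This is only an illustration under simplifying assumptions: modules are callables, `None` stands for "no output", and the comparator and voter are fault-free. The names `SelfCheckingPair` and `triple_duplex` are hypothetical and are not taken from Johnson 1989.

```python
from collections import Counter


class SelfCheckingPair:
    """Hypothetical node: two modules plus a comparator."""

    def __init__(self, m1, m2):
        self.m1, self.m2 = m1, m2
        self.enabled = True

    def output(self, x):
        if not self.enabled:
            return None
        a, b = self.m1(x), self.m2(x)
        if a != b:
            # Comparator mismatch: the node disables itself, so its fault
            # looks benign (a missing output rather than a wrong one).
            self.enabled = False
            return None
        return a


def triple_duplex(nodes, x):
    """Vote over the outputs of the nodes that are still enabled."""
    outputs = [o for o in (n.output(x) for n in nodes) if o is not None]
    if not outputs:
        raise RuntimeError("all nodes disabled")
    return Counter(outputs).most_common(1)[0][0]
```

Because a faulty node withdraws its output instead of producing a wrong value, the voter in this sketch only ever sees outputs that passed a pairwise comparison.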