CSCI 8530 Advanced Operating Systems
Reliability and Fault Tolerance

Fault Tolerance

The failure rates of real-time systems must be extremely small, even smaller than the failure rates of the components from which they are built. To achieve this goal they must be fault tolerant: they must be able to continue operating even if a limited subset of their hardware or software fails.
Graceful degradation of service is also necessary in the event of failure. As the system starts running out of means to deal with failures, the less critical tasks are eliminated. In the limit, system failure occurs, which can be catastrophic. One goal of the designer is to reduce the probability of such system failure to an acceptable level.

Faults and Errors

A fault is a defect that can cause a malfunction. It can be in hardware or software. For example, the Pentium floating-point division bug was a hardware fault. The C library gets() function has a fault that can cause buffer overrun.
An error is the manifestation of a fault. The Pentium floating-point fault only causes errors in rare cases, when certain operands are used. The use of gets() only causes an error when an input line exceeds the size of the buffer.

Fault and Error Latencies

Fault latency is the time between the onset of a fault and its manifestation as an error. This can be a very long time, since faults are usually invisible and are typically only exposed when an error occurs.
Error latency is the time between when an error is produced and when it is either recognized as an error or causes system failure.

Error Recovery

Error recovery is the process by which a system attempts to recover from the effects of an error.
Forward error recovery masks the error in some manner, eliminating the need to repeat any work.
Backward error recovery causes the system to return to a previous state and then repeat the flawed work using different hardware or software. This is sometimes called time redundancy.

Redundancy

Fault-tolerant systems must use redundant
components in order to continue operation in spite of failure.
N-modular redundancy (NMR) uses N hardware components (e.g., processors), with N usually odd. To support up to m failed units, N must be at least 2m + 1.
Software redundancy requires multiple different software components to perform the same computation. Note that this is not the same as running the same algorithm multiple times; multiple different algorithms for the same computation are required.

Hardware Redundancy

Redundant hardware components can be used in a variety of schemes, including at least these:

- Serial replacement: With 3 redundant hardware units, unit 1 is used until it fails, then unit 2 is switched in (perhaps while unit 1 is repaired or replaced). If unit 2 fails before unit 1 is again available, unit 3 can be used.
- Parallel operation with majority logic: All units operate at the same time. As long as the outputs of all units are identical, no failure is identified. When the outputs differ, the majority output is used.

Software Redundancy

N-version programming uses N different versions of software to perform a computation. These different versions may use different algorithms, data structures, etc., and are likely written by different teams of programmers. Failure occurs only when a majority of the N versions fail on some input.
Another approach is to use N versions but only run one at a time. An acceptance test is run on the result of the first version, and if this fails, a second version is used.

Developer Redundancy

Another, less formal type of redundancy occurs as a result of multiple developers working on the same hardware or software component. The expectation is that multiple developers examining the same hardware/code/algorithm will be more likely to identify a fault than a single developer.
This is an adaptation of the old adage "two heads are better than one." (NMR would probably require three heads!)
This is one argument presented in support of the reliability of open-source software: that it is developed,
modified, and maintained by a group of developers, not a single individual.

Detecting an Error

How do we detect an error?
With majority logic, an error is indicated if two or more different results are obtained from multiple hardware components or software algorithms that should give the same result.
With serial logic, we can use an acceptance test on results. Example:

    x = sqrt(z);
    if (fabs(x * x - z) > eps) ERROR;

Designing for Reliability

Reliability cannot be added on to a system; the system must be designed to be reliable.
For obvious reasons, it is inappropriate to provide complete redundancy for every hardware and software component in a system. The cost would probably be prohibitive.
Instead, system designers must identify those components that are most likely to fail and then provide redundancy for those.
Likewise, it is appropriate to identify those components whose failure would result in catastrophic system failure and provide redundancy for them. Primary memory, for example, has no moving parts, but its failure would be catastrophic.

Examples

Hardware components with moving parts generally have a higher failure probability than those without moving parts. Thus it is more appropriate to use RAID for disks than to use multiple redundant CPUs.
Provably correct algorithms do not normally require redundancy. For example, most sorting algorithms have formal correctness proofs. Implementations of algorithms (especially complex algorithms), however, may be flawed, and so multiple implementations may be appropriate in a reliable system.

Techniques Used for Reliability

Several well-known techniques are frequently used in reliable systems. Some of these include:

- knowing the most probable failure mode
- applying acceptance testing to results
- varying arguments slightly
- examining status results in all cases
- verifying argument acceptability
- using self-identifying structures
- providing audit trails
- using information redundancy

Know the Most Probable Failure Mode

Before actually implementing the hardware and
software for a system, know the most probable faults that might be present in the system.
Once the most probable faults are identified, the appropriate levels of redundancy and error correction (forward or backward) can be selected.
Providing recovery from improbable failures isn't wrong, but failure to anticipate failures of high probability is always unacceptable.

Use Acceptance Testing on Results

Check the results of computation as a sanity check.
Example: A ship may compute its position using GPS Global