UNO CSCI 8530 - Reliability and Fault-Tolerance - D1624111

School name University of Nebraska Omaha

Course CSCI 8530

Pages 13

This preview shows page 1-2-3-4 out of 13 pages.

CSCI 8530
Advanced Operating Systems
Reliability and Fault-Tolerance

Fault Tolerance

The failure rates of real-time systems must be extremely small, even smaller than the failure rates of the components from which they are built!

To achieve this goal they must be fault-tolerant: they must be able to continue operating even if a limited subset of their hardware or software fails.

Graceful degradation of service is also necessary in the event of failure. As the system starts running out of means to deal with failures, the less critical tasks are eliminated. In the limit, system failure occurs, which can be catastrophic. One goal of the designer is to reduce the probability of such system failure to an acceptable level.

Faults and Errors

A fault is a defect that can cause a malfunction. It can be in hardware or software. For example, the Pentium floating-point division bug was a hardware fault. The C-library gets function has a fault that can cause buffer overrun.

An error is the manifestation of a fault. The Pentium floating-point fault only causes errors in rare cases when certain operands are used. The use of gets only causes an error when an input line exceeds the size of the buffer.

Fault and Error Latencies

Fault latency is the time between the onset of a fault and its manifestation as an error. This can be a very long time, since faults are usually invisible, and are typically only exposed when an error occurs.

Error latency is the time between when an error is produced and when it is either recognized as an error or causes system failure.

Error Recovery

Error recovery is the process by which a system attempts to recover from the effects of an error.

Forward error recovery masks the error in some manner, eliminating the need to repeat any work.

Backward error recovery causes the system to return to a previous state and then repeat the flawed work using different hardware or software.
This is sometimes called time redundancy.

Redundancy

Fault-tolerant systems must use redundant components in order to continue operation in spite of failure.

N-modular redundancy (NMR) uses N hardware components (e.g. processors), with N usually odd. To support up to m failed units, N must be at least 2m + 1.

Software redundancy requires multiple different software components to perform the same computation. Note that this is not the same as running the same algorithm multiple times; multiple different algorithms for the same computation are required.

Hardware Redundancy

Redundant hardware components can be used in a variety of schemes, including at least these:

- Serial replacement. With 3 redundant hardware units, unit 1 is used until it fails, then unit 2 is switched in (perhaps while unit 1 is repaired or replaced); if unit 2 fails before unit 1 is again available, unit 3 can be used.
- Parallel operation, with majority logic. All units operate at the same time. As long as the outputs of all units are identical, no failure is identified. When the outputs differ, the majority output is used.

Software Redundancy

N-version programming uses N different versions of software to perform a computation. These different versions may use different algorithms, data structures, etc. and are likely written by different teams of programmers. Failure occurs only when a majority of the N versions fail on some input.

Another approach is to use N versions, but only run one at a time. An acceptance test is run on the result of the first version, and if this fails, a second version is used.

Developer Redundancy

Another, less formal, type of redundancy occurs as a result of multiple developers working on the same hardware or software component.

The expectation is that multiple developers examining the same hardware/code/algorithm will be more likely to identify a fault than a single developer.
This is an adaptation of the old adage “two heads are better than one.” (NMR would probably require three heads.)

This is one argument presented in support of the reliability of open-source software: that it is developed, modified, and maintained by a group of developers, not a single individual.

Detecting an Error

How do we detect an error?

With majority logic, an error is indicated if two (or more!) different results are obtained from multiple hardware components or software algorithms that should give the same result.

With serial logic, we can use an acceptance test on results.

Example:

    x = sqrt(z);
    if (fabs(x*x-z) > eps) ERROR…

Designing for Reliability

Reliability cannot be “added on” to a system; the system must be designed to be reliable.

For obvious reasons, it is inappropriate to provide complete redundancy for every hardware and software component in a system. The cost would probably be prohibitive.

Instead, system designers must identify those components that are most likely to fail, and then provide redundancy for those.

Likewise, it is appropriate to identify those components whose failure would result in catastrophic system failure, and provide redundancy for them. Primary memory, for example, has no moving parts, but failure would be catastrophic.

Examples

Hardware components with moving parts generally have a higher failure probability than those without moving parts. Thus it is more appropriate to use RAID for disks than to use multiple redundant CPUs.

Provably correct algorithms do not normally require redundancy.
For example, most sorting algorithms have formal correctness proofs.

Implementations of algorithms (especially complex algorithms), however, may be flawed, and so multiple implementations may be appropriate in a reliable system.

Techniques Used for Reliability

Several well-known techniques are frequently used in reliable systems. Some of these include

- knowing the most probable failure mode
- applying acceptance testing to results
- varying arguments slightly
- examining status results in all cases
- verifying argument acceptability
- using self-identifying structures
- providing audit trails
- using information redundancy

Know the Most Probable Failure Mode

Before actually implementing the hardware and software for a system, know the most probable faults that might be present in the system.

Once the most probable faults are identified, then the appropriate levels of redundancy and error correction (forward or backward) can be selected.

Providing recovery from improbable failures isn’t wrong, but failure to anticipate failures of high probability is always unacceptable.

Use Acceptance Testing on Results

Check the results of computation, as
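
The serial scheme described under Software Redundancy (run one version, apply an acceptance test, fall back to the next version on failure) can be sketched in C around the square-root acceptance test from Detecting an Error. This is only an illustration under stated assumptions: the names sqrt_faulty, sqrt_newton, sqrt_bisect, acceptable, and checked_sqrt are hypothetical and do not come from the slides, the faulty version is a deliberate stand-in for a flawed implementation, and the absolute value is computed by hand in place of fabs so the sketch needs no math library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flawed version: stands in for a buggy implementation. */
static double sqrt_faulty(double z) { return z / 2.0; }

/* Hypothetical version 2: Newton's iteration. */
static double sqrt_newton(double z) {
    if (z <= 0.0) return 0.0;           /* no positive root to iterate toward */
    double x = z > 1.0 ? z : 1.0;
    for (int i = 0; i < 100; i++)
        x = 0.5 * (x + z / x);
    return x;
}

/* Hypothetical version 3: bisection, a different algorithm for the same job. */
static double sqrt_bisect(double z) {
    double lo = 0.0, hi = z > 1.0 ? z : 1.0;
    for (int i = 0; i < 200; i++) {
        double mid = 0.5 * (lo + hi);
        if (mid * mid > z) hi = mid; else lo = mid;
    }
    return 0.5 * (lo + hi);
}

/* Acceptance test from the slide: reject x when |x*x - z| > eps. */
static bool acceptable(double z, double x, double eps) {
    double d = x * x - z;
    if (d < 0.0) d = -d;
    return d <= eps;                     /* also false when d is NaN */
}

typedef double (*sqrt_fn)(double);

/* Run the versions one at a time; keep the first result that passes the test.
   Returns false when every version fails the acceptance test. */
static bool checked_sqrt(const sqrt_fn *versions, size_t n,
                         double z, double eps, double *out) {
    assert(versions != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        double x = versions[i](z);
        if (acceptable(z, x, eps)) {
            *out = x;
            return true;
        }
    }
    return false;
}
```

Called with the versions in the order sqrt_faulty, sqrt_newton, sqrt_bisect and z = 2.0, the faulty result 1.0 fails the test and the Newton result is returned instead. For a negative z no version passes, so the caller is told that the computation failed rather than being handed a wrong answer.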

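The majority-logic scheme described under Hardware Redundancy, together with the N ≥ 2m + 1 rule from the Redundancy slide, can be sketched in C as follows. The function names units_needed and majority_vote are hypothetical, and representing each unit's output as an int compared for exact equality is an assumption made to keep the example short; real units may need a tolerance-based comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Smallest N that can outvote m failed units: N = 2m + 1. */
static size_t units_needed(size_t m) { return 2 * m + 1; }

/* Store in *out the value reported by a strict majority of the n units.
   Returns false when no value has more than n/2 votes. */
static bool majority_vote(const int *outputs, size_t n, int *out) {
    assert(outputs != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        size_t votes = 0;
        for (size_t j = 0; j < n; j++)
            if (outputs[j] == outputs[i])
                votes++;
        if (2 * votes > n) {             /* strictly more than half agree */
            *out = outputs[i];
            return true;
        }
    }
    return false;
}
```

With three units reporting 42, 42, and 7, the voter returns 42 and the single disagreeing unit is masked. With three units that all disagree there is no majority, which corresponds to more than m = 1 failures and has to be treated as a system-level error.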