Acta Informatica 23, 621-642 (1986). © Springer-Verlag 1986

The Reliability of Life-Critical Computer Systems*

Robert Geist (1), Mark Smotherman (1), Kishor Trivedi (2), and Joanne Bechta Dugan (2)

(1) Department of Computer Science, Clemson University, Clemson, SC 29631, USA
(2) Department of Computer Science, Duke University, Durham, NC 27706, USA

* This work was supported in part by NASA grant NAG1-70 and by an equipment grant from the Concurrent Computer Corp.

Summary. In order to aid the designers of life-critical, fault-tolerant computing systems, accurate and efficient methods for reliability prediction are needed. The accuracy requirement implies the need to model the system in great detail, and hence the need to address the problems of large state space, non-exponential distributions, and error analysis. The efficiency requirement implies the need for new model solution techniques, in particular the use of decomposition/aggregation in the context of a hybrid model. We describe a model for reliability prediction which meets both requirements. Specifically, our model is partitioned into fault-occurrence and fault/error-handling submodels, which are represented by non-homogeneous Markov processes and extended stochastic Petri nets, respectively. The overall aggregated model is a stochastic process that is solved by numerical techniques. Methods to analyze the effects of variations in input parameters on the resulting reliability predictions are also provided.

1. Introduction

Fault-tolerant computer systems, that is, systems capable of continued correct operation in the presence of either operational or design faults, are experiencing an ever-increasing range of important applications [2]. When these applications are life-critical, such as in flight control, fault-tolerance becomes the vehicle used to enhance system reliability, that is, the probability that the system will remain operational throughout the mission time.

Nevertheless, the extensive reliability demands placed upon these life-critical systems (reliability $> 1 - 10^{-9}$) preclude simple procedures by which we might verify that the systems possess the level of reliability for which they were designed. As the probability of system failure approaches 0, life-testing and simulation techniques become too expensive to remain feasible [18, 25]. Moreover, standard stochastic models (e.g., Markov, semi-Markov) sufficiently comprehensive to include details of fault/error-handling behavior, as well as details of the fault processes and system structure, are generally intractable. In such models a massive state space (typically $10^5$ states [24]) is often coupled with a collection of state transitions having rate variations of several orders of magnitude. A fault process may exhibit a rate on the order of $10^{-4}$ per hour, whereas a recovery process within the same model could have a rate of $10^6$ per hour. Standard modeling then yields a massive system of stiff differential or integral equations, with the attendant solution difficulties.

As a result, most models used to predict ultra-high reliability (reliability $> 1 - 10^{-9}$), including the one described herein, resort to a behavioral decomposition of the system along temporal lines into nearly disjoint fault-occurrence and fault/error-handling submodels [11, 12, 26, 27]. The fault/error-handling model is solved in isolation, and the resulting effectiveness measures, which are termed coverage factors, are then aggregated with the fault-occurrence behavior to obtain the prediction of system reliability as a function of mission time.
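To make the two-step scheme concrete, here is a minimal sketch in Python under invented assumptions: the fault/error-handling submodel is reduced to a race between exponential recovery (rate delta) and error propagation (rate rho), whose closed-form solution yields a single coverage factor c; c is then aggregated into a three-state duplex fault-occurrence model. None of the numerical rates come from the paper; they merely echo the orders of magnitude quoted above.

    # Sketch: behavioral decomposition with a single coverage factor.
    # All numerical values are illustrative assumptions.
    import numpy as np
    from scipy.integrate import solve_ivp

    # Fault/error-handling submodel, solved in isolation: recovery at
    # rate delta races error propagation at rate rho; the coverage
    # factor c is the probability that recovery wins the race.
    delta = 1.0e6            # recovery rate, per hour (fast process)
    rho = 1.0e2              # error-propagation rate, per hour
    c = delta / (delta + rho)

    # Fault-occurrence model (slow process) for a duplex system.
    # States: 0 = both units good, 1 = one unit lost (covered fault),
    # 2 = system failure.
    lam = 1.0e-4             # per-unit fault rate, per hour
    Q = np.array([
        [-2 * lam, 2 * lam * c, 2 * lam * (1 - c)],
        [0.0,      -lam,        lam],
        [0.0,      0.0,         0.0],   # absorbing system-failure state
    ])

    def forward(t, p):       # Kolmogorov forward equations, p' = p Q
        return p @ Q

    mission = 10.0           # mission time, hours
    sol = solve_ivp(forward, (0.0, mission), [1.0, 0.0, 0.0],
                    rtol=1e-12, atol=1e-14)
    print(f"c = {c:.6f}, R({mission:g} h) = {1.0 - sol.y[2, -1]:.12f}")

Because the fast handling process has been solved separately and enters the slow model only through c, the differential equations that remain to be integrated contain rates of a single order of magnitude and are no longer stiff.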
Of course, this behavioral decomposition leads to certain approximations, and the effects of these approximations, as well as the potential ill effects from other possible sources of modeling error, must be identified and bounded.

The remainder of the paper is organized as follows. In Section 2 we discuss our fault-occurrence model, which describes the fault processes and system structure, as well as the aggregation technique through which we capture the effectiveness measures produced by the fault/error-handling submodel. The latter model is described in Section 3, where we use a framework of Extended Stochastic Petri Nets [5, 6] to capture the process concurrency inherent in fault/error-handling techniques. In Section 4 we discuss error analysis, and in Section 5 we give examples of the implementation of our techniques in HARP (Hybrid Automated Reliability Predictor), now in development jointly at Clemson University and Duke University under the sponsorship of NASA Langley Research Center.

2. The Fault-Occurrence Model

A fault is an erroneous state of hardware or software resulting from physical interference from the environment, failures of components, operator error, or incorrect design [22]. An error is the manifestation of a fault within the system output. Thus a fault may or may not cause an error. Undetected errors, or detected errors from which we cannot effect recovery, may propagate through the system and cause system failure.

The state of a fault-tolerant system will be a vector whose integer coordinates indicate the number of failed components of each type, and whose single binary coordinate indicates whether the system is operational (functioning properly) or failed. We will find it convenient to identify all failed states with a single, absorbing state, which we hereafter term the system failure state.
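As a minimal sketch of this representation (Python; the component types, counts, and names are hypothetical, not from the paper), a state can be encoded as a tuple of per-type failure counts plus the single binary operational coordinate, with every failed state mapped onto the one absorbing system-failure state:

    # Sketch of the state vector described above. Component types and
    # redundancy levels are illustrative assumptions.
    from dataclasses import dataclass
    from typing import Tuple, Union

    FAILURE = "SYSTEM_FAILURE"        # the single absorbing failure state

    @dataclass(frozen=True)
    class State:
        failed: Tuple[int, ...]       # failed components of each type
        operational: bool             # the single binary coordinate

    def canonical(s: State) -> Union[State, str]:
        """Identify all failed states with one absorbing state."""
        return s if s.operational else FAILURE

    # Hypothetical system: tolerates 2 processor failures, 1 bus failure.
    REDUNDANCY = (2, 1)

    def after_fault(s: State, comp_type: int, covered: bool) -> State:
        """State reached when a component of the given type fails."""
        failed = list(s.failed)
        failed[comp_type] += 1
        exhausted = failed[comp_type] > REDUNDANCY[comp_type]
        return State(tuple(failed), covered and not exhausted)

    s0 = State(failed=(0, 0), operational=True)
    print(canonical(after_fault(s0, 0, covered=True)))   # degraded, still up
    print(canonical(after_fault(s0, 0, covered=False)))  # SYSTEM_FAILURE

Both routes to failure, exhausted redundancy and an uncovered fault, land in the same absorbing state.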
Note that transition to the system failure state may result from either an exhaustion of component redundancy or an inability to recover from the effects of a fault.

Consider now the general non-homogeneous Markov process which, suitably restricted, has been the basis for virtually all of the recent efforts to develop useful reliability prediction packages [3, 4, 11, 19, 26]. If we let $P(t) = (P_0(t), \ldots, P_n(t))$ denote the time-dependent probability
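As a minimal sketch of how the state probabilities $P_i(t)$ of such a non-homogeneous process can be computed, assuming the standard forward Kolmogorov equations $P'(t) = P(t)\,Q(t)$ with a time-varying generator $Q(t)$ (the Weibull-shaped hazard rate below is an illustrative assumption, not a parameter from the paper), one integrates the linear ODE system numerically over the mission time. The chain reuses the three-state structure of the earlier sketch, but the fault rate now depends on t, which is precisely what makes the process non-homogeneous.

    # Sketch: numerical solution of P'(t) = P(t) Q(t) for a small
    # non-homogeneous Markov chain. The Weibull hazard is an illustrative
    # assumption; states are (both good, one lost, system failure).
    import numpy as np
    from scipy.integrate import solve_ivp

    def lam(t, scale=1.0e4, shape=2.0):
        """Time-varying (Weibull) fault rate per hour, rising with age."""
        return (shape / scale) * (t / scale) ** (shape - 1.0)

    def Q(t, c=0.999):                # generator with coverage factor c
        l = lam(t)
        return np.array([
            [-2 * l, 2 * l * c, 2 * l * (1 - c)],
            [0.0,    -l,        l],
            [0.0,    0.0,       0.0],
        ])

    def forward(t, p):                # Kolmogorov forward equations
        return p @ Q(t)

    mission = 10.0                    # mission time, hours
    sol = solve_ivp(forward, (0.0, mission), [1.0, 0.0, 0.0],
                    rtol=1e-12, atol=1e-14)
    print("R(T) =", 1.0 - sol.y[2, -1])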
