Acta Informatica 23, 621-642 (1986). © Springer-Verlag 1986

The Reliability of Life-Critical Computer Systems*

Robert Geist (1), Mark Smotherman (1), Kishor Trivedi (2), and Joanne Bechta Dugan (2)

(1) Department of Computer Science, Clemson University, Clemson, SC 29631, USA
(2) Department of Computer Science, Duke University, Durham, NC 27706, USA

* This work was supported in part by NASA grant NAG1-70 and by an equipment grant from the Concurrent Computer Corp.

Summary. In order to aid the designers of life-critical, fault-tolerant computing systems, accurate and efficient methods for reliability prediction are needed. The accuracy requirement implies the need to model the system in great detail, and hence the need to address the problems of large state space, non-exponential distributions, and error analysis. The efficiency requirement implies the need for new model solution techniques, in particular the use of decomposition/aggregation in the context of a hybrid model. We describe a model for reliability prediction which meets both requirements. Specifically, our model is partitioned into fault-occurrence and fault/error-handling submodels, which are represented by non-homogeneous Markov processes and extended stochastic Petri nets, respectively. The overall aggregated model is a stochastic process that is solved by numerical techniques. Methods to analyze the effects of variations in input parameters on the resulting reliability predictions are also provided.

1. Introduction

Fault-tolerant computer systems, that is, systems capable of continued correct operation in the presence of either operational or design faults, are experiencing an ever-increasing range of important applications [2]. When these applications are life-critical, such as in flight control, fault-tolerance becomes the vehicle used to enhance system reliability, that is, the probability that the system will remain operational throughout the mission time.

Nevertheless, the extensive reliability demands placed upon these life-critical systems (reliability $> 1 - 10^{-9}$) preclude simple procedures by which we might verify that the systems possess the level of reliability for which they were designed. As the probability of system failure approaches 0, life-testing and simulation techniques become too expensive to remain feasible [18, 25]. Moreover, standard stochastic models (e.g., Markov, semi-Markov) sufficiently comprehensive to include details of fault/error-handling behavior, as well as details of the fault processes and system structure, are generally intractable. In such models a massive state space (typically $10^5$ states [24]) is often coupled with a collection of state transitions having rate variations of several orders of magnitude. A fault process may exhibit a rate on the order of $10^{-4}$ per hour, whereas a recovery process within the same model could have a rate of $10^6$ per hour. Standard modeling then yields a massive system of stiff differential or integral equations, with the attendant solution difficulties.

As a result, most models used to predict ultra-high reliability (reliability $> 1 - 10^{-9}$), including the one described herein, resort to a behavioral decomposition of the system along temporal lines into nearly disjoint fault-occurrence and fault/error-handling submodels [11, 12, 26, 27]. The fault/error-handling model is solved in isolation, and the resulting effectiveness measures, which are termed coverage factors, are then aggregated with the fault-occurrence behavior to obtain the prediction of system reliability as a function of mission time.
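To make the two-step scheme concrete, here is a minimal sketch in Python under invented assumptions: the fault/error-handling submodel is reduced to a race between exponential recovery (rate delta) and error propagation (rate rho), whose closed-form solution yields a single coverage factor c; c is then aggregated into a three-state duplex fault-occurrence model. None of the numerical rates come from the paper; they merely echo the orders of magnitude quoted above.

    # Sketch: behavioral decomposition with a single coverage factor.
    # All numerical values are illustrative assumptions.
    import numpy as np
    from scipy.integrate import solve_ivp

    # Fault/error-handling submodel, solved in isolation: recovery at
    # rate delta races error propagation at rate rho; the coverage
    # factor c is the probability that recovery wins the race.
    delta = 1.0e6            # recovery rate, per hour (fast process)
    rho = 1.0e2              # error-propagation rate, per hour
    c = delta / (delta + rho)

    # Fault-occurrence model (slow process) for a duplex system.
    # States: 0 = both units good, 1 = one unit lost (covered fault),
    # 2 = system failure.
    lam = 1.0e-4             # per-unit fault rate, per hour
    Q = np.array([
        [-2 * lam, 2 * lam * c, 2 * lam * (1 - c)],
        [0.0,      -lam,        lam],
        [0.0,      0.0,         0.0],   # absorbing system-failure state
    ])

    def forward(t, p):       # Kolmogorov forward equations, p' = p Q
        return p @ Q

    mission = 10.0           # mission time, hours
    sol = solve_ivp(forward, (0.0, mission), [1.0, 0.0, 0.0],
                    rtol=1e-12, atol=1e-14)
    print(f"c = {c:.6f}, R({mission:g} h) = {1.0 - sol.y[2, -1]:.12f}")

Because the fast handling process has been solved separately and enters the slow model only through c, the differential equations that remain to be integrated contain rates of a single order of magnitude and are no longer stiff.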
Of course, this behavioral decomposition leads to certain approximations, and the effects of these approximations, as well as the potential ill effects from other possible sources of modeling error, must be identified and bounded.

The remainder of the paper is organized as follows. In Section 2 we discuss our fault-occurrence model, which describes the fault processes and system structure, as well as the aggregation technique through which we capture the effectiveness measures produced by the fault/error-handling submodel. The latter model is described in Section 3, where we use a framework of Extended Stochastic Petri Nets [5, 6] to capture the process concurrency inherent in fault/error-handling techniques. In Section 4 we discuss error analysis, and in Section 5 we give examples of the implementation of our techniques in HARP (Hybrid Automated Reliability Predictor), now in development jointly at Clemson University and Duke University under the sponsorship of NASA Langley Research Center.

2. The Fault-Occurrence Model

A fault is an erroneous state of hardware or software resulting from physical interference from the environment, failures of components, operator error, or incorrect design [22]. An error is the manifestation of a fault within the system output. Thus a fault may or may not cause an error. Undetected errors, or detected errors from which we cannot effect recovery, may propagate through the system and cause system failure.

The state of a fault-tolerant system will be a vector whose integer coordinates indicate the number of failed components of each type, and whose single binary coordinate indicates whether the system is operational (functioning properly) or failed. We will find it convenient to identify all failed states with a single, absorbing state, which we hereafter term the system failure state.
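As a minimal sketch of this representation (Python; the component types, counts, and names are hypothetical, not from the paper), a state can be encoded as a tuple of per-type failure counts plus the single binary operational coordinate, with every failed state mapped onto the one absorbing system-failure state:

    # Sketch of the state vector described above. Component types and
    # redundancy levels are illustrative assumptions.
    from dataclasses import dataclass
    from typing import Tuple, Union

    FAILURE = "SYSTEM_FAILURE"        # the single absorbing failure state

    @dataclass(frozen=True)
    class State:
        failed: Tuple[int, ...]       # failed components of each type
        operational: bool             # the single binary coordinate

    def canonical(s: State) -> Union[State, str]:
        """Identify all failed states with one absorbing state."""
        return s if s.operational else FAILURE

    # Hypothetical system: tolerates 2 processor failures, 1 bus failure.
    REDUNDANCY = (2, 1)

    def after_fault(s: State, comp_type: int, covered: bool) -> State:
        """State reached when a component of the given type fails."""
        failed = list(s.failed)
        failed[comp_type] += 1
        exhausted = failed[comp_type] > REDUNDANCY[comp_type]
        return State(tuple(failed), covered and not exhausted)

    s0 = State(failed=(0, 0), operational=True)
    print(canonical(after_fault(s0, 0, covered=True)))   # degraded, still up
    print(canonical(after_fault(s0, 0, covered=False)))  # SYSTEM_FAILURE

Both routes to failure, exhausted redundancy and an uncovered fault, land in the same absorbing state.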
Note that transition to the system failure state may result from either an exhaustion of component redundancy or an inability to recover from the effects of a fault.

Consider now the general non-homogeneous Markov process which, suitably restricted, has been the basis for virtually all of the recent efforts to develop useful reliability prediction packages [3, 4, 11, 19, 26]. If we let $P(t) = (P_0(t), \ldots, P_n(t))$ denote the time-dependent probability
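As a minimal sketch of how the state probabilities $P_i(t)$ of such a non-homogeneous process can be computed, assuming the standard forward Kolmogorov equations $P'(t) = P(t)\,Q(t)$ with a time-varying generator $Q(t)$ (the Weibull-shaped hazard rate below is an illustrative assumption, not a parameter from the paper), one integrates the linear ODE system numerically over the mission time. The chain reuses the three-state structure of the earlier sketch, but the fault rate now depends on t, which is precisely what makes the process non-homogeneous.

    # Sketch: numerical solution of P'(t) = P(t) Q(t) for a small
    # non-homogeneous Markov chain. The Weibull hazard is an illustrative
    # assumption; states are (both good, one lost, system failure).
    import numpy as np
    from scipy.integrate import solve_ivp

    def lam(t, scale=1.0e4, shape=2.0):
        """Time-varying (Weibull) fault rate per hour, rising with age."""
        return (shape / scale) * (t / scale) ** (shape - 1.0)

    def Q(t, c=0.999):                # generator with coverage factor c
        l = lam(t)
        return np.array([
            [-2 * l, 2 * l * c, 2 * l * (1 - c)],
            [0.0,    -l,        l],
            [0.0,    0.0,       0.0],
        ])

    def forward(t, p):                # Kolmogorov forward equations
        return p @ Q(t)

    mission = 10.0                    # mission time, hours
    sol = solve_ivp(forward, (0.0, mission), [1.0, 0.0, 0.0],
                    rtol=1e-12, atol=1e-14)
    print("R(T) =", 1.0 - sol.y[2, -1])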
