# The Reliability of Life-Critical Computer Systems


Acta Informatica 23, 621-642 (1986)
© Springer-Verlag 1986

Robert Geist¹, Mark Smotherman¹, Kishor Trivedi², and Joanne Bechta Dugan²

¹ Department of Computer Science, Clemson University, Clemson, SC 29631, USA
² Department of Computer Science, Duke University, Durham, NC 27706, USA

**Summary.** In order to aid the designers of life-critical, fault-tolerant computing systems, accurate and efficient methods for reliability prediction are needed. The accuracy requirement implies the need to model the system in great detail, and hence the need to address the problems of large state space, non-exponential distributions, and error analysis. The efficiency requirement implies the need for new model solution techniques, in particular the use of decomposition/aggregation in the context of a hybrid model. We describe a model for reliability prediction which meets both requirements. Specifically, our model is partitioned into fault occurrence and fault/error handling submodels, which are represented by non-homogeneous Markov processes and extended stochastic Petri nets, respectively. The overall aggregated model is a stochastic process that is solved by numerical techniques. Methods to analyze the effects of variations in input parameters on the resulting reliability predictions are also provided.

## 1 Introduction

Fault-tolerant computer systems, that is, systems capable of continued correct operation in the presence of either operational or design faults, are experiencing an ever-increasing range of important applications [2]. When these applications are life-critical, such as in flight control, fault tolerance becomes the vehicle used to enhance system reliability, that is, the probability that the system will remain operational throughout the mission time. Nevertheless, the extensive reliability demands placed upon these life-critical systems (reliability $> 1 - 10^{-9}$) preclude simple procedures by which we might verify that the systems possess the level of reliability for
which they were designed. As the probability of system failure approaches 0, life testing and simulation techniques become too expensive to remain feasible [18, 25]. Moreover, standard stochastic models (e.g., Markov, semi-Markov) sufficiently comprehensive to include details of fault/error handling behavior, as well as details of the fault processes and system structure, are generally intractable. In such models a massive state space (typically $10^5$ states [24]) is often coupled with a collection of state transitions having rate variations of several orders of magnitude. A fault process may exhibit a rate on the order of $10^{-4}$/h, whereas a recovery process within the same model could have a rate of $10^{6}$/h. Standard modeling then yields a massive system of stiff differential or integral equations, with the attendant solution difficulties.

This work was supported in part by NASA grant NAG1-70 and by an equipment grant from the Concurrent Computer Corp.

As a result, most models used to predict ultra-high reliability (reliability $> 1 - 10^{-9}$), including the one described herein, resort to a behavioral decomposition of the system along temporal lines into nearly disjoint fault occurrence and fault/error handling submodels [11, 12, 26, 27]. The fault/error handling model is solved in isolation, and the resulting effectiveness measures, which are termed coverage factors, are then aggregated with the fault occurrence behavior to obtain the prediction of system reliability as a function of mission time. Of course, this behavioral decomposition leads to certain approximations, and the effects of these approximations, as well as the potential ill effects from other possible sources of modeling error, must be identified and bounded.

The remainder of the paper is organized as follows. In Section 2 we discuss our fault occurrence model, which describes the fault processes and system structure, as well as the aggregation technique through which we capture the effectiveness measures produced by the fault/error
handling submodel. The latter model is described in Section 3, where we use a framework of Extended Stochastic Petri Nets [5, 6] to capture the process concurrency inherent in fault/error handling techniques. In Section 4 we discuss error analysis, and in Section 5 we give examples of the implementation of our techniques in HARP (Hybrid Automated Reliability Predictor), now in development jointly at Clemson University and Duke University under the sponsorship of NASA Langley Research Center.

## 2 The Fault Occurrence Model

A fault is an erroneous state of hardware or software resulting from physical interference from the environment, failures of components, operator error, or incorrect design [22]. An error is the manifestation of a fault within the system output. Thus a fault may or may not cause an error. Undetected errors, or detected errors from which we cannot effect recovery, may propagate through the system and cause system failure.

The state of a fault-tolerant system will be a vector whose integer coordinates indicate the number of failed components of each type and whose single binary coordinate indicates whether the system is operational (functioning properly) or failed. We will find it convenient to identify all failed states with a single absorbing state, which we hereafter term the system failure state. Note that transition to the system failure state may result from either an exhaustion of component redundancy or an inability to recover from the effects of a fault.

Consider now the general non-homogeneous Markov process which, suitably restricted, has been the basis for virtually all of the recent efforts to develop useful reliability prediction packages [3, 4, 11, 19, 26]. If we let $P(t) = (P_0(t), \ldots, P_n(t))$ denote the time-dependent probability vector for operational (non-system-failure) states $0, \ldots, n$, and let $A(t) = [a_{ij}(t)]$ denote the associated matrix of state transition rates, where $a_{ij}$ denotes the rate of transition from state $i$ to state $j$, then the
Markovian assumption yields the set of differential equations $P'(t) = P(t)A(t)$, with equivalent integral formulation

$$P_j(t) = \sum_{i \neq j} \int_0^t P_i(x)\, a_{ij}(x)\, e^{\int_x^t a_{jj}(u)\,du}\, dx, \tag{1}$$

where the diagonal entry $a_{jj}$ is the negative of the total rate of departure from state $j$, and whose solution gives the system reliability $R(t) = \sum_i P_i(t)$. Without loss of generality, we hereafter assume that the operational state $i$ is merely a non-negative integer, and that the only state transitions are those to the state $i + 1$ (one more covered fault) and to the system failure state (one more uncovered fault). We then have

$$P_i(t) = \int_0^t P_{i-1}(x)\, a_{i-1,i}(x)\, e^{\int_x^t a_{ii}(u)\,du}\, dx, \tag{2}$$

where $i > 0$.
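
As a rough illustration of the restricted chain described above (each operational state $i$ moves only to state $i + 1$ on a covered fault, or to the system failure state on an uncovered fault), the following Python sketch integrates $P'(t) = P(t)A(t)$ with a classical fourth-order Runge-Kutta step and returns $R(t) = \sum_i P_i(t)$. It is not taken from the paper or from HARP: the function names, the single constant `coverage` probability, the per-unit hazard function, the Weibull fault law, and every numeric value are assumptions chosen for illustration. An explicit integrator is plausible here only because the fast fault/error handling behavior is assumed to have been collapsed into the coverage value; it would be a poor choice for the stiff, undecomposed model.

```python
import math


def weibull_hazard(scale, shape):
    """Assumed per-unit fault law: h(t) = shape * scale * (scale * t)**(shape - 1).

    Use shape >= 1 so that the hazard is finite at t = 0.
    """
    def h(t):
        return shape * scale * (scale * t) ** (shape - 1)
    return h


def reliability(t_end, n_units, min_units, coverage, hazard, steps=2000):
    """Integrate dP/dt = P A(t) for the chain 0 -> 1 -> ... with a failure sink.

    State i counts covered faults so far. In state i, each of the
    (n_units - i) working units faults at rate hazard(t). A fault is covered
    with probability `coverage` (move to i + 1); otherwise the system fails.
    In the last operational state any further fault exhausts redundancy.
    """
    n_states = n_units - min_units + 1  # operational states 0 .. n_states - 1

    def deriv(t, p):
        h = hazard(t)
        dp = [0.0] * n_states
        for i in range(n_states):
            out_rate = (n_units - i) * h      # total exit rate from state i
            dp[i] -= out_rate * p[i]
            if i + 1 < n_states:              # covered fault: i -> i + 1
                dp[i + 1] += coverage * out_rate * p[i]
        return dp

    p = [1.0] + [0.0] * (n_states - 1)        # start fault-free in state 0
    dt = t_end / steps
    for k in range(steps):
        t = k * dt
        k1 = deriv(t, p)
        k2 = deriv(t + dt / 2, [x + dt / 2 * d for x, d in zip(p, k1)])
        k3 = deriv(t + dt / 2, [x + dt / 2 * d for x, d in zip(p, k2)])
        k4 = deriv(t + dt, [x + dt * d for x, d in zip(p, k3)])
        p = [x + dt / 6 * (a + 2 * b + 2 * c + d)
             for x, a, b, c, d in zip(p, k1, k2, k3, k4)]
    return sum(p)                             # R(t_end) = sum_i P_i(t_end)


# Hypothetical example: 3 units, at least 2 required, 10-hour mission.
r = reliability(10.0, n_units=3, min_units=2, coverage=0.999,
                hazard=weibull_hazard(scale=1e-4, shape=1.5))
print(f"R(10 h) ~ {r:.12f}")
```

With a constant hazard and perfect coverage the sketch can be checked against closed forms: a single unit gives $e^{-\lambda t}$, and two units with one required give $2e^{-\lambda t} - e^{-2\lambda t}$. Lowering `coverage` below 1 diverts part of each fault's probability mass directly to the failure state, which is the effect the coverage factors are meant to capture.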