CprE 558: Real-Time Systems
Lectures 15-16: Dependability Concepts & Fault-Tolerance
CprE 545 Iowa State University

Outline
    Dependable System
    Why Dependability?
    Example for Dependable Systems
    Attributes of Dependability
    Approaches to Achieving Dependability
    Fault Avoidance
    Fault Removal
    Fault Tolerance
    Fault Forecasting
    Achieving Dependability - Summary
    Fault, Error, and Failure
    Fault, Error, and Failure (Contd.)
    Example of a fault, an error, and a failure
    Fault
    Fault types - Based on Duration
    Fault types - Based on Duration (Contd.)
    Fault Types - Based on Nature of Output
    Fail-stop Unit
    Fault Types - Based on Correlation
    Software Faults
    Error
    Error Recovery
    Failure
    Failure (Contd.)
    Classes of faults, errors, and failures
    Load and Fault Hypothesis
    Graceful Degradation
    Dependability Measures - Quantitative
    Reliability
    Availability
    Dependability Measures - Qualitative
    Dependability Measures - Qualitative (Cont’d)

Dependable System

A system is dependable when it is trustworthy enough that reliance can be placed on the service that it delivers.
For a system to be dependable, it must be:
    Available - e.g., ready for use when we need it.
    Reliable - e.g., able to provide continuity of service while we are using it.
    Safe - e.g., does not have a catastrophic consequence on the environment.
    Secure - e.g., able to preserve confidentiality.

Why Dependability?

With a greater reliance on computers in a variety of safety-critical applications, the consequences of failure and down time have become more severe.
For example, in safety-critical applications - such as flight control, medical life support, process control, telecommunication switching, and on-line transaction processing systems - failure of computing resources can cost lives and/or money.

Example for Dependable Systems

The reliability figure usually stated as a goal for computer systems in commercial aircraft is less than 10^{-9} failures per hour.
Modern telephone switching systems achieve a down time of at most one hour in 40 years.
Medical life support system.
Command and control systems.
Process control applications.

Attributes of Dependability

[Figure: depend.fig (image not available)]

Approaches to Achieving Dependability

    Fault Avoidance - how to prevent, by construction, the fault occurrence or introduction.
    Fault Removal - how to minimize, by verification, the presence of faults.
    Fault Tolerance - how to provide, by redundancy, a service complying with the specification in spite of faults.
    Fault Forecasting - how to estimate, by evaluation, the presence, the creation, and the consequence of faults.

Fault Avoidance

Fault avoidance uses various tools and techniques to design the system in such a manner that the introduction of faults is
minimized.
A fault avoided is one that does not have to be dealt with at a later time.
Techniques used include design methodologies, verification and validation methodologies, modeling, and code inspections and walk-throughs.

Fault Removal

Fault removal uses verification and testing techniques to locate faults, enabling the necessary changes to be made to the system.
The techniques include unit testing and integration testing.
It is generally much more expensive to remove a fault than to avoid a fault.

Fault Tolerance

A system built with fault-tolerance capabilities will manage to keep operating, perhaps at a degraded level, in the presence of faults. In other words, fault tolerance is informally defined as the ability of a system to deliver the expected service even in the presence of faults.
For a system to be fault-tolerant, it must be able to detect, diagnose, confine, mask, compensate for, and recover from faults.

Fault Forecasting

It is possible to observe the behavior of a system and use this information to take action to compensate for faults before they occur.
When a system deviates from its normal behavior, even if the behavior continues to meet system specifications, it may be appropriate to reconfigure the system to reduce the stress on a component with a high failure potential.

Achieving Dependability - Summary

Fault avoidance and fault tolerance may be seen as constituting dependability procurement: how to provide the system with the ability to deliver the specified service.
Fault removal and fault forecasting may be seen as constituting dependability validation: how to reach confidence in the system's ability to deliver the specified service.

Fault, Error, and Failure

A fault is a deviation in a hardware or software component from its intended function.
An error is a manifestation of a fault in a system, in which
the logical state of an element differs from its intended value.
The time between fault occurrence and the first appearance of an error is called the fault latency.
The time between the occurrence of an error and its detection is called the error latency.

Fault, Error, and Failure (Contd.)

When the fault-tolerance mechanisms detect an error, they may initiate several actions to handle the fault and contain its errors.
Recovery occurs if these actions are successful; otherwise, the system eventually malfunctions and a failure occurs.

Example of a fault, an error, and a failure

[Figure: fault1.fig (image not available)]

Fault

Faults can arise during all stages in a computer system's evolution - specification, design, development, manufacturing, assembly, and installation - and throughout its operational life.
Most faults that occur before full system deployment are discovered through testing and eliminated.
Faults that are not removed can reduce a system's dependability when it is in the field.
A fault can be classified by