Toward Monitoring Fault-Tolerant Embedded Systems


APPEARS IN THE 1ST INTL. WORKSHOP ON SOFTWARE HEALTH MANAGEMENT (SHM 2009), 2009.

Toward Monitoring Fault-Tolerant Embedded Systems (Extended Abstract)

Alwyn Goodloe and Lee Pike

1 INTRODUCTION

Flight-critical systems for aircraft and spacecraft must be ultra-reliable and so are designed to be fault-tolerant. For embedded control systems and data buses, a primary means for achieving fault-tolerance is hardware replication to provide redundancy so that the system can survive random hardware faults of individual components. Nevertheless, a system may fail to meet its reliability requirements for one of three reasons: (1) the system suffers an unexpectedly high number of hardware faults, (2) hardware faults lead to unexpected system-level failures, or (3) software or design bugs result in systematic faults. Indeed, failures may result from a combination of these reasons (e.g., an unanticipated hardware fault triggers the execution of fault-management software that is incorrectly designed, leading to system failure).

A monitor [1], [2] is a runtime verification mechanism that observes the behavior of a system and detects whether it is consistent with its specified correct behavior. Ultra-reliable systems stand to benefit from runtime monitors, if monitors can be constructed to increase their reliability. However, ultra-reliable systems, which are often distributed, real-time, fault-tolerant systems, have largely been ignored by the monitoring community.
The challenges associated with monitoring this class of systems include the following:
• How to ensure the monitor does not interfere with the monitored system delivering its services and meeting its real-time deadlines.
• How to ensure the monitoring infrastructure does not reduce the reliability of the monitored system.
• How to monitor a fault-tolerant system, since fault-tolerant systems are distributed to provide replication.
• How to monitor for faults.

We propose that these are all important open research questions.

In this extended abstract, we motivate the need for runtime monitoring for ultra-reliable systems. We motivate this need by first presenting an example of a failure in the Space Shuttle’s data processing system. The system was ostensibly designed with best practices in mind yet still managed to suffer a failure. After presenting the motivating example, we touch on some of these open questions.

Alwyn Goodloe is with the National Institute of Aerospace. Email: [email protected]
Lee Pike is with Galois, Inc. Email: [email protected]

2 FAILURE IN THE SPACE SHUTTLE

The Space Shuttle’s data processing system has four general purpose computers (GPCs) that operate in a redundant set. There are also twenty-three multiplexer/demultiplexer (MDM) units aboard the orbiter, sixteen of which are directly connected to the GPCs via redundant shared buses. Each of these MDMs receives commands from guidance, navigation, and control (GNC) running on the GPCs and acquires the requested data from sensors attached to it, which is then sent back to the GPCs. In addition to their role in multiplexing/demultiplexing data, these MDM units perform analog/digital conversion.

Fig. 1. Shuttle Data Processing System (GPCs and FA2)

The GPCs execute redundancy-management algorithms that include a fault detection, isolation, and recovery (FDIR) function.
During the launch of shuttle flight Space Transportation System 124 (STS-124), there was a report of a pre-launch failure of the fault diagnosis software caused by a “non-universal I/O error” in the second flight aft (FA2) MDM [3], which is polled by the GPCs as shown in Figure 1. According to reports [3], [4], the events unfolded as follows:
• A diode failed on the serial multiplexer interface adapter of the FA2 MDM.
• GPC 4 received erroneous data from FA2. Each node voted and viewed GPC 4 as providing faulty data. Hence GPC 4 was voted out of the redundant set.
• Three seconds later, GPC 2 also received erroneous data from FA2. In this case, GPC 2 was voted out of the redundant set.
• In accordance with the Space Shuttle flight rules [5], GPC 2 and GPC 4 were powered down.
• GPC 3 then read FA2’s built-in test equipment and determined that it was faulty, at which point GPC 3 too was removed from the redundant set, leaving only GPC 1. At that time, engineers terminated the work; the problem with FA2 was isolated and the unit replaced.

The above set of events sequentially removed good GPC nodes, but failed to detect and act on the faulty MDM.

3 BYZANTINE FAULTS

Based on the motivating example we have just provided, we discuss approaches to monitoring fault-tolerant systems.

3.1 Classifying Faults

Faults can be classified according to the hybrid fault model of Thambidurai and Park [6]. The classification is based on the observable behavior of a node, ranging from easy-to-detect faults, like fail-silent nodes that fail to respond to pings within a nominal time frame, to more nefarious faults. A particularly nefarious class of faults is asymmetric or Byzantine faults, in which a node sends different messages to different receivers when the expectation is that the node should broadcast the same message to all receivers [7].

Byzantine faults are often transient rather than permanent, making them difficult to reproduce.
For example, a transmitter failing to drive a signal sufficiently high or low can produce Byzantine faults in which different receivers observe a broadcast message differently; timing errors in real-time systems can also produce Byzantine faults [8]. In part because of this transience, system engineers underestimate the probability of Byzantine faults.

3.2 Reconsidering the Space Shuttle Failure

A maximum fault assumption (MFA) for a system characterizes the maximum kind, number, and arrival rate of faults under which the system is hypothesized to operate correctly. If the MFA is violated, the system’s assumptions about its environment are violated, and the system may behave arbitrarily.

The Space Shuttle incident is not isolated: the preliminary findings on the A330 in-flight upset exhibited an asymmetric fault [9]. In the case of both the Space Shuttle and the A330, it appears that the systems were designed to satisfy an MFA that did not accommodate asymmetric faults.

It is conceivable (we are speculating here) that the designers chose a fault model that excludes asymmetric faults because the designers judged that the probability of their occurrence
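
To make the voting sequence of Section 2 and the asymmetric faults of Section 3.1 concrete, the following Python sketch contrasts a naive majority vote among receivers with a check on the sender. It is a toy model under stated assumptions, not the Shuttle's actual redundancy-management logic: the function names `vote_out` and `classify_sender`, the label "consistent", and the data values 42 and 17 are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional

# Hypothetical toy model: each receiver (GPC) records the value it read from
# one sender (an MDM such as FA2) in a single round, or None if nothing arrived.
Observations = Dict[str, Optional[int]]


def vote_out(observations: Observations) -> List[str]:
    """Naive receiver vote: blame every receiver that disagrees with the majority."""
    counts = Counter(observations.values())
    majority_value, _ = counts.most_common(1)[0]
    return sorted(r for r, v in observations.items() if v != majority_value)


def classify_sender(observations: Observations) -> str:
    """Classify the sender from what the receivers observed in one round.

    'fail-silent': no receiver got a value in time.
    'asymmetric':  receivers observed different things (including the case
                   where some got a value and others got nothing).
    'consistent':  all receivers observed the same value (an assumed label;
                   the value may still be wrong, which this check cannot tell).
    """
    values = list(observations.values())
    if all(v is None for v in values):
        return "fail-silent"
    if len(set(values)) > 1:
        return "asymmetric"
    return "consistent"


# Illustrative round loosely modeled on the first step of the incident:
# GPC4 reads a different value from FA2 than the other three GPCs.
round1: Observations = {"GPC1": 42, "GPC2": 42, "GPC3": 42, "GPC4": 17}

print(vote_out(round1))         # ['GPC4']
print(classify_sender(round1))  # asymmetric
```

In this toy round the vote blames GPC4, while comparing the receivers' observations points at the sender. That is the distinction a monitor for asymmetric faults would need to draw.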

