Page: 1 © 2003 A.W. Krings CS449/549 Fault-Tolerant Systems Sequence 29 SIFT – SIFT = Software Implemented Fault Tolerance » Ultra reliable computer for critical aircraft control applications. (desired probability of failure < 10-10/h over 10h mission) » Developed by SRI International for NASA Langley Research Center » Major goal was the application of formal verification of the critical functions achieve fault-tolerance as much as possible by programs » Byzantine General Problem discovery » Objective for operating system was to hide the fault-tolerant mechanism from the application » Proof-of-concept demonstrator => bad news: 70-80% overhead (due to executives)Page: 2 © 2003 A.W. Krings CS449/549 Fault-Tolerant Systems Sequence 29 SIFT System Overview – structure of the SIFT system Wen78 fig.1Page: 3 © 2003 A.W. Krings CS449/549 Fault-Tolerant Systems Sequence 29 SIFT System Overview (cont.) – loosely coupled multiprocessor system » implements message passing » no shared memory – up to 8 processors » Bendix BDX930 processors, 32kB RAM, λ approx. 10-3 – general processors and I/O processors » have private memory » I/O procs. have less RAM – complete interconnectionPage: 4 © 2003 A.W. Krings CS449/549 Fault-Tolerant Systems Sequence 29 SIFT SIFT Operating System Functions – scheduling » OS must cause application tasks to be executed at proper time. – synchronization » OS must keep processors moderately synchronized (within 50µs) – consistency » OS must implement voting at input to process – communication » results of tasks must be transmitted to all processors in timely manner, i.e. shared resultsPage: 5 © 2003 A.W. Krings CS449/549 Fault-Tolerant Systems Sequence 29 SIFT – fault masking » erroneous data, generated by faulty processor, must not be allowed to propagate, and source of erroneous data should be identified » => NMR – reconfiguration » OS must be able to detect a faulty unit and reconfigure » => graceful degradation (hybrid NMR) – input/output » OS controls received data from sensors and transmitted data to actuators » separate busses => sensors possibly act as faulty generalPage: 6 © 2003 A.W. Krings CS449/549 Fault-Tolerant Systems Sequence 29 SIFT Fault Isolation – provided at the boundaries between processors and buses – processor can read memory of other processors, but can write only to its own memory. – a non-faulty processor can obtain bad data due to bad bus – invalid control signals should not produce incorrect behavior of non-faulty processorsPage: 7 © 2003 A.W. Krings CS449/549 Fault-Tolerant Systems Sequence 29 SIFT Fault Masking – faulty units can provide processors with bad data – tasks place their outputs into a pre-broadcast buffer – this data is then broadcast to other processors – values from all processors are collected in the pre-vote buffer – the voted majority is then placed into the post-vote buffer – buffer management » 8 x 128 word data buffers » 1 buffer per processor » any message received from Processor i goes to Buffer i » each message is tagged with ID (0,...127) => 128 shared variable namesPage: 8 © 2003 A.W. Krings CS449/549 Fault-Tolerant Systems Sequence 29 SIFT buffer management and votingPage: 9 © 2003 A.W. Krings CS449/549 Fault-Tolerant Systems Sequence 29 SIFT Voter – Voting on exact match basis => bit-by-bit majority voting – If the voter cannot find a majority » a default value specified by the application is selected – After identifying faulty unit, system is reconfigured – The number of processors executing a task can vary with the task » dynamic configuration of task redundancy » takes advantage of the fact that in flight control particular tasks are more critical than othersPage: 10 © 2003 A.W. Krings CS449/549 Fault-Tolerant Systems Sequence 29 SIFT Scheduler - outputs must be generated with specified frequency - delay between reading and outputting result must be bounded – table driven, making use of several independent schedules: » task schedule list of tasks to execute, how often, and when. » Pre-vote schedule list of buffers to be moved and data-file to the pre-vote buffer, how often they should be moved, an when. » voter schedule list of buffers to be voted on, how often should be voted and when. » broadcast schedule list of buffers to be broadcast, how often, and when. » copy schedule list of buffers to be copied from one slot in the post-vote buffer to another.Page: 11 © 2003 A.W. Krings CS449/549 Fault-Tolerant Systems Sequence 29 SIFT – Frame based, periodic schedulingPage: 12 © 2003 A.W. Krings CS449/549 Fault-Tolerant Systems Sequence 29 SIFT – Scheduling strategies considered: » non-preemptive fixed priority scheduling did not allow sufficient flexibility » priority scheduling tasks with fastest iteration rate are given highest priority works if utilization is below ln(2) » earliest deadline first not useful, since task durations are very short » simply periodic scheduling all periods are multiples of smaller periods – SIFT implements simple periodic scheduling with integer multiplicity of subframes.Page: 13 © 2003 A.W. Krings CS449/549 Fault-Tolerant Systems Sequence 29 SIFT – Frame base scheduling » each task is assigned to one of several priority levels » each priority level corresponds to an iteration rate » each iteration rate is an integer multiple of the next lower one » subframes are 1.6 to 3.2 ms long » frames are up to 50 subframes long (< 160ms) » each task must fit into one subframe left time is used for background diagnostic task » results are available in next subframe » schedule is table driven and static » at subframe boundary suspend task broadcast all data from last frame invoke voting task start scheduled taskPage: 14 © 2003 A.W. Krings CS449/549 Fault-Tolerant Systems Sequence 29 SIFT Clock Synchronization – observed scenario triggered
View Full Document