Guest lecture, UC Berkeley EECS 149, 13 April 2009

Safety, Fault-tolerance, Verification, and Certification for Embedded Systems

John Rushby
Computer Science Laboratory
SRI International
Menlo Park CA USA

John Rushby, SRI    Safety etc.: 1

Overview
• It’s pretty hard to get embedded systems working at all
• But many embedded systems are used in contexts where failures are really bad news
    Expensive: e.g., Prius recalls
    Catastrophic (to the mission): e.g., crash of Mars Polar Lander, several others
    Dangerous/Deadly: e.g., violent pitching of VH-QPA
• Because hardware can fail, critical systems often must be fault tolerant
• This adds complexity, and the mechanisms for fault tolerance often become the leading cause of failures
• We’ll look at some of these issues, starting with sensors, then computation, then actuators

Sensors: Violent Pitching of VH-QPA
• An Airbus A330 en route from Singapore to Perth on 7 October 2008
• Started pitching violently, unrestrained passengers hit the ceiling, 12 serious injuries, so counts as an accident
• Three Angle Of Attack (AOA) sensors, one on left (#1), two on right (#2, #3) of airplane nose
• Want to get a consensus good value
• Have to deal with inaccuracies, different positions, gusts/spikes, failures

A330 AOA Sensor Processing
• Sampled at 20 Hz
• Compare each sensor to the median of the three
• If difference is larger than some threshold for more than 1 second, flag as faulty and ignore for remainder of flight
• Assuming all three are OK, use mean of #1 and #2 (because they are on different sides)
• If the difference between #1 or #2 and the median is larger than some (presumably smaller) threshold, use previous average value for 1.2 seconds
• Failure scenario: two spikes, first shorter than 1 second, second still present 1.2 seconds after detection of first
• Spike gets passed through rate limiter, flight envelope protections activate inappropriately

Another Example: X29
• Three
sources of air data: a nose probe and two side probes
• Selection algorithm used the data from the nose probe, provided it was within some threshold of the data from both side probes
• The threshold was large to accommodate position errors in certain flight modes
• If the nose probe failed to zero at low speed, it would still be within the threshold of correct readings, causing the aircraft to become unstable and “depart”
• Found in simulation
• 162 flights had been at risk

Sensor Processing: Analysis
• This is a difficult issue and there’s no completely satisfactory solution known (good research problem)
• Most algorithms are complex and homespun
• My hunch is that it could be better to deal separately with inaccuracies, position errors, gusts/spikes, failures
• Possible approach: intelligent sensor communicates an interval, not a point value
• Width of interval indicates confidence, health

Sensor Fusion: Marzullo’s Algorithm
Axiom: if sensor is nonfaulty, its interval contains the true value
Observation: true value must be in overlap of nonfaulty intervals
Consensus (fused) Interval: to tolerate f faults in n, choose interval that contains all overlaps of n − f; i.e., from least value contained in n − f intervals to largest value contained in n − f
Eliminating faulty samples: separate problem, not needed for fusing, but any sample disjoint from the fused interval must be faulty

True Value In Overlap of Nonfaulty Intervals
[Figure: sensor intervals S(1), S(2), S(3), S(4)]

Marzullo’s Fusion Interval
[Figure: sensor intervals S(1), S(2), S(3), S(4)]

Marzullo’s Fusion Interval: Fails Lipschitz Condition
[Figure: sensor intervals S(1), S(2), S(3), S(4)]

Schmid’s Fusion Interval
• Choose interval from f + 1’st largest lower bound to f + 1’st smallest upper bound
• Optimal among selections that satisfy Lipschitz Condition

Schmid’s Fusion Interval
[Figure: sensor intervals S(1), S(2), S(3), S(4)]
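
The two fusion rules stated on the slides can be sketched in a few lines of Python. This is a minimal illustration, assuming closed intervals given as (low, high) pairs; the function names (coverage, marzullo, schmid) and the sample intervals are made up for this sketch and do not come from the lecture. The second data set shrinks one interval slightly to show the kind of jump the Lipschitz slide refers to.

```python
from typing import List, Optional, Tuple

Interval = Tuple[float, float]  # closed interval (low, high); an assumption of this sketch


def coverage(intervals: List[Interval], x: float) -> int:
    """Number of intervals that contain the point x."""
    return sum(1 for lo, hi in intervals if lo <= x <= hi)


def marzullo(intervals: List[Interval], f: int) -> Optional[Interval]:
    """From the least to the largest value contained in at least n - f intervals.

    Returns None when no point lies in n - f intervals (more than f faults).
    """
    need = len(intervals) - f
    # With closed intervals, the extreme qualifying points are interval endpoints.
    lows = [lo for lo, _ in intervals if coverage(intervals, lo) >= need]
    highs = [hi for _, hi in intervals if coverage(intervals, hi) >= need]
    if not lows or not highs:
        return None
    return (min(lows), max(highs))


def schmid(intervals: List[Interval], f: int) -> Interval:
    """From the (f+1)'st largest lower bound to the (f+1)'st smallest upper bound."""
    lowers = sorted((lo for lo, _ in intervals), reverse=True)
    uppers = sorted(hi for _, hi in intervals)
    return (lowers[f], uppers[f])


# Hypothetical readings from four sensors, tolerating f = 1 fault.
base = [(4, 10), (4, 10), (0, 4), (6, 10)]
nudged = [(4, 10), (4, 10), (0, 3.9), (6, 10)]  # third interval shrunk by 0.1

if __name__ == "__main__":
    for name, data in (("base", base), ("nudged", nudged)):
        print(name, "Marzullo:", marzullo(data, 1), "Schmid:", schmid(data, 1))
```

With this made-up data, shrinking the third interval by 0.1 moves the lower bound of Marzullo's interval from 4 to 6, while Schmid's interval stays at (4, 10).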
Compute: Fuel Emergency on G-VATL
• An Airbus A340 en route from Hong Kong to London on 8 February 2005
• Toward the end of the flight, two engines flamed out, crew found certain tanks were critically low on fuel, declared an emergency, landed at Amsterdam
• Two Fuel Control Monitoring Computers (FCMCs) on this type of airplane; they cross-compare and the “healthiest” one drives the outputs to the data bus
• Both FCMCs had fault indications, and one of them was unable to drive the data bus
• Unfortunately, this one was judged the healthiest and was given control of the bus even though it could not exercise it
• Further backup systems were not invoked because the FCMCs indicated they were not both failed

Computational Redundancy: Analysis
• This is a big topic, several approaches
    Self-checking pairs: two computers cross-compare, shut down on disagreement, then another pair takes over (more later)
    N-modular redundancy: N computers vote on a consensus
        ◦ Exact-match voting, or averaging?
        ◦ Synchronized or unsynchronized?
• The separate computers are generally called channels
• Axiom: failures are independent
• Requires they are separate Fault Containment Units (FCUs)
        ◦ Physically separate
        ◦ Separate power, cooling, etc.

Unsynchronized Designs (e.g., F16)
• Channels sample sensors independently, compute independently
• Intuitively maximizes diversity, independence
• But cannot expect outputs to match exactly, so need selection, or averaging, as with sensors
• Tends to produce homespun solutions
• Outputs depend on time integrated values (e.g., velocity, position)
        ◦ Accumulated errors are compounded by clock drift
        ◦ So must exchange and vote integrator values
        ◦ Requires ad-hoc synchronization in the applications code
• Redundancy management pervades applications code (as much as 70% of the code)

Unsynchronized Designs (e.g., F16)
[Figure: three sensor blocks, three compute blocks, one actuator]
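
The remark that unsynchronized channels "cannot expect outputs to match exactly" can be illustrated with a small Python sketch. The voter below, the function name exact_match_vote, and the channel outputs are all hypothetical; the sketch assumes a non-empty list of outputs and a strict-majority rule, which the slides do not specify.

```python
from collections import Counter
from statistics import mean
from typing import Optional, Sequence


def exact_match_vote(outputs: Sequence[float]) -> Optional[float]:
    """Return the value produced by a strict majority of channels, else None.

    Assumes outputs is non-empty.
    """
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


# Hypothetical synchronized channels: healthy ones agree bit for bit,
# so the one faulty output (17.0) is simply outvoted.
synchronized = [42.0, 42.0, 17.0]

# Hypothetical unsynchronized channels: all healthy, but each sampled its
# sensor at a slightly different moment, so no two outputs are identical.
unsynchronized = [42.01, 41.99, 42.0]

if __name__ == "__main__":
    print(exact_match_vote(synchronized))    # a majority exists
    print(exact_match_vote(unsynchronized))  # no majority, although nothing failed
    print(mean(unsynchronized))              # averaging is one alternative
```

Exact-match voting finds no majority among the unsynchronized outputs even though no channel is faulty, which is why such designs fall back on selection or averaging.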
Problems with Unsynchronized Designs
• Output selection can induce large transients (cf. Lipschitz)
        ◦ Averaging functions dragged along by faulty values
        ◦ Exclusion on fault detection causes drastic change
• Mode switches can cause channel divergence
        ◦ IF x > 100 THEN . . . ELSE . . .
          [Figure: plot against Time with threshold 100, annotated “change of mode here”]
        ◦ Output very sensitive to sample when near decision
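
The mode-switch problem can be sketched in Python as follows. Everything here is an assumption made for illustration: the ramp signal, the sampling times, the 0.02 time-unit skew between channels, and the branch labels. Only the test x > 100 comes from the slide.

```python
def signal(t: float) -> float:
    """Hypothetical sensed quantity that ramps through the threshold 100 at t = 10."""
    return 10.0 * t


def mode(x: float) -> str:
    """The decision from the slide: IF x > 100 THEN ... ELSE ..."""
    return "THEN-branch" if x > 100 else "ELSE-branch"


SKEW = 0.02  # hypothetical sampling offset between two unsynchronized channels

# Near the threshold: the channels sample on either side of the crossing.
t_a = 9.99
t_b = t_a + SKEW
mode_a = mode(signal(t_a))
mode_b = mode(signal(t_b))

# Far from the threshold: the same skew makes no difference.
far_a = mode(signal(5.0))
far_b = mode(signal(5.0 + SKEW))

if __name__ == "__main__":
    print("near threshold:", mode_a, mode_b)
    print("far from threshold:", far_a, far_b)
```

Both channels are healthy, yet near the threshold they take different branches and their later outputs may diverge; away from the threshold the same skew is harmless.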
