MIT 16 070 - Failure Reports

Failure Reports
16.070, Lecture #34
Fesq, 5/9/01

• Examine failure reports due to software
  - Mars Pathfinder
  - Ariane V
  - Mars Polar Lander
  - Therac 25

Mars Pathfinder Priority Inversion

• A few days after landing, spacecraft began experiencing total system resets, each resulting in losses of data
• Pathfinder contained an "information bus" -- shared memory area used to pass info between tasks
  - Access to bus was synchronized with mutual exclusion locks (mutexes)
  - Bus mgmt task: frequent, high priority to move data in/out of info bus
  - Meteorological data gathering task: infrequent, low priority. Acquires info bus mutex, writes to bus, then releases mutex
  - If interrupt caused bus mgmt task to be scheduled while mutex was held, and if bus mgmt task then attempted to acquire same mutex, bus mgmt task is blocked, waiting until meteorological task releases mutex
  - Problem: Medium priority communications task (long-running) scheduled while high-priority bus mgmt task is blocked.

Flight #1 of the Ariane V - The Controller Design

• Flight Control System
  - On-Board Computer executes flight program and controls solid booster nozzles and Vulcain cryogenic engine, via servo-valves and hydraulic actuators
  - Inertial Reference System (SRI)
    - Measures launcher attitude and movements in space
    - Has internal computer which calculates angles and velocities based on info from a "strap-down" inertial platform, with laser gyros and accelerometers
    - Data from SRI transmitted through data-bus to On-Board Computer (OBC)

Ariane V Robustness

• Redundancy used to improve reliability
  - Two SRIs operating in parallel with identical hardware/software
  - One SRI active, one "hot" stand-by
  - If OBC detects that active SRI has failed, it switches to other SRI, if it is functioning properly
• Two OBCs
• Other Flight Control System units are also duplicated

Inheritance from Ariane IV

• SRIs of Ariane V essentially common to a system which flies on Ariane IV
  - Software which caused the interruption in SRI computers is used before launch
    - to align inertial reference system
    - In Ariane IV, to enable rapid re-alignment of system in case of late hold in countdown
  - Realignment function does not serve any purpose on Ariane V
  - Realignment function retained for commonality reasons and allowed to operate for ~40 seconds after liftoff.

Ariane V Explosion: Chain of Events

• Launcher started to disintegrate at L+39 sec due to high aerodynamic loads due to angle of attack >20° that led to separation of boosters from main stage, in turn triggering self-destruct
• Angle of attack caused by full nozzle deflection of solid boosters and Vulcain main engine
• Nozzle deflections commanded by OBC software based on data from active SRI-2. Part of these data showed diagnostic bit pattern of SRI-2 computer, which was interpreted as flight data
• SRI-2 sent diagnostic data because unit had declared a failure due to a software exception
• OBC could not switch to back-up SRI-1 because that unit had already ceased to function for same reason as SRI-2

Ariane V Explosion: Chain of Events (cont.)

• Internal SRI software exception
  - Caused during execution of data conversion from 64-bit floating point to 16-bit signed integer value.
  - Floating point number larger than what could be represented by 16-bit signed integer
  - Resulted in Operand Error
  - Data conversion instructions were not protected from Operand Error, although other conversions of comparable variables were protected
• Operand Error occurred due to unexpected high value of Horizontal Bias (BH), horizontal velocity sensed by platform.
• BH value much higher than expected because Ariane V trajectory differs from Ariane IV trajectory and results in considerably higher horizontal velocity values

Built-in Software Protection

• Why wasn't the Horizontal Bias variable conversion protected?
  - Maximum workload target of 80% set for SRI computer, so not all conversions could be protected
  - Analysis performed on every operation which could produce an exception to determine vulnerability of unprotected code
  - Seven variables were at risk of leading to Operand Error
  - Protection added to four of the seven variables, three left unprotected, including Horizontal Bias
    - Physically limited?
    - Large margin of safety?

Inappropriate Response to Software Error

• Spec: In the event of any kind of exception, SRI processor should be shut down
  - Shut-down is typical response for random hardware failures
  - In this case, caused shutdown of two healthy critical units
  - Not appropriate for software design errors
  - Loss of critical software function
  - What might be a more appropriate response?
    - Restart? Not feasible -- too difficult to recalculate attitude
    - Computer to provide best estimate of required attitude information
    - Prime and redundant units developed independently

Code that Caused Exception Was Not Required After Launch

• Operand Error occurred in software that performs alignment of strap-down inertial platform
  - Complex math filter functions align x-axis to gravity axis and find north from Earth rotation sensing
  - Assumes launcher is positioned at known, fixed position
• Software module computes meaningful results only before lift-off. After lift-off, this function serves no purpose
• Alignment function designed to continue after lift-off for Ariane IV to allow for count-down restart (after hold) without waiting for normal alignment (~45 minutes)
• Not required for Ariane V, but preserved for commonality reasons ("If it ain't broke, don't fix it")
  - Ran for ~40 seconds after lift-off

Conclusions and Recommendations

• Limitations of alignment software not fully analyzed and implications of allowing it to continue to function during flight not realized
• No software function should run during flight unless needed
• No sensor should stop sending data
• Identify all implicit assumptions made by the code; e.g., data ranges and restrictions on use of equipment
• Confine exceptions to tasks and devise backup capabilities
• Redefine critical components, taking software failures into account

Mars Polar Lander

• Plausible Failure Modes
  - Premature shutdown of descent engines (compelling evidence)
  - Surface conditions exceed landing design
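
The Mars Pathfinder priority inversion described earlier can be sketched as a small single-CPU scheduling simulation. This is only an illustration, not the flight software: the task names, tick counts, arrival times, and the `simulate` function are all assumptions made up for the example. The `inherit` flag models priority inheritance, one common remedy for priority inversion that the slides above do not cover.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    priority: int      # larger number = higher priority
    arrival: int       # tick at which the task becomes ready
    work: int          # ticks of CPU time the task needs
    needs_mutex: bool  # True if all of its work is done while holding the mutex


# Hypothetical workload loosely mirroring the three tasks on the slide.
TASKS = [
    Task("bus_mgmt", priority=3, arrival=1, work=1, needs_mutex=True),
    Task("comms", priority=2, arrival=2, work=5, needs_mutex=False),
    Task("meteo", priority=1, arrival=0, work=3, needs_mutex=True),
]


def simulate(tasks, inherit):
    """Return {task name: completion tick} for a preemptive priority scheduler."""
    remaining = {t.name: t.work for t in tasks}
    done = {}
    holder = None  # task currently holding the shared mutex
    tick = 0
    while len(done) < len(tasks):
        ready = [t for t in tasks if t.arrival <= tick and t.name not in done]
        # A task is blocked if it needs the mutex and another task holds it.
        blocked = [t for t in ready
                   if t.needs_mutex and holder is not None and holder is not t]
        runnable = [t for t in ready if t not in blocked]

        def effective(t):
            # With inheritance, the holder runs at the priority of the
            # highest-priority task it is blocking.
            if inherit and t is holder and blocked:
                return max(t.priority, max(b.priority for b in blocked))
            return t.priority

        if runnable:
            current = max(runnable, key=effective)
            if current.needs_mutex:
                holder = current
            remaining[current.name] -= 1
            if remaining[current.name] == 0:
                done[current.name] = tick + 1
                if holder is current:
                    holder = None  # release the mutex on completion
        tick += 1
    return done


if __name__ == "__main__":
    print("no inheritance:  ", simulate(TASKS, inherit=False))
    print("with inheritance:", simulate(TASKS, inherit=True))
```

With these made-up numbers, the high-priority `bus_mgmt` task finishes at tick 9 without inheritance, because the medium-priority `comms` task preempts the mutex holder, and at tick 4 with inheritance.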
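
The Ariane V conversion failure (64-bit floating point to 16-bit signed integer) and the recommendation to identify implicit data-range assumptions can be illustrated with a short sketch. It is not the SRI code: the function names and the sample value are hypothetical, `OverflowError` merely stands in for the Operand Error, and clamping to the representable range is only one possible form of protection, assumed here for illustration.

```python
import math

INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_unprotected(value: float) -> int:
    """Convert with no recovery path: an out-of-range value raises.

    The OverflowError stands in for the Operand Error on the slides.
    """
    result = int(value)  # truncates toward zero
    if not INT16_MIN <= result <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in a 16-bit signed integer")
    return result


def to_int16_saturating(value: float) -> int:
    """Hypothetical protected conversion: clamp to the int16 range."""
    if math.isnan(value):
        raise ValueError("cannot convert NaN")
    clamped = max(float(INT16_MIN), min(float(INT16_MAX), value))
    return int(clamped)


if __name__ == "__main__":
    horizontal_bias = 40000.0  # made-up value, larger than int16 can hold
    try:
        to_int16_unprotected(horizontal_bias)
    except OverflowError as exc:
        print("unprotected:", exc)
    print("saturating:", to_int16_saturating(horizontal_bias))
```

The saturating version makes the range assumption explicit in the code and keeps the task running with a bounded value. Whether a clamped value is acceptable depends on how the result is used downstream, which is the kind of analysis the conclusions slide calls for.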

