Tandem TR 85.7

WHY DO COMPUTERS STOP AND WHAT CAN BE DONE ABOUT IT?

Jim Gray
June, 1985
Revised November, 1985

ABSTRACT

An analysis of the failure statistics of a commercially available fault-tolerant system shows that administration and software are the major contributors to failure. Various approaches to software fault-tolerance are then discussed -- notably process-pairs, transactions and reliable storage. It is pointed out that faults in production software are often soft (transient) and that a transaction mechanism combined with persistent process-pairs provides fault-tolerant execution -- the key to software fault-tolerance.

DISCLAIMER

This paper is not an “official” Tandem statement on fault-tolerance. Rather, it expresses the author’s research on the topic.

An early version of this paper appeared in the proceedings of the German Association for Computing Machinery Conference on Office Automation, Erlangen, Oct. 2-4, 1985.

TABLE OF CONTENTS

Introduction
Hardware Availability Modular Redundancy
An Analysis of Failures of a Fault-Tolerant System
Implications of the Analysis of MTBF
Fault-tolerant Execution
    Software modularity through processes and messages
    Fault containment through fail-fast software modules
    Software faults are soft -- the Bohrbug/Heisenbug hypothesis
    Process-pairs for fault-tolerant execution
    Transactions for data integrity
    Transactions for simple fault-tolerant execution
Fault-tolerant Communication
Fault-tolerant Storage
Summary
Acknowledgments
References

Introduction

Computer applications such as patient monitoring, process control, online transaction processing, and electronic mail require high availability. The anatomy of a typical large system failure is interesting. Assuming, as is usually the case, that an operations or software fault caused the outage, Figure 1 shows a time line of the outage. It takes a few minutes for someone to realize that there is a problem and that a restart is the only obvious solution. It takes the operator about 5 minutes to snapshot the system state for later analysis. Then the restart can begin. For a large system, the operating system takes a few minutes to get started. Then the database and data communications systems begin their restart. The database restart completes within a few minutes, but it may take an hour to restart a large terminal network. Once the network is up, the users take a while to refocus on the tasks they had been performing. After restart, much work has been saved for the system to perform -- so the transient load presented at restart is the peak load. This affects system sizing.

Conventional well-managed transaction processing systems fail about once every two weeks [Mourad], [Burman]. The ninety-minute outage outlined above translates to 99.6% availability for such systems. 99.6% availability “sounds” wonderful, but hospital patients, steel mills, and electronic mail users do not share this view -- a 1.5-hour outage every ten days is unacceptable, especially since outages usually come at times of peak demand [Mourad].
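To make the arithmetic behind the 99.6% figure explicit, here is a back-of-the-envelope check (a sketch of mine, not part of the paper): availability is one minus the fraction of time lost to outages.

    # Back-of-the-envelope check: a 90-minute outage once every two weeks.
    outage_minutes = 90.0
    period_minutes = 14 * 24 * 60          # two weeks = 20,160 minutes

    availability = 1.0 - outage_minutes / period_minutes
    print(f"availability = {availability:.4f}")          # 0.9955, about 99.6%

    # Downtime implied per year at this failure rate:
    print(f"hours down per year = {(1.0 - availability) * 365 * 24:.0f}")   # ~39

At that rate a system is down roughly 39 hours a year, which makes the verdict above concrete.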
These applications require systems which virtually never fail -- parts of the system may fail, but the rest of the system must tolerate failures and continue delivering service. This paper reports on the structure and success of such a system -- the Tandem NonStop system. It has an MTBF measured in years -- more than two orders of magnitude better than conventional designs.

Hardware Availability Modular Redundancy

Reliability and availability are different: availability is doing the right thing within the specified response time; reliability is not doing the wrong thing. Expected reliability is proportional to the Mean Time Between Failures (MTBF). A failure has some Mean Time To Repair (MTTR). Availability can be expressed as the probability that the system will be available:

    Availability = MTBF / (MTBF + MTTR)

In distributed systems, some parts may be available while others are not. In these situations, one weights the availability of all the devices (e.g. if 90% of the database is available to 90% of the terminals, then the system is .9 x .9 = 81% available).
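A minimal sketch of both calculations (mine, not the paper's; the helper name and the sample MTBF/MTTR values are assumptions for illustration):

    def availability(mtbf: float, mttr: float) -> float:
        """Steady-state availability: MTBF / (MTBF + MTTR)."""
        return mtbf / (mtbf + mttr)

    # Assumed numbers: a module that fails once a year (MTBF = 8760 hours)
    # and takes 1.5 hours to repair.
    print(f"{availability(8760.0, 1.5):.5f}")   # 0.99983

    # The paper's weighted example: 90% of the database available to
    # 90% of the terminals gives .9 x .9 = 81% system availability.
    print(0.9 * 0.9)                            # 0.81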
The key to providing high availability is to modularize the system so that modules are the unit of failure and replacement. Spare modules are configured to give the appearance of instantaneous repair -- if MTTR is tiny, then the failure is “seen” as a delay rather than a failure. For example, geographically distributed terminal networks frequently have one terminal in a hundred
