High Availability Computer Systems

Jim Gray, Digital Equipment Corporation, 455 Market St., 7th Floor, San Francisco, CA 94105
Daniel P. Siewiorek, Department of Electrical Engineering, Carnegie Mellon University, Pittsburgh, PA 15213

Abstract

The key concepts and techniques used to build high availability computer systems are (1) modularity, (2) fail-fast modules, (3) independent failure modes, (4) redundancy, and (5) repair. These ideas apply to hardware, to design, and to software. They also apply to tolerating operations faults and environmental faults. This article explains these ideas and assesses high-availability system trends.

Overview

It is paradoxical that the larger a system is, the more critical its availability is and the more difficult it is to make it highly available. It is possible to build small ultra-available modules, but building large systems involving thousands of modules and millions of lines of code is still an art. These large systems are a core technology of modern society, yet their availability is still poorly understood.

This article sketches the techniques used to build highly available computer systems. It points out that three decades ago hardware components were the major source of faults and outages. Today, hardware faults are a minor source of system outages when compared to operations, environment, and software faults. Techniques and designs that tolerate this broader class of faults are in their infancy.

A Historical Perspective

Computers built in the late 1950s offered a twelve-hour mean time to failure. A maintenance staff of a dozen full-time customer engineers could repair the machine in about eight hours. This failure-repair cycle provided about 60% availability. The vacuum tube and relay components of these computers were the major source of failures; they had lifetimes of a few months. Therefore the machines rarely operated for more than a day without interruption [1].

Many fault detection and fault masking techniques used today were first used on these early computers. Diagnostics tested the machine. Self-checking computational techniques detected faults while the computation progressed. The program occasionally saved (checkpointed) its state on stable media. After a failure, the program read the most recent checkpoint and continued the computation from that point. This checkpoint/restart technique allowed long-running computations to be performed by machines that failed every few hours. (A minimal code sketch of this pattern appears below.)

Device improvements have improved computer system availability. By 1980, typical well-run computer systems offered 99% availability [2]. This sounds good, but 99% availability is 100 minutes of downtime per week. Such outages may be acceptable for commercial back-office computer systems that process work in asynchronous batches for later reporting.

Mission-critical and online applications cannot tolerate 100 minutes of downtime per week. They require high-availability systems, ones that deliver 99.999% availability. This allows at most five minutes of service interruption per year.

Process control, production control, and transaction processing applications are the principal consumers of the new class of high-availability systems. Telephone networks, airports, hospitals, factories, and stock exchanges cannot afford to stop because of a computer outage. In these applications, outages translate directly to reduced productivity, damaged equipment, and sometimes lost lives.
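To make the checkpoint/restart technique mentioned above concrete, here is a minimal sketch in C. It is illustrative only and is not drawn from any system described in this article; the file name job.ckpt, the state structure, and the summation loop are arbitrary choices. A production version would also write the checkpoint atomically (for example, to a temporary file that is then renamed) so that a crash during checkpointing cannot destroy the last good copy.

    #include <stdio.h>
    #include <stdlib.h>

    /* Illustrative program state: a real application would save whatever it
     * needs to resume its computation exactly where it left off. */
    struct state {
        long next_item;     /* next unit of work to process */
        double partial;     /* partial result accumulated so far */
    };

    #define CHECKPOINT_FILE "job.ckpt"   /* stands in for stable media */
    #define N_ITEMS 1000000L

    /* Save the state on stable media; early systems used drum or tape. */
    static void checkpoint(const struct state *s)
    {
        FILE *f = fopen(CHECKPOINT_FILE, "wb");
        if (f == NULL || fwrite(s, sizeof *s, 1, f) != 1) {
            perror("checkpoint");
            exit(1);
        }
        fclose(f);
    }

    /* After a failure, reload the most recent checkpoint if one exists. */
    static int restart(struct state *s)
    {
        FILE *f = fopen(CHECKPOINT_FILE, "rb");
        if (f == NULL)
            return 0;                    /* no checkpoint: start from scratch */
        int ok = fread(s, sizeof *s, 1, f) == 1;
        fclose(f);
        return ok;
    }

    int main(void)
    {
        struct state s = { 0, 0.0 };

        if (restart(&s))
            printf("restarting from checkpoint at item %ld\n", s.next_item);

        for (; s.next_item < N_ITEMS; s.next_item++) {
            if (s.next_item % 100000 == 0)
                checkpoint(&s);               /* save progress before more work */
            s.partial += (double)s.next_item; /* stand-in for real work */
        }

        printf("result = %.0f\n", s.partial);
        remove(CHECKPOINT_FILE);              /* computation done; discard checkpoint */
        return 0;
    }

The checkpoint is taken before the next batch of work begins, so a restart never double-counts work that was completed and saved; it only redoes work performed after the last checkpoint.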
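The availability figures quoted above follow from simple arithmetic: availability is the fraction of time the system is up, A = MTTF / (MTTF + MTTR), and unavailability (1 − A) translates directly into downtime per week or per year. The short C program below reproduces the numbers used in the text; the constants (10,080 minutes per week, 525,600 minutes per year) are ordinary calendar figures, not values taken from the article.

    #include <stdio.h>

    /* Availability is the fraction of time a system can do useful work:
     * A = MTTF / (MTTF + MTTR), where MTTF is mean time to failure and
     * MTTR is mean time to repair. */
    static double availability(double mttf_hours, double mttr_hours)
    {
        return mttf_hours / (mttf_hours + mttr_hours);
    }

    int main(void)
    {
        /* Late-1950s machine: 12-hour MTTF, 8-hour repair time. */
        printf("1950s machine: %.0f%% available\n",
               100.0 * availability(12.0, 8.0));            /* 60% */

        /* A 99%-available system is down 1% of the 10,080 minutes in a week. */
        printf("99%%: about %.0f minutes of downtime per week\n",
               0.01 * 7 * 24 * 60);                         /* about 101 */

        /* A 99.999%-available system is down 0.001% of the year. */
        printf("99.999%%: about %.1f minutes of downtime per year\n",
               0.00001 * 365 * 24 * 60);                    /* about 5.3 */
        return 0;
    }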
Degrees of availability can be characterized by orders of magnitude. Unmanaged computer systems on the Internet typically fail every two weeks and average ten hours to recover. These unmanaged computers give about 90% availability. Managed conventional systems fail several times a year; each failure takes about two hours to repair. This translates to 99% availability [2]. Current fault-tolerant systems fail once every few years and are repaired within a few hours [3]. This is 99.99% availability. High-availability systems require fewer failures and faster repair. Their requirements are one to three orders of magnitude more demanding than current fault-tolerant technologies (see Table 1).

Table 1. Availability of typical system classes. Today's best systems are in the high-availability range. The best of the general-purpose systems are in the fault-tolerant range (as of 1990).

    System Type                Unavailability    Availability    Availability
                               (min/year)                        Class
    unmanaged                      50,000            90%               1
    managed                         5,000            99%               2
    well managed                      500            99.9%             3
    fault tolerant                     50            99.99%            4
    high availability                   5            99.999%           5
    very high availability              0.5          99.9999%          6
    ultra availability                  0.05         99.99999%         7

As the nines begin to pile up in the availability measure, it is better to think of availability in terms of denial of service, measured in minutes per year. So, for example, 99.999% availability is about 5 minutes of service denial per year. Even this metric is a little cumbersome, so the concept of availability class, or simply class, is defined by analogy to the hardness of diamonds or the class of a cleanroom. Availability class is the number of leading nines in the availability figure for a system or module. More formally, if the system availability is A, the system's availability class is ⌊log10(1 / (1 − A))⌋. The rightmost column of Table 1 tabulates the availability classes of various system types.

The telephone network is a good example of a high-availability system, a class 5 system. Its design goal is at most two outage hours in forty years. Unfortunately, over the last two years there have been several major outages of the United States telephone system: a nationwide outage lasting eight hours and a midwest outage lasting four days. This shows how difficult it is to build systems with high availability.

Production computer software typically has more than one defect per thousand lines of code. When millions of lines of code are needed, the system is likely to have thousands of software defects. This seems to put a ceiling on the size of high-availability systems: either the system must be small, or it must be limited to a failure rate of one fault per decade. For example, the ten-million-line Tandem system software is measured to have a thirty-year failure rate [3].

High availability requires systems designed to tolerate faults: to detect the fault, report it, mask it, and then continue service while the faulty component is repaired offline.
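The class definition is easy to compute. The following C sketch, offered as an illustration rather than anything taken from the article, evaluates the availability class and the annual downtime for the availability figures of Table 1; note that Table 1 rounds downtime to one significant figure, so the computed values differ slightly.

    #include <math.h>
    #include <stdio.h>

    /* Availability class: the number of leading nines in the availability
     * figure, i.e. floor(log10(1 / (1 - A))).  The small epsilon guards
     * against floating-point round-off when 1 - A is an exact power of ten. */
    static int availability_class(double a)
    {
        return (int)floor(log10(1.0 / (1.0 - a)) + 1e-9);
    }

    /* Unavailability expressed as minutes of denied service per year. */
    static double downtime_min_per_year(double a)
    {
        return (1.0 - a) * 365 * 24 * 60;
    }

    int main(void)
    {
        /* The availability figures of Table 1, from 90% to 99.99999%. */
        double a[] = { 0.90, 0.99, 0.999, 0.9999, 0.99999, 0.999999, 0.9999999 };

        for (int i = 0; i < (int)(sizeof a / sizeof a[0]); i++)
            printf("A = %-9.7f  class %d  about %g min/year of downtime\n",
                   a[i], availability_class(a[i]), downtime_min_per_year(a[i]));
        return 0;
    }

On most Unix-like systems the math library must be linked explicitly, for example: cc class.c -lm.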
Beyond the prosaic hardware and software faults, a high-availability system must tolerate the following sample faults. Electrical power at a typical site in North