MIT 6 826 - Availability and Replication - D1149540

Home> Schools> Massachusetts Institute of Technology> Electrical Engineering and Computer Science (6) > 6 826> Availability and Replication

DOC PREVIEW

MIT 6 826 - Availability and Replication

School name Massachusetts Institute of Technology

Course 6 826- Principles of Computer Systems

Pages 19

This preview shows page 1-2-3-4-5-6 out of 19 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 19 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 19 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 19 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 19 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 19 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 19 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 19 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

6.826—Principles of Computer Systems 2000 Handout 28. Availability and Replication 1 28. Availability and Replication This handout explains the basic issues in building highly available computer systems, and describes in some detail the specs and code for a replicated service with state. What is availability? A system is available if it delivers service promptly. Exactly what this means is something that has to be specified. For example, the spec might say that an ATM must deliver money from a local bank account to the user within 15 seconds, or that an airline reservation system must respond to user input within 1 second. The definition of availability is the fraction of offered load that gets prompt service; usually it’s more convenient to measure the probability p that a request is not serviced promptly. If requests come in at a certain rate, say 1/minute, with a memoryless distribution (that is, what happens to one request doesn’t depend on other requests; a tossed coin is memoryless, for example), then p is also the probability that not all requests arriving in one minute get service. If this probability is small then the time between bad minutes is 1/p minutes. This is called the ‘mean time to failure’ or MTTF; sometimes ‘mean time between failures’ or MBTF is used instead. Changing the time scale of course doesn’t change the MTTF: the probability of a bad hour is 60p, so the time between bad hours is 1/60p hours = 1/p minutes. If p = .00001 then there are 5 bad minutes per year. In a big system something is always broken, and usually we care about the service that one stream of customers sees rather than about whether the system is perfect, so we use the availability of one terminal to measure the MTTF. We focus on systems that fail and are repaired. While the system is failed, it provides no service. After it’s repaired, it provides perfect service until it fails again. If MTTF is the mean time to failure and MTTR is the mean time to repair, then the availability is p = MTTR/(MTTF + MTTR) If MTTR/MTTF is small, we have approximately p = MTTR/MTTF Note that doubling MTTF halves p, and so does halving the MTTR. The two factors are equally important. This simple point is often overlooked. Redundancy There are basically two ways to make a system available. One is to build it out of components that fail very seldom. This is good if you can do it, because it keeps the system simple. However, if there are n components and each fails independently with small probability pc, then the system fails with probability n pc. As n grows, this number grows too. Furthermore, it is often expensive to make highly reliable components. 6.826—Principles of Computer Systems 2000 Handout 28. Availability and Replication 2 The other way to make a system available is to use redundancy, so that the system can work even if some of its components have failed. There are two main patterns of redundancy: retry and replication. Retry is redundancy in time: fail, repair, and try again. If failures are intermittent, repair doesn’t require any action. In this case 1/MTBF is the probability of failure, and MTTR is the time required to detect the failure and try again. Often the failure detector is a timeout; then the MTTR is the timeout interval plus the retry time. Thus in retry, timeouts are critical to availability. Replication is physical redundancy, or redundancy in space: have several copies, so that one can do the work even if another fails. The most common form of replication is ‘primary-backup’ or ‘hot standby’, in which the system normally uses the primary component, but ‘fails over’ to a backup if the primary fails. This is very much like retry: the MTTR is the failover time, which is the time to detect the failure plus the time to make the backup live. This is a completely general form of redundancy. Error correcting codes are a more specialized form. Another completely general form of replication is to have several replicas that operate in lockstep and interact with the rest of the world only between steps. At the end of each step, compare the outputs of the replicas. If there’s a majority for some output value, that value is the output of the replicated system, and any replica that produced a different value is declared faulty and should be repaired. At least three replicas are needed for this to work; when there are exactly three it’s called ‘triple modular redundancy’, TMR for short. A common variation that simplifies the handling of outputs is ‘pair and spare’, which uses four replicas arranged in two pairs. If the outputs of a pair disagree, it is declared faulty and the other pair’s output is the system output. A weaker form of physical replication (that is, one that tolerates fewer failures) is an error correcting code. It’s easy to apply this to storing and transmitting data; we cite some examples below. A computer system has three major components: processing, storage, and communication. Here is how to apply redundancy to each of them. • In communication intermittent errors are common and retry is simply retransmitting a message. If messages can take different paths, component failures often look like intermittent errors because a retry will use different components. It’s also possible to use error-correcting codes (called ‘forward error correction’ in this context), but usually the error rate is low enough that this isn’t cost effective. • In storage retry is not so easy, but error correcting codes still work well. ECC memory using Hamming codes, the elaborate codes used on disk drives, and RAID disks are all examples of this. Straightforward replication, usually called ‘mirroring’, is also popular. • In processing error correcting codes usually can’t handle arbitrary state transitions. Retry is only possible if you have the old state, so it’s usually coded in a transaction system. The replicated state machines that we studied in handout 18 are fully general, however, and can make any kind of processing highly available. Using these methods to replicate a processor at6.826—Principles of Computer Systems 2000 Handout 28. Availability and Replication 3 the instruction set level is tricky but possible.1 People also use lockstep replication at the instruction level, usually pair-and-spare, but such systems can’t use standard components above the chip level, and it’s very expensive to

View Full Document