Fault Tolerance & Recovery
Kevin Bowers & Jordan Wales

As transistors get smaller, wires get shorter, and the electric charges involved go down, computers become more and more susceptible to external factors. As fewer electrons are used to hold state, it becomes easier for cosmic ray strikes to alter transistors. While such strikes do not occur frequently, they do occur, and they can have disastrous consequences. In large multiprocessor systems especially, the odds that some machine in the group will be affected go up dramatically. It therefore becomes necessary both to detect such failures and to recover from them when they occur.

The first paper, Detailed Design and Evaluation of Redundant Multithreading Alternatives by Mukherjee, Kontz, and Reinhardt, explores using redundant threads to detect failures, at which point the affected instruction can be re-issued. This provides a simple detection and recovery scheme. The authors discuss implementing the technique both by running the two threads on a single processor and by splitting them between two, and they show that running the threads on separate processors brings a number of advantages. We will quickly look at the baseline lockstep approach before highlighting the advantages of the leading/trailing-thread designs.

The first idea many people had was simply to run the two threads in lockstep on separate processors, where each advances one cycle at a time and the outputs are compared to ensure both copies agree. This worked well and detected both transient and permanent faults when they occurred; a simple re-issue of the instruction was then used to determine the correct output. While this technique is simple and easy to understand and implement, it proves very costly in terms of performance: since the two copies run on different hardware, and in lockstep, both incur every branch misprediction and cache miss, and both finish at the same time.

To avoid paying the branch misprediction and cache miss penalties twice, the paper instead staggers the two copies: one thread runs ahead of the second, forwarding information to the trailing thread. The trailing thread then avoids cache miss penalties, since its data can reliably be prefetched, as well as the cost of branch mispredictions, since the leading thread can tell it which way each branch actually went. When the two threads run on two processors, this technique is called Chip-level Redundant Threading (CRT). A further advantage of having a leading and a trailing thread is that the pair can also run on a single processor, simply by switching execution between the two; this is termed Simultaneous Redundant Threading (SRT). SRT delays the commit of any instruction until the trailing thread completes and the outputs can be compared. Unfortunately, since both threads run on the same hardware, SRT can detect only transient faults, not permanent ones. This is probably acceptable, since permanent faults must be dealt with by other means anyway, while transient faults can be recovered from automatically. SRT also has the advantage of not needing a second processor, something that is sometimes hard to come by.

The biggest benefit of this work, however, comes when CRT is employed across many processes on many processors. Since the trailing thread does significantly less work than the leading thread, there are extra cycles available on the processor running the trailing thread. If two processes are running, each with a redundant copy, the leading thread of one can be paired with the trailing thread of the other on the first processor, and vice versa on the second processor, yielding substantial performance gains.
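To make the leading/trailing idea concrete, the following is a minimal Python sketch, not the paper's hardware design: a leading copy runs ahead and forwards its results through a queue, a trailing copy re-executes each instruction, and a value commits only when the two copies agree, with a mismatch triggering re-issue. The toy instruction format, the fault model, and the names execute and redundant_run are all invented for the example.

    from collections import deque
    import random

    def execute(program, fault_rate=0.0):
        """Run a list of (op, a, b) instructions; with probability fault_rate
        per instruction, flip a bit in the result to model a transient fault."""
        results = []
        for op, a, b in program:
            value = a + b if op == "add" else a * b
            if random.random() < fault_rate:
                value ^= 1 << random.randrange(16)   # single-event upset
            results.append(value)
        return results

    def redundant_run(program):
        """Leading copy runs ahead and forwards results through a queue;
        the trailing copy re-executes, and a value is committed only when
        the two copies agree. A mismatch triggers re-issue."""
        forward_queue = deque(execute(program, fault_rate=0.05))  # leading thread
        committed = []
        for inst in program:
            lead_value = forward_queue.popleft()
            trail_value = execute([inst])[0]        # trailing (checker) copy
            while lead_value != trail_value:        # disagreement: fault detected
                lead_value = execute([inst])[0]     # re-issue both copies
                trail_value = execute([inst])[0]
            committed.append(lead_value)            # commit only checked values
        return committed

    if __name__ == "__main__":
        prog = [("add", 1, 2), ("mul", 3, 4), ("add", 5, 6)]
        print(redundant_run(prog))                  # e.g. [3, 12, 11]

In the actual SRT and CRT designs the forwarded information is branch outcomes and load values, and only outputs leaving the sphere of replication (such as stores) are compared, but the commit-only-after-agreement structure is the same.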
The second paper we looked at, SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery by Sorin, Martin, Hill, and Wood, covers another technique for recovering from unexpected failures, likely due to transient effects. Essentially, SafetyNet, the scheme they develop, maintains a global checkpoint of system state. A snapshot of the system is taken every 100,000 cycles and fed to a verifier that runs in the background. While the system continues to operate, the verifier ensures that the snapshot is consistent; namely, it ensures that every operation between processors appears atomic, that is, the operation either happens entirely or not at all. Once this has been verified, a recovery point is marked. If the system were to crash, that snapshot could be used to restart the system from a consistent state.

One advantage of this technique is that it can be implemented entirely in hardware. Programs do not need to be recompiled or modified, which makes deployment in operational systems easier. It also means the work can be done without affecting the running processes, providing recovery at little cost, compared to the roughly 30% cost of running a redundant thread. Finally, because this is only a recovery procedure, any detection method can be paired with it, possibly one that detects faults without running multiple copies of the program, again avoiding the performance penalties of redundant threads.
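As a rough illustration of the checkpoint/recovery idea, here is a toy Python sketch, not SafetyNet's hardware mechanism: state is snapshotted at a fixed interval, a verifier promotes a snapshot to a recovery point only if it is consistent (a made-up two-account invariant stands in for "every inter-processor operation appears atomic"), and a detected fault rolls the system back to the last validated point. The interval value, the state layout, and every name below are assumptions made for the example.

    import copy
    import random

    CHECKPOINT_INTERVAL = 5        # SafetyNet uses on the order of 100,000 cycles

    class System:
        """Two accounts whose sum must stay 200; a transfer that loses one
        half of its update models an operation that failed to appear atomic."""

        def __init__(self):
            self.state = {"a": 100, "b": 100}
            self.recovery_point = copy.deepcopy(self.state)  # last validated snapshot
            self.ops = 0

        def transfer(self, amount):
            self.state["a"] -= amount
            if random.random() >= 0.05:       # 5% chance the second half is dropped
                self.state["b"] += amount
            self.ops += 1
            if self.ops % CHECKPOINT_INTERVAL == 0:
                self.checkpoint()

        def checkpoint(self):
            """Snapshot the state; the verifier (inline here) promotes it to a
            recovery point only if the atomicity invariant holds."""
            snapshot = copy.deepcopy(self.state)
            if snapshot["a"] + snapshot["b"] == 200:
                self.recovery_point = snapshot

        def detect_and_recover(self):
            """Any detection scheme could be plugged in here; on a detected
            fault, roll back to the last validated snapshot instead of crashing."""
            if self.state["a"] + self.state["b"] != 200:
                self.state = copy.deepcopy(self.recovery_point)

    if __name__ == "__main__":
        system = System()
        for _ in range(50):
            system.transfer(1)
            system.detect_and_recover()
        print(system.state, "sum =", system.state["a"] + system.state["b"])

In SafetyNet itself the checkpointing and validation happen in hardware and in the background, logging old values rather than copying the whole state, which is why the running programs see so little overhead.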
Providing robustness against cosmic ray strikes may seem like overkill, but in reality, as chips shrink and more and more machines are linked together, cosmic ray strikes become a significant factor in a system's mean time to failure. To address this, measures must be taken to detect and recover from both transient and permanent failures. We have outlined two such methods here, redundant threading and global checkpointing. The first detects and recovers from both kinds of fault, at the cost of roughly 30% overhead. The second leaves detection to other mechanisms, but stores consistent system states to which recovery can happen when failures are detected.
