CMU CS 15740 - SafetyNet: Improving the Availability of Shared Memory Multiprocessors

SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery

Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, David A. Wood
Computer Sciences Department, University of Wisconsin-Madison
{sorin, milo, markhill, david}@cs.wisc.edu
http://www.cs.wisc.edu/multifacet/

Abstract

We develop an availability solution, called SafetyNet, that uses a unified, lightweight checkpoint/recovery mechanism to support multiple long-latency fault detection schemes. At an abstract level, SafetyNet logically maintains multiple, globally consistent checkpoints of the state of a shared memory multiprocessor (i.e., processors, memory, and coherence permissions), and it recovers to a pre-fault checkpoint of the system and re-executes if a fault is detected. SafetyNet efficiently coordinates checkpoints across the system in logical time and uses "logically atomic" coherence transactions to free checkpoints of transient coherence state. SafetyNet minimizes performance overhead by pipelining checkpoint validation with subsequent parallel execution. We illustrate SafetyNet avoiding system crashes due to either dropped coherence messages or the loss of an interconnection network switch (and its buffered messages). Using full-system simulation of a 16-way multiprocessor running commercial workloads, we find that SafetyNet (a) adds statistically insignificant runtime overhead in the common case of fault-free execution, and (b) avoids a crash when tolerated faults occur.

1 Introduction

Availability has become increasingly important as internet services are integrated more tightly into society's infrastructure. Availability is particularly crucial for the shared-memory multiprocessor servers that run the application services and database management systems that must robustly manage business data.
However, unless architectural steps are taken, availability will decrease over time as implementations use larger numbers of increasingly unreliable components in search of higher performance [21, 43]. The high clock frequencies and small circuit dimensions of future systems will increase their susceptibility to both transient and permanent faults. For example, higher frequencies exacerbate crosstalk [3, 8] and supply voltage noise [39], and smaller devices and wires suffer more from electromigration and alpha particle disruptions [36, 49].

This work is supported in part by the National Science Foundation, with grants EIA-9971256, CDA-9623632, and CCR-0105721, an Intel Graduate Fellowship (Sorin), an IBM Graduate Fellowship (Martin), two Wisconsin Romnes Fellowships (Hill and Wood), and donations from Compaq Computer Corporation, Intel Corporation, IBM, and Sun Microsystems.

Decades of research in fault-tolerant systems suggest a path toward addressing this problem. Mission-critical systems routinely employ redundant processors, memories, and interconnects (e.g., triple-modular redundancy [26] or pair-and-spare [45]) to tolerate a broad class of faults. However, for many applications, the highly competitive commercial market will seek lower overhead solutions. For example, RAID level 5 [31] has been deployed widely because its overhead is 1/Nth (for N data disks) rather than the 100% overhead for mirroring. In contrast to mission-critical systems, commercial servers aim for high availability but will accept occasional crashes to improve cost/performance. Software-visible techniques, including database logging and clustering, help preserve data integrity and service availability in these cases.

Current servers employ a range of hardware mechanisms to improve availability.
RAID, error correcting codes (ECC), interconnection network link-level retry [18], and duplicate ALUs with processor retry [40] target specific, localized faults such as transient bit flips at memory, links, or ALUs. Computer architects seeking system-wide coverage currently must integrate a patchwork of localized detection and recovery schemes.

In this paper, we seek a unified, lightweight mechanism that provides end-to-end recovery from a broad class of transient and permanent faults. This recovery mechanism can be combined with a wide range of fault detection mechanisms, including strong error detection codes (e.g., CRCs), redundant processors and ALUs [18, 40], redundant threads [37], and system-level state checkers [9]. By decoupling recovery from detection, our approach allows a range of implementations with varying cost-performance.

We develop a lightweight global checkpoint/recovery scheme called SafetyNet, and we illustrate its abstraction in Figure 1. SafetyNet periodically creates a system-wide (logical) checkpoint. SafetyNet checkpoints can span thousands or even millions of execution cycles, permitting powerful detection mechanisms with long latencies. After detecting a fault, all processors, caches, and memories revert to and resume execution from a consistent system-wide state, the recovery point. SafetyNet is a hardware scheme that requires no changes to any software or the instruction set. Moreover, SafetyNet has limited impact on the processor, coherence protocol, and I/O subsystem.

[Figure 1 diagram labels: Recovery Point; Checkpoints Waiting To Be Validated; Current State of System]
Figure 1. SafetyNet Abstraction. In SafetyNet, (1) processors operate on the current state of the system, (2) the system recovers to the recovery point if a fault is detected, and (3) some number of non-current checkpoints can be pending validation.

1063-6897/02 $17.00 © 2002 IEEE

SafetyNet's basic approach is to log all changes to the architected state.
This presents three main challenges for a lightweight recovery scheme. First, naively saving previous values before every register update, cache write, and coherence response would require a prohibitive amount of storage. Second, all processors, caches, and memories in a shared-memory multiprocessor must recover to a consistent point. For example, recovery must ensure that all nodes agree on the coherence ownership and data values of each memory block. Third, SafetyNet must determine when it is safe to advance the recovery
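
The logging idea and the recovery point described above can be illustrated with a small software model. The following Python sketch is not SafetyNet's hardware design: it assumes a single word-addressed memory, ignores registers, coherence state, and the cross-node coordination that the second challenge requires, and all names (CheckpointedMemory, validate_oldest, and so on) are hypothetical. To keep the log small it saves an old value only on the first write to an address within a checkpoint interval; that optimization is an assumption of this sketch and is not stated in the text above.

```python
_ABSENT = object()  # marks "address held no value before this interval"


class CheckpointedMemory:
    """Toy memory with undo-log checkpoints (illustrative model only)."""

    def __init__(self):
        self.data = {}    # current state: address -> value
        # One undo log per checkpoint interval that is not yet validated.
        # logs[0] starts at the recovery point; logs[-1] is the current interval.
        self.logs = [{}]

    def read(self, addr):
        return self.data.get(addr)

    def write(self, addr, value):
        log = self.logs[-1]
        if addr not in log:
            # First write in this interval: remember the pre-interval value.
            log[addr] = self.data.get(addr, _ABSENT)
        self.data[addr] = value

    def checkpoint(self):
        """Close the current interval and open a new one."""
        self.logs.append({})

    def validate_oldest(self):
        """Declare the oldest interval fault-free and advance the recovery
        point by discarding its undo log. Returns False if no checkpoint
        is pending validation."""
        if len(self.logs) <= 1:
            return False
        self.logs.pop(0)
        return True

    def recover(self):
        """Roll back to the recovery point by undoing logs, newest first."""
        for log in reversed(self.logs):
            for addr, old in log.items():
                if old is _ABSENT:
                    self.data.pop(addr, None)
                else:
                    self.data[addr] = old
        self.logs = [{}]


mem = CheckpointedMemory()
mem.write(0x10, 1)
mem.checkpoint()
mem.validate_oldest()   # recovery point now holds {0x10: 1}
mem.write(0x10, 2)
mem.write(0x10, 3)      # second write in the interval adds no log entry
mem.checkpoint()        # this checkpoint is still pending validation
mem.write(0x20, 7)
mem.recover()           # a fault was detected: revert to the recovery point
print(mem.read(0x10), mem.read(0x20))  # 1 None
```

In this model, log storage grows with the number of distinct addresses written per interval rather than with the number of writes, and delaying validate_oldest is what lets a slow fault detector still trigger a rollback to a state that predates the fault.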

