CMU CS 15740 - SafetyNet: Improving the Availability of Shared Memory Multiprocessors - D1430018

SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery

Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, David A. Wood
Computer Sciences Department
University of Wisconsin-Madison
{sorin, milo, markhill, david}@cs.wisc.edu
http://www.cs.wisc.edu/multifacet/

Abstract

We develop an availability solution, called SafetyNet, that uses a unified, lightweight checkpoint/recovery mechanism to support multiple long-latency fault detection schemes. At an abstract level, SafetyNet logically maintains multiple, globally consistent checkpoints of the state of a shared memory multiprocessor (i.e., processors, memory, and coherence permissions), and it recovers to a pre-fault checkpoint of the system and re-executes if a fault is detected. SafetyNet efficiently coordinates checkpoints across the system in logical time and uses "logically atomic" coherence transactions to free checkpoints of transient coherence state. SafetyNet minimizes performance overhead by pipelining checkpoint validation with subsequent parallel execution. We illustrate SafetyNet avoiding system crashes due to either dropped coherence messages or the loss of an interconnection network switch (and its buffered messages). Using full-system simulation of a 16-way multiprocessor running commercial workloads, we find that SafetyNet (a) adds statistically insignificant runtime overhead in the common case of fault-free execution, and (b) avoids a crash when tolerated faults occur.

1 Introduction

Availability has become increasingly important as internet services are integrated more tightly into society's infrastructure. Availability is particularly crucial for the shared-memory multiprocessor servers that run the application services and database management systems that must robustly manage business data.
However, unless architectural steps are taken, availability will decrease over time as implementations use larger numbers of increasingly unreliable components in search of higher performance [21, 43]. The high clock frequencies and small circuit dimensions of future systems will increase their susceptibility to both transient and permanent faults. For example, higher frequencies exacerbate crosstalk [3, 8] and supply voltage noise [39], and smaller devices and wires suffer more from electromigration and alpha particle disruptions [36, 49].

This work is supported in part by the National Science Foundation, with grants EIA-9971256, CDA-9623632, and CCR-0105721, Intel Graduate Fellowship (Sorin), IBM Graduate Fellowship (Martin), two Wisconsin Romnes Fellowships (Hill and Wood), and donations from Compaq Computer Corporation, Intel Corporation, IBM, and Sun Microsystems.

Decades of research in fault-tolerant systems suggest a path toward addressing this problem. Mission-critical systems routinely employ redundant processors, memories, and interconnects (e.g., triple-modular redundancy [26] or pair-and-spare [45]) to tolerate a broad class of faults. However, for many applications, the highly competitive commercial market will seek lower overhead solutions. For example, RAID level 5 [31] has been deployed widely because its overhead is 1/N (for N data disks) rather than the 100% overhead for mirroring. In contrast to mission-critical systems, commercial servers aim for high availability but will accept occasional crashes to improve cost/performance. Software-visible techniques, including database logging and clustering, help preserve data integrity and service availability in these cases.

Current servers employ a range of hardware mechanisms to improve availability.
RAID, error correcting codes (ECC), interconnection network link-level retry [18], and duplicate ALUs with processor retry [40] target specific, localized faults such as transient bit flips at memory, links, or ALUs. Computer architects seeking system-wide coverage currently must integrate a patchwork of localized detection and recovery schemes.

In this paper, we seek a unified, lightweight mechanism that provides end-to-end recovery from a broad class of transient and permanent faults. This recovery mechanism can be combined with a wide range of fault detection mechanisms, including strong error detection codes (e.g., CRCs), redundant processors and ALUs [18, 40], redundant threads [37], and system-level state checkers [9]. By decoupling recovery from detection, our approach allows a range of implementations with varying cost-performance.

We develop a lightweight global checkpoint/recovery scheme called SafetyNet, and we illustrate its abstraction in Figure 1. SafetyNet periodically creates a system-wide (logical) checkpoint. SafetyNet checkpoints can span thousands or even millions of execution cycles, permitting powerful detection mechanisms with long latencies. After detecting a fault, all processors, caches, and memories revert to and resume execution from a consistent system-wide state, the recovery point.

[Figure 1. SafetyNet Abstraction. Diagram labels: Recovery Point; Checkpoints Waiting To Be Validated; Current State of System. In SafetyNet, (1) processors operate on the current state of the system, (2) the system recovers to the recovery point if a fault is detected, and (3) some number of non-current checkpoints can be pending validation.]

1063-6897/02 $17.00 © 2002 IEEE

SafetyNet is a hardware scheme that requires no changes to any software or the instruction set. Moreover, SafetyNet has limited impact on the processor, coherence protocol, and I/O subsystem.

SafetyNet's basic approach is to log all changes to the architected state.
This presents three main challenges for a lightweight recovery scheme. First, naively saving previous values before every register update, cache write, and coherence response would require a prohibitive amount of storage. Second, all processors, caches, and memories in a shared-memory multiprocessor must recover to a consistent point. For example, recovery must ensure that all nodes agree on the coherence ownership and data values of each memory block. Third, SafetyNet must determine when it is safe to advance the recovery
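
The logging approach and the recovery-point abstraction described in this section can be illustrated with a small software sketch. It is not the SafetyNet hardware design: it models a single node with no caches or coherence, and every name in it (CheckpointedMemory, write, checkpoint, validate, recover) is hypothetical. It also assumes, as one plausible way to limit log storage, that an old value is saved only on the first write to an address within a checkpoint interval; the text above does not state how SafetyNet bounds its logs.

```python
from dataclasses import dataclass, field

_ABSENT = object()  # marks "address held no value" in an undo log


@dataclass
class CheckpointedMemory:
    """Toy single-node memory with undo logging (illustrative only)."""
    data: dict = field(default_factory=dict)
    current: int = 0         # interval that is executing now
    recovery_point: int = 0  # oldest checkpoint that can be restored
    # undo_logs[c] maps address -> value held when interval c began
    undo_logs: dict = field(default_factory=lambda: {0: {}})

    def write(self, addr, value):
        log = self.undo_logs[self.current]
        # Assumption of this sketch: save the old value only on the
        # first write to addr within the current interval.
        if addr not in log:
            log[addr] = self.data.get(addr, _ABSENT)
        self.data[addr] = value

    def checkpoint(self):
        """Begin a new interval; earlier ones stay pending validation."""
        self.current += 1
        self.undo_logs[self.current] = {}

    def validate(self):
        """Declare the oldest pending interval fault-free, free its log,
        and advance the recovery point by one checkpoint."""
        if self.recovery_point < self.current:
            del self.undo_logs[self.recovery_point]
            self.recovery_point += 1

    def recover(self):
        """Roll back to the recovery point, undoing newest intervals first."""
        for c in range(self.current, self.recovery_point - 1, -1):
            for addr, old in self.undo_logs[c].items():
                if old is _ABSENT:
                    self.data.pop(addr, None)
                else:
                    self.data[addr] = old
        self.current = self.recovery_point
        self.undo_logs = {self.current: {}}


mem = CheckpointedMemory()
mem.write("x", 1)
mem.checkpoint()   # checkpoint 1 captures x == 1
mem.validate()     # interval 0 is fault-free; recovery point is now 1
mem.write("x", 2)
mem.write("x", 3)  # second write in the interval, nothing new is logged
mem.checkpoint()   # checkpoint 2 is pending validation
mem.write("y", 9)
mem.recover()      # a fault was detected: return to checkpoint 1
```

In this toy model, validate plays the role of advancing the recovery point once an interval is known to be fault-free, and recover restores the state that existed at the recovery point. A real multiprocessor would additionally have to keep all nodes consistent, which the sketch does not attempt.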
