SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery

Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, David A. Wood
Computer Sciences Department
University of Wisconsin-Madison
{sorin, milo, markhill, david}@cs.wisc.edu
http://www.cs.wisc.edu/multifacet/

Abstract

We develop an availability solution, called SafetyNet, that uses a unified, lightweight checkpoint/recovery mechanism to support multiple long-latency fault detection schemes. At an abstract level, SafetyNet logically maintains multiple, globally consistent checkpoints of the state of a shared memory multiprocessor (i.e., processors, memory, and coherence permissions), and it recovers to a pre-fault checkpoint of the system and re-executes if a fault is detected. SafetyNet efficiently coordinates checkpoints across the system in logical time and uses logically atomic coherence transactions to free checkpoints of transient coherence state. SafetyNet minimizes performance overhead by pipelining checkpoint validation with subsequent parallel execution. We illustrate SafetyNet avoiding system crashes due to either dropped coherence messages or the loss of an interconnection network switch (and its buffered messages). Using full-system simulation of a 16-way multiprocessor running commercial workloads, we find that SafetyNet (a) adds statistically insignificant runtime overhead in the common case of fault-free execution, and (b) avoids a crash when tolerated faults occur.

1 Introduction

Availability has become increasingly important as internet services are integrated more tightly into society's infrastructure. Availability is particularly crucial for the shared-memory multiprocessor servers that run the application services and database management systems that must robustly manage business data. However, unless architectural steps are taken, availability will decrease over time as implementations use larger numbers of increasingly unreliable components in search of higher performance [21, 43].

The high clock frequencies and small circuit dimensions of future systems will increase their susceptibility to both transient and permanent faults. For example, higher frequencies exacerbate crosstalk [3, 8] and supply voltage noise [39], and smaller devices and wires suffer more from electromigration and alpha particle disruptions [36, 49].

(This work is supported in part by the National Science Foundation, with grants EIA-9971256, CDA-9623632, and CCR-0105721, an Intel Graduate Fellowship (Sorin), an IBM Graduate Fellowship (Martin), two Wisconsin Romnes Fellowships (Hill and Wood), and donations from Compaq Computer Corporation, Intel Corporation, IBM, and Sun Microsystems. 1063-6897/02 $17.00 © 2002 IEEE.)

Decades of research in fault-tolerant systems suggest a path toward addressing this problem. Mission-critical systems routinely employ redundant processors, memories, and interconnects (e.g., triple modular redundancy [26] or pair-and-spare [45]) to tolerate a broad class of faults. However, for many applications, the highly competitive commercial market will seek lower overhead solutions. For example, RAID level 5 [31] has been deployed widely because its overhead is 1/Nth for N data disks, rather than the 100% overhead for mirroring. In contrast to mission-critical systems, commercial servers aim for high availability but will accept occasional crashes to improve cost-performance. Software-visible techniques, including database logging and clustering, help preserve data integrity and service availability in these cases.

Current servers employ a range of hardware mechanisms to improve availability. RAID, error correcting codes (ECC), interconnection network link-level retry [18], and duplicate ALUs with processor retry [40] target specific, localized faults such as transient bit flips at memory, links, or ALUs. Computer architects seeking system-wide coverage currently must integrate a patchwork of localized detection and recovery schemes.

In this paper, we seek a unified, lightweight mechanism that provides end-to-end recovery from a
broad class of transient and permanent faults. This recovery mechanism can be combined with a wide range of fault detection mechanisms, including strong error detection codes (e.g., CRCs), redundant processors and ALUs [18, 40], redundant threads [37], and system-level state checkers [9]. By decoupling recovery from detection, our approach allows a range of implementations with varying cost-performance.

We develop a lightweight global checkpoint/recovery scheme called SafetyNet, and we illustrate its abstraction in Figure 1. SafetyNet periodically creates a system-wide (logical) checkpoint. SafetyNet checkpoints can span thousands or even millions of execution cycles, permitting powerful detection mechanisms with long latencies. After detecting a fault, all processors, caches, and memories revert to and resume execution from a consistent system-wide state, the recovery point. SafetyNet is a hardware scheme that requires no changes to any software or the instruction set. Moreover, SafetyNet has limited impact on the processor, coherence protocol, and I/O subsystem.

[Figure 1. SafetyNet Abstraction. The figure shows the Recovery Point, the Checkpoints Waiting To Be Validated, and the Current State of System. In SafetyNet, processors operate on the current state of the system, the system recovers to the recovery point if a fault is detected, and some number of non-current checkpoints can be pending validation.]

SafetyNet's basic approach is to log all changes to the architected state. This presents three main challenges for a lightweight recovery scheme. First, naively saving previous values before every register update, cache write, and coherence response would require a prohibitive amount of storage. Second, all processors, caches, and memories in a shared memory multiprocessor must recover to a consistent point. For example, recovery must ensure that all nodes agree on the coherence ownership and data values of each memory block. Third, SafetyNet must determine when it is safe to advance the
recovery point (i.e., validate a new checkpoint) without degrading performance to wait for slow fault detection mechanisms.

SafetyNet efficiently meets these three challenges, as described in Section 2. First, logging is reduced by checkpointing at a coarse granularity (e.g., 100,000 cycles). Only the first change to a piece of architectural state (register, memory block, or coherence permission) within a checkpoint interval requires a log entry, reducing the log overhead by one or two orders of magnitude. Second
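
The first-change logging rule and the recovery point described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware design of SafetyNet: the class `NodeLog`, its methods, and the dictionary-based state are hypothetical names chosen for illustration, and the model covers a single node with no coherence traffic.

```python
from dataclasses import dataclass, field

_ABSENT = object()  # marks "this block had no value before the interval"


@dataclass
class NodeLog:
    """Toy single-node model of first-change logging (hypothetical names)."""
    state: dict = field(default_factory=dict)  # block address -> current value
    logs: dict = field(default_factory=dict)   # interval number -> {addr: old value}
    current: int = 0                           # interval now executing
    recovery_point: int = 0                    # newest validated checkpoint
    entries: int = 0                           # log entries written so far

    def write(self, addr, value):
        log = self.logs.setdefault(self.current, {})
        if addr not in log:
            # Only the first change to a block in this interval is logged.
            log[addr] = self.state.get(addr, _ABSENT)
            self.entries += 1
        self.state[addr] = value

    def new_checkpoint(self):
        # Close the current interval; later writes are logged afresh.
        self.current += 1

    def validate(self, ckpt):
        # Advance the recovery point and discard logs that are no longer needed.
        if ckpt > self.current:
            raise ValueError("cannot validate a checkpoint not yet created")
        for k in [k for k in self.logs if k < ckpt]:
            del self.logs[k]
        self.recovery_point = max(self.recovery_point, ckpt)

    def recover(self):
        # Undo intervals from newest to oldest, back to the recovery point.
        for k in range(self.current, self.recovery_point - 1, -1):
            for addr, old in self.logs.pop(k, {}).items():
                if old is _ABSENT:
                    self.state.pop(addr, None)
                else:
                    self.state[addr] = old
        self.current = self.recovery_point


node = NodeLog()
node.write(0x40, 1)
node.write(0x40, 2)      # same block, same interval: no new log entry
node.new_checkpoint()    # checkpoint 1
node.validate(1)         # checkpoint 1 becomes the recovery point
node.write(0x40, 3)
node.write(0x80, 9)
node.write(0x40, 4)      # already logged in this interval
node.new_checkpoint()    # checkpoint 2, still pending validation
node.write(0x80, 10)
node.recover()           # a fault is detected: roll back to checkpoint 1
```

In this run, six writes produce four log entries, and recovery restores the state as it stood at checkpoint 1. With intervals as coarse as those in the text, repeated writes to the same block within an interval would be absorbed in the same way.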