Berkeley COMPSCI 258 - ReVive - Cost-Effective Architectural Suppor

ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors*

Milos Prvulovic, Zheng Zhang†, Josep Torrellas
University of Illinois at Urbana-Champaign
Hewlett-Packard Laboratories
http://iacoma.cs.uiuc.edu

Abstract

This paper presents ReVive, a novel general-purpose rollback recovery mechanism for shared-memory multiprocessors. ReVive carefully balances the conflicting requirements of availability, performance, and hardware cost. ReVive performs checkpointing, logging, and distributed parity protection, all memory-based. It enables recovery from a wide class of errors, including the permanent loss of an entire node. To maintain high performance, ReVive includes specialized hardware that performs frequent operations in the background, such as log and parity updates. To keep the cost low, more complex checkpointing and recovery functions are performed in software, while the hardware modifications are limited to the directory controllers of the machine. Our simulation results on a 16-processor system indicate that the average error-free execution time overhead of using ReVive is only 6.3%, while the achieved availability is better than 99.999% even when the errors occur as often as once per day.

1 Introduction

Cache-coherent shared-memory multiprocessors are seeing widespread use in commercial, technical, and scientific applications. In recent years, fault-tolerance has become an increasingly important feature of such systems. In some commercial applications, high availability is needed, as business transactions are being processed by the system. Some applications execute for a long time and require a highly reliable execution environment. Examples of such applications are those that mine large data sets and many simulations. Unfortunately, both availability and reliability are difficult to achieve in modern large systems.
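To put the abstract's availability figure in perspective, the following is a minimal arithmetic sketch. It assumes a simple model in which every error costs a fixed amount of downtime and errors arrive at a steady rate; the function names are hypothetical and are not taken from the paper.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 s

def availability(downtime_per_error_s, errors_per_day=1.0):
    """Fraction of time the system is usable under the fixed-downtime model (an assumption)."""
    lost_seconds = downtime_per_error_s * errors_per_day
    return 1.0 - lost_seconds / SECONDS_PER_DAY

def max_downtime_per_error(target_availability, errors_per_day=1.0):
    """Largest per-error downtime in seconds that still meets the availability target."""
    return (1.0 - target_availability) * SECONDS_PER_DAY / errors_per_day

# Downtime budget per error for 99.999% availability at one error per day.
print(f"{max_downtime_per_error(0.99999):.3f} s")  # 0.864 s

# Availability if each daily error costs half a second.
print(f"{availability(0.5):.6f}")  # 0.999994
```

Under this model, 99.999% availability at one error per day leaves a budget of about 0.864 seconds of unavailability per error, which is why the sub-second recovery reported later in the Introduction matters.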
Improvements in silicon technology result in smaller feature sizes, while power dissipation constraints result in lower operating voltages. Both of these make modern integrated circuits prone to transient and permanent faults. In large systems the problem is worse, as those systems contain many interacting components that must all operate correctly.

To deal with these problems, much work has been done in error recovery. Typically, error recovery mechanisms are categorized into Forward and Backward Error Recovery (FER and BER). With FER, hardware redundancy is added to the system, which makes it possible to determine the correct outcome of an operation, even if one (or more) of the participating devices fails. It is possible to design cost-effective FER that targets only a single device, such as the processor core [3, 28, 30]. However, general-purpose FER is not cheap. The most popular such method is triple-modular redundancy (TMR), in which each operation is performed by three identical devices and a majority vote decides the correct result. For most systems, the cost of TMR is prohibitively high.

*This work was supported in part by the National Science Foundation under grants CCR-9970488, EIA-0081307, EIA-0072102, and CHE-0121357; by DARPA under grant F30602-01-C-0078; and by gifts from IBM, Intel, and Hewlett-Packard.
†Currently at Microsoft Research Asia: [email protected].

BER, also called rollback recovery or checkpointing, can be used in such systems. With rollback recovery, the system stores information about its past state. When an error is detected, this information allows the system to be restored into a previous error-free state. The main advantage of BER is that no hardware replication is required. However, it has three disadvantages: the performance overhead during error-free execution, storage overhead, and the higher recovery latency.
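The rollback recovery idea described above can be illustrated with a toy example. This is a minimal sketch that assumes an undo log holding the old value of every location written since the last checkpoint; the class `RollbackMemory` and its methods are hypothetical and do not describe ReVive's actual hardware mechanism.

```python
class RollbackMemory:
    """Toy memory with an undo log (illustrative assumption, not ReVive's design)."""

    def __init__(self, size):
        self.cells = [0] * size
        self.undo_log = []  # (address, old value) pairs since the last checkpoint

    def checkpoint(self):
        # The current contents become the recovery point, so older
        # log entries are no longer needed.
        self.undo_log.clear()

    def write(self, addr, value):
        # Save the old value before overwriting so the write can be undone.
        self.undo_log.append((addr, self.cells[addr]))
        self.cells[addr] = value

    def rollback(self):
        # Undo writes in reverse order to restore the checkpointed state.
        while self.undo_log:
            addr, old_value = self.undo_log.pop()
            self.cells[addr] = old_value


mem = RollbackMemory(4)
mem.write(0, 7)
mem.checkpoint()      # state [7, 0, 0, 0] is now the recovery point
mem.write(0, 99)
mem.write(2, 5)
mem.write(0, 100)
mem.rollback()        # an error was detected: restore the checkpoint
print(mem.cells)      # [7, 0, 0, 0]
```

The sketch also makes the three disadvantages concrete: every write pays for an extra log append (error-free overhead), the log occupies space (storage overhead), and recovery takes time proportional to the log length (recovery latency).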
In this paper, we present ReVive, a novel, cost-effective scheme for rollback recovery in shared-memory multiprocessors with distributed memory. ReVive is compatible with off-the-shelf processors, caches, and memory modules. It only requires modifications to the directory controllers of the machine, to perform memory-based distributed parity protection and logging in the background. Both hardware and storage requirements are very modest.

ReVive has both good error-free performance and quick recovery from a wide class of errors, including permanent loss of an entire node. Our experiments with 12 applications on a simulated 16-processor system show that the average overhead of error-free execution is only 6.3%. When an error occurs, the system is unavailable for less than half a second on average, including the correct work lost due to the rollback. The resulting availability is better than 99.999%, even when errors occur as often as once per day.

This paper is organized as follows: Section 2 presents a novel taxonomy of BER schemes for multiprocessors; Section 3 presents the design of ReVive; Section 4 explains some implementation issues in ReVive; Section 5 presents our evaluation setup; Section 6 contains the evaluation; Section 7 describes related work; finally, Section 8 concludes.

2 BER in Multiprocessors: A Taxonomy

To understand the design space of BER schemes, we have designed a taxonomy that classifies the schemes according to three axes: how checkpoint consistency is achieved, how the separation between the checkpoint and the working data is done, and how checkpoint storage is protected from errors. Figure 1 shows the resulting design space. We now consider each axis in turn.

1063-6897/02 $17.00 © 2002 IEEE

[Figure 1. Design space of multiprocessor BER schemes. Labels recoverable from the figure: Consistency (Global, Coordinated Local, Uncoordinated Local); Separation (Partial, Logging, Renaming, Buffering); Storage; the ReVive design point is marked.]

2.1 Checkpoint Consistency

Since threads executing on different processors interact with each other, they may create recovery dependences: when one processor is rolled back, it may be necessary to also roll back other processors. To maintain checkpoint consistency, three approaches are used:

Global. All processors periodically synchronize to create a single, global checkpoint [8, 13, 14, 15, 20, 21]. This is the simplest approach.

Coordinated Local. Each processor periodically

