Berkeley COMPSCI 258 - ReVive - Cost-Effective Architectural Suppor - D384322

School name University of California, Berkeley

Course Compsci 258- Parallel Processors

Pages 12


ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors*

Milos Prvulovic, Zheng Zhang†, Josep Torrellas
University of Illinois at Urbana-Champaign
Hewlett-Packard Laboratories
http://iacoma.cs.uiuc.edu

Abstract

This paper presents ReVive, a novel general-purpose rollback recovery mechanism for shared-memory multiprocessors. ReVive carefully balances the conflicting requirements of availability, performance, and hardware cost. ReVive performs checkpointing, logging, and distributed parity protection, all memory-based. It enables recovery from a wide class of errors, including the permanent loss of an entire node. To maintain high performance, ReVive includes specialized hardware that performs frequent operations in the background, such as log and parity updates. To keep the cost low, more complex checkpointing and recovery functions are performed in software, while the hardware modifications are limited to the directory controllers of the machine. Our simulation results on a 16-processor system indicate that the average error-free execution time overhead of using ReVive is only 6.3%, while the achieved availability is better than 99.999% even when the errors occur as often as once per day.

1 Introduction

Cache-coherent shared-memory multiprocessors are seeing widespread use in commercial, technical, and scientific applications. In recent years, fault-tolerance has become an increasingly important feature of such systems. In some commercial applications, high availability is needed, as business transactions are being processed by the system. Some applications execute for a long time and require a highly reliable execution environment. Examples of such applications are those that mine large data sets and many simulations. Unfortunately, both availability and reliability are difficult to achieve in modern large systems.
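The availability figure quoted in the abstract can be related to a per-error downtime budget with simple arithmetic. The sketch below assumes a deliberately simple model that is not taken from the paper: errors arrive at a fixed interval and each one costs a fixed amount of unavailable time. The function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def availability(error_interval_s, downtime_per_error_s):
    """Fraction of time the system is usable.

    Assumed model: one error per interval, each costing a fixed downtime.
    """
    return 1.0 - downtime_per_error_s / error_interval_s


def downtime_budget(target_availability, error_interval_s):
    """Largest downtime per error that still meets the target availability."""
    return (1.0 - target_availability) * error_interval_s


# With one error per day, 99.999% availability leaves under a second per error.
budget = downtime_budget(0.99999, SECONDS_PER_DAY)
print(f"downtime budget per error: {budget:.3f} s")

# Half a second of unavailability per daily error stays above the target.
print(f"availability at 0.5 s per error: {availability(SECONDS_PER_DAY, 0.5):.7f}")
```

Under this model, an average unavailability of less than half a second per error, the figure reported later in the introduction, fits inside the budget implied by 99.999% availability at one error per day.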
Improvements in silicon technology result in smaller feature sizes, while power dissipation constraints result in lower operating voltages. Both of these make modern integrated circuits prone to transient and permanent faults. In large systems the problem is worse, as those systems contain many interacting components that must all operate correctly.

To deal with these problems, much work has been done in error recovery. Typically, error recovery mechanisms are categorized into Forward and Backward Error Recovery (FER and BER). With FER, hardware redundancy is added to the system, which makes it possible to determine the correct outcome of an operation, even if one (or more) of the participating devices fails. It is possible to design cost-effective FER that targets only a single device, such as the processor core [3, 28, 30]. However, general-purpose FER is not cheap. The most popular such method is triple-modular redundancy (TMR), in which each operation is performed by three identical devices and a majority vote decides the correct result.

*This work was supported in part by the National Science Foundation under grants CCR-9970488, EIA-0081307, EIA-0072102, and CHE-0121357; by DARPA under grant F30602-01-C-0078; and by gifts from IBM, Intel, and Hewlett-Packard.
†Currently at Microsoft Research Asia: [email protected].

For most systems, the cost of TMR is prohibitively high. BER, also called rollback recovery or checkpointing, can be used in such systems. With rollback recovery, the system stores information about its past state. When an error is detected, this information allows the system to be restored into a previous error-free state. The main advantage of BER is that no hardware replication is required. However, it has three disadvantages: the performance overhead during error-free execution, storage overhead, and the higher recovery latency.
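The contrast between the two recovery styles can be illustrated with a minimal sketch. It is a software toy under stated assumptions, not ReVive's mechanism: `tmr_vote` and `CheckpointedState` are hypothetical names, the voter compares three already-computed results, and the checkpoint is a whole-state copy rather than the memory-based logging and parity that the paper describes.

```python
import copy
from collections import Counter


def tmr_vote(results):
    """FER-style masking: return the value that at least two of three replicas agree on."""
    if len(results) != 3:
        raise ValueError("TMR expects exactly three replica results")
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority among replicas")
    return value


class CheckpointedState:
    """BER-style recovery: keep a copy of the last saved state and restore it on error."""

    def __init__(self, state):
        self.state = state
        self._saved = copy.deepcopy(state)

    def checkpoint(self):
        # Cost paid during error-free execution: time and storage for the copy.
        self._saved = copy.deepcopy(self.state)

    def rollback(self):
        # Work done since the last checkpoint is lost and must be redone.
        self.state = copy.deepcopy(self._saved)


# FER: one faulty replica is outvoted, so no work is lost.
print(tmr_vote([7, 7, 9]))

# BER: a detected error discards everything after the last checkpoint.
machine = CheckpointedState({"x": 1})
machine.state["x"] = 2
machine.checkpoint()
machine.state["x"] = -999  # stands in for a corrupted update
machine.rollback()
print(machine.state)
```

The voter needs three copies of the computation, which is the replication cost the text attributes to TMR. The checkpointed version needs only one, but it pays for the copy on every checkpoint and loses the work done since then, which corresponds to the overhead and recovery-latency disadvantages listed above.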
In this paper, we present ReVive, a novel, cost-effective scheme for rollback recovery in shared-memory multiprocessors with distributed memory. ReVive is compatible with off-the-shelf processors, caches, and memory modules. It only requires modifications to the directory controllers of the machine, to perform memory-based distributed parity protection and logging in the background. Both hardware and storage requirements are very modest.

ReVive has both good error-free performance and quick recovery from a wide class of errors, including permanent loss of an entire node. Our experiments with 12 applications on a simulated 16-processor system show that the average overhead of error-free execution is only 6.3%. When an error occurs, the system is unavailable for less than half a second on average, including the correct work lost due to the rollback. The resulting availability is better than 99.999%, even when errors occur as often as once per day.

This paper is organized as follows: Section 2 presents a novel taxonomy of BER schemes for multiprocessors; Section 3 presents the design of ReVive; Section 4 explains some implementation issues in ReVive; Section 5 presents our evaluation setup; Section 6 contains the evaluation; Section 7 describes related work; finally, Section 8 concludes.

2 BER in Multiprocessors: A Taxonomy

To understand the design space of BER schemes, we have designed a taxonomy that classifies the schemes according to three axes: how checkpoint consistency is achieved, how the separation between the checkpoint and the working data is done, and how checkpoint storage is protected from errors. Figure 1 shows the resulting design space. We now consider each axis in turn.

1063-6897/02 $17.00 © 2002 IEEE

[Figure 1 is not reproduced in this extraction. Legible labels: axes Consistency (Global, Coordinated Local, Uncoordinated Local), Separation (labels include Partial, Logging, Renaming, Buffering), and Storage; ReVive is marked in the space.]

Figure 1. Design space of multiprocessor BER schemes.

2.1 Checkpoint Consistency

Since threads executing on different processors interact with each other, they may create recovery dependences: when one processor is rolled back, it may be necessary to also roll back other processors. To maintain checkpoint consistency, three approaches are used:

Global. All processors periodically synchronize to create a single, global checkpoint [8, 13, 14, 15, 20, 21]. This is the simplest approach.

Coordinated Local. Each processor periodically
