Practical Byzantine Fault Tolerance and Proactive Recovery MIGUEL CASTRO Microsoft Research and BARBARA LISKOV MIT Laboratory for Computer Science Our growing reliance on online services accessible on the Internet demands highly available systems that provide correct service without interruptions Software bugs operator mistakes and malicious attacks are a major cause of service interruptions and they can cause arbitrary behavior that is Byzantine faults This article describes a new replication algorithm BFT that can be used to build highly available systems that tolerate Byzantine faults BFT can be used in practice to implement real services it performs well it is safe in asynchronous environments such as the Internet it incorporates mechanisms to defend against Byzantine faulty clients and it recovers replicas proactively The recovery mechanism allows the algorithm to tolerate any number of faults over the lifetime of the system provided fewer than 1 3 of the replicas become faulty within a small window of vulnerability BFT has been implemented as a generic program library with a simple interface We used the library to implement the first Byzantine fault tolerant NFS file system BFS The BFT library and BFS perform well because the library incorporates several important optimizations the most important of which is the use of symmetric cryptography to authenticate messages The performance results show that BFS performs 2 faster to 24 slower than production implementations of the NFS protocol that are not replicated This supports our claim that the BFT library can be used to build practical systems that tolerate Byzantine faults Categories and Subject Descriptors C 2 0 Computer Communication Networks General Security and protection C 2 4 Computer Communication Networks Distributed Systems Client server D 4 3 Operating Systems File Systems Management D 4 5 Operating Systems Reliability Fault tolerance D 4 6 Operating Systems Security and Protection Access controls authentication cryptographic controls D 4 8 Operating Systems Performance Measurements General Terms Security Reliability Algorithms Performance Measurement Additional Key Words and Phrases Byzantine fault tolerance state machine replication proactive recovery asynchronous systems state transfer This research was partially supported by DARPA under contract F30602 98 1 0237 monitored by the Air Force Research Laboratory Part of this work was done while M Castro was with the MIT Laboratory for Computer Science and during this time he was partially supported by Praxis XXI and Gulbenkian fellowships Authors addresses M Castro Microsoft Research 7 J J Thomson Avenue Cambridge CB3 0FB UK email mcastro microsoft com B Liskov MIT Laboratory for Computer Science 545 Technology Square Cambridge MA 02139 Permission to make digital hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage the copyright notice the title of the publication and its date appear and notice is given that copying is by permission of the ACM Inc To copy otherwise to republish to post on servers or to redistribute to lists requires prior specific permission and or a fee C 2002 ACM 0734 2071 02 1100 0398 5 00 ACM Transactions on Computer Systems Vol 20 No 4 November 2002 Pages 398 461 Practical Byzantine Fault Tolerance and Proactive Recovery 399 1 INTRODUCTION We are increasingly dependent on services provided by computer systems and our vulnerability to computer failures is growing as a result We would like these systems to be highly available they should work correctly and they should provide service without interruptions There is a large body of research on replication techniques to implement highly available systems The problem is that most research on replication has focused on techniques that tolerate benign faults e g Alsberg and Day 1976 Gifford 1979 Oki and Liskov 1988 Lamport 1989 and Liskov et al 1991 these techniques assume components fail by stopping or by omitting some steps They may not provide correct service if a single faulty component violates this assumption Unfortunately this assumption is not valid because malicious attacks operator mistakes and software errors are common causes of failure and they can cause faulty nodes to exhibit arbitrary behavior that is Byzantine faults The growing reliance of industry and government on computer systems provides the motif for malicious attacks and the increased connectivity to the Internet exposes these systems to more attacks Operator mistakes are also cited as one of the main causes of failure Murphy and Levidow 2000 In addition the number of software errors is increasing due to the growth in size and complexity of software Techniques that tolerate Byzantine faults Pease et al 1980 Lamport et al 1982 provide a potential solution to this problem because they make no assumptions about the behavior of faulty processes There is a significant body of work on agreement and replication techniques that tolerate Byzantine faults However most earlier work e g Canetti and Rabin 1992 Reiter 1996 Malkhi and Reiter 1996b Garay and Moses 1998 and Khilstrom et al 1998 either concerns techniques that are too inefficient to be used in practice or relies on assumptions that can be invalidated easily by an attacker For example it is dangerous to rely on synchrony Lamport 1984 for safety in the Internet that is to rely on bounds on message delays and process speeds An attacker may compromise the correctness of a service by delaying nonfaulty nodes or the communication between them until the bounds are exceeded Such a denial of service attack is generally easier than gaining control over a nonfaulty node This article describes BFT a new algorithm for state machine replication Lamport 1978 Schneider 1990 that offers both liveness and safety provided at most b n 1 3c out of a total of n replicas are faulty This means that clients eventually receive replies to their requests and those replies are correct according to linearizability Herlihy and Wing 1987 Castro and Liskov 1999a BFT is the first Byzantine fault tolerant state machine replication algorithm that is safe in asynchronous systems such as the Internet it does not rely on any synchrony assumption to provide safety In particular it never returns bad replies even in the presence of denial of service attacks Additionally it guarantees liveness
View Full Document
Unlocking...