1 INTRODUCTION
2 SCALABILITY AND RELIABILITY
2.1 Throughput instability
2.2 Micropartitions
2.3 Convoys
2.4 Request and retransmission storms
3 BIMODAL MULTICAST
4 FINDINGS SO FAR AND PROVIDING STRONGER GUARANTEES
4.1 General insight
4.2 Our solution: a non-traditional stack architecture
5 SCALING VIRTUAL SYNCHRONY
5.1 Scaling virtual synchrony: a back-of-the-envelope analysis
5.1.1 Virtual synchrony, unicast implementation.
5.1.2 Virtual synchrony, implemented over Pbcast.
5.2 Algorithm description
6 EXPERIMENTAL EVALUATION
6.1 Scalability
6.2 Insensitivity to perturbations
6.3 Laziness of the new non-traditional stack
6.4 Message load on members
6.5 Effect of larger leader committees
6.6 Summary of experimental results
7 SUMMARY AND CONCLUSIONS

QUALITY AND RELIABILITY ENGINEERING INTERNATIONAL
Qual. Reliab. Engng. Int. 2002; 18: 165–184 (DOI: 10.1002/qre.473)

FIGHTING FIRE WITH FIRE: USING RANDOMIZED GOSSIP TO COMBAT STOCHASTIC SCALABILITY LIMITS

INDRANIL GUPTA, KENNETH P. BIRMAN* AND ROBBERT VAN RENESSE
Cornell University, Ithaca, NY 14853, USA

*Correspondence to: K. P. Birman, Department of Computer Science, Cornell University, 4130 Upson Hall, Ithaca, NY 14853, USA. Email: [email protected]
Contract/grant sponsor: DARPA/AFRL-IFGA; contract/grant number: F30602-99-1-0532
Contract/grant sponsor: NSF-CISE; contract/grant number: 9703470
Contract/grant sponsor: AFRL-IFGA Information Assurance Institute
Contract/grant sponsor: Microsoft Research
Contract/grant sponsor: Intel Corporation

SUMMARY
The mechanisms used to improve the reliability of distributed systems often limit performance and scalability. Focusing on one widely-used definition of reliability, we explore the origins of this phenomenon and conclude that it reflects a tradeoff arising deep within the typical protocol stack. Specifically, we suggest that protocol designs often disregard the high cost of infrequent events. When a distributed system is scaled, both the frequency and the overall cost of such events often grow with the size of the system. This triggers an O(n^2) phenomenon, which becomes visible above some threshold sizes. Our findings suggest that it would be more effective to construct large-scale reliable systems where, unlike traditional protocol stacks, lower layers use randomized mechanisms, with probabilistic guarantees, to overcome low-probability events. Reliability and other end-to-end properties are introduced closer to the application. We employ a back-of-the-envelope analysis to quantify this phenomenon for a class of strongly reliable multicast problems. We construct a non-traditional stack, as described above, that implements virtually synchronous multicast. Experimental results reveal that virtual synchrony over a non-traditional, probabilistic stack helps break through the scalability barrier faced by traditional implementations of the protocol. Copyright 2002 John Wiley & Sons, Ltd.

KEY WORDS: group communication; reliable multicast; scalability; randomized algorithms; non-traditional stack; virtual synchrony

1. INTRODUCTION

The scalability of distributed protocols and systems is a major determinant of success in demanding settings. This paper focuses on the scalability of distributed protocols, in group communication systems, providing some form of guaranteed reliability. Examples include the virtual synchrony protocols for reliable group communication [1], scalable reliable multicast (SRM) [2], and the reliable multicast transport protocol (RMTP) [3]. We argue that the usual architecture for supporting reliability exposes mechanisms of this sort to serious scalability problems.

Reliability can be understood as a point within a spectrum of possible guarantees. At one extreme of this spectrum one finds very costly, very strong guarantees (for example, the Byzantine Agreement used to support Replicated State Machines). Replicated database systems represent a step in the direction of weaker, cheaper goals: one-copy serializability, typically implemented by some form of multiphase commit using a quorum read and write mechanism. Lamport's Paxos architecture and the consensus-based approach of Chandra and Toueg achieve similar properties. Virtual synchrony, the model upon which we focus here, is perhaps the weakest and cheapest approach that can still be rigorously specified, modeled and reasoned about. Beyond this point in the spectrum, one finds best-effort reliability models, lacking rigorous semantics, but more typical of the modern Internet; SRM and RMTP are of this nature. The usual assumption is that, being the weakest reliability properties, they should also be the cheapest to achieve and most scalable. One surprising finding of our work is that this usual belief is incorrect.

Traditionally, one discusses reliability by posing a problem in a setting exposed to some class of faults. Fault-tolerant protocols solving the problem can then be compared. The protocols cited above provide multicast tolerance of message loss and endpoint failures, and discussions of their behavior would typically look at throughput and latency under normal conditions, message complexity, background overheads, and at the degree to which failures disrupt these properties.

Oddly, even thorough performance analyses generally focus on the extremes: performance of the protocol under ideal conditions (when nothing goes wrong), and the performance impact when injected failures disrupt execution. This paper adopts a different perspective; it looks at reliable protocols under the influence of what might be called mundane transient problems, such as network or processor scheduling delays and brief periods of packet loss. One would expect reliable protocols to ride out such events, but we find that this is rarely the case, particularly if we look at the impact of a disruptive event as a function of scale. In fact, reliable protocols degrade dramatically under this type of mundane stress, a phenomenon attributable to low-probability events that become both more frequent and more costly as the scale of the system grows.

After presenting these arguments, we shift attention to an old class of protocols that resemble NNTP, the gossip-based algorithm used to propagate 'news' on the Internet. These turn out to be scalable under the same style of analysis that predicts poor scalability for their non-gossip counterparts.

This finding leads us to suggest that protocol designers should fight fire with fire, employing probabilistic techniques that
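To make the gossip intuition above concrete, the following is a minimal simulation sketch of push-style gossip dissemination of a single multicast. It is an illustrative assumption rather than the paper's Pbcast protocol or experimental code; the function name gossip_rounds and the fanout parameter are invented for the example.

    import random

    def gossip_rounds(n, fanout=3, seed=None):
        """Simulate push-style gossip dissemination of one multicast among
        n members. In each round, every member that already holds the
        message forwards it to `fanout` members chosen uniformly at random.
        Returns (rounds taken, fraction of members reached)."""
        rng = random.Random(seed)
        infected = {0}          # member 0 is the original sender
        rounds = 0
        while True:
            newly_reached = set()
            for _ in infected:  # each informed member gossips once per round
                for target in rng.sample(range(n), fanout):
                    if target not in infected:
                        newly_reached.add(target)
            if not newly_reached:
                break
            infected |= newly_reached
            rounds += 1
        return rounds, len(infected) / n

    if __name__ == "__main__":
        for n in (100, 1_000, 10_000):
            r, coverage = gossip_rounds(n, fanout=3, seed=1)
            print(f"n={n:6d}  rounds={r:3d}  coverage={coverage:.3f}")

Under a model of this kind, each member sends only a constant number of messages per round, dissemination typically completes in a number of rounds that grows roughly logarithmically with n, and there remains a small residual probability that some members are missed. That combination of bounded per-member load with probabilistic rather than absolute delivery is the trade the paper exploits, reintroducing stronger end-to-end properties such as virtual synchrony above the gossip layer.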

