Beyond Availability: Towards a Deeper Understanding of Machine Failure Characteristics in Large Distributed Systems

Praveen Yalagandula, Suman Nath, Haifeng Yu, Phillip B. Gibbons, Srinivasan Seshan
The University of Texas at Austin, Carnegie Mellon University, Intel Research Pittsburgh

Abstract

Although many previous research efforts have investigated machine failure characteristics in distributed systems, availability research has reached a point where properties beyond these initial findings become important. In this paper, we analyze traces from three large distributed systems to answer several subtle questions regarding machine failure characteristics. Based on our findings, we derive a set of fundamental principles for designing highly available distributed systems. Using several case studies, we further show that our design principles can significantly influence the availability design choices in existing systems.

1 Introduction

A key challenge in designing large, long-running distributed systems is to mask the failures that arise among the system components. This challenge becomes more acute as the number of machines and the population of users increase, i.e., precisely when the system becomes more useful. In order to design systems resilient to the machine failure characteristics in real deployments, it is necessary to study these characteristics and develop design principles tailored to them.

In this paper, we analyze traces [1, 2, 12] from three large distributed systems (PlanetLab, the Domain Name System (DNS), and a collection of over 100 web servers) in order to characterize machine failures in these systems. Although many previous research efforts [2, 3, 5, 6, 13] have also investigated machine failure characteristics, our study focuses on important properties beyond these initial findings and suggests how these properties may strongly influence the design of large distributed systems. In particular, we start by addressing the following important but perhaps subtle questions not answered by previous studies:

- Is high machine availability synonymous with high machine MTTF (mean time to failure) and low machine MTTR (mean time to repair)? That is, in real deployments, does a machine that is up a high fraction of the time (high availability) tend to have both high MTTF and low MTTR?

- Given a machine in a real deployment, can we predict its MTTF and MTTR with reasonable accuracy based on its history? Moreover, can we predict a crash event or a recovery event with reasonable accuracy based on its history? In other words, can we also predict TTF (time to failure) and TTR (time to repair)?

- What is the level of correlation among machine failures in a large-scale distributed system? Is the level sufficiently high that we should take it into account in system design?

The answers to these questions can significantly influence the design of large distributed systems targeting high system availability. For example, both Total Recall [4] and CFS [7] determine the degree of replication assuming failure independence; if the correlation level is high, such designs need to be revisited. As another example, consider End System Multicast (ESM) [9], an overlay multicast system that utilizes well-provisioned infrastructure nodes, called waypoints, to construct better multicast trees. The failure of a waypoint causes a temporary interruption of the service, followed by the repair of the multicast tree with new waypoints. Clearly, the system availability of ESM is affected by the MTTF, rather than the availability, of the machines selected as waypoints. On the other hand, there are other systems (see Section 3) that care more about the MTTR of their machines. If high availability of a machine does not always imply good MTTF or MTTR, then a system should not simply favor the use of machines with high availability.
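To make this point concrete, consider the standard relation availability = MTTF / (MTTF + MTTR). The short Python sketch below (not from the paper; the machine profiles and numbers are hypothetical) shows two machines with identical availability but very different failure behavior, which is exactly why an MTTF-sensitive system and an MTTR-sensitive system would rank them differently.

    # Minimal sketch, assuming the standard relation
    #   availability = MTTF / (MTTF + MTTR).
    # The machines and numbers below are hypothetical illustrations.

    def availability(mttf_hours: float, mttr_hours: float) -> float:
        """Long-run fraction of time a machine is up."""
        return mttf_hours / (mttf_hours + mttr_hours)

    # Machine A: fails rarely but takes a long time to repair.
    a = availability(mttf_hours=1000.0, mttr_hours=10.0)

    # Machine B: fails 100x more often but recovers 100x faster.
    b = availability(mttf_hours=10.0, mttr_hours=0.1)

    print(a, b)  # both ~0.9901, i.e., the same ~99% availability
    # An MTTF-sensitive system (e.g., ESM waypoints) would prefer A, while
    # an MTTR-sensitive system would prefer B, despite equal availability.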
Based on the findings from our study of the aforementioned questions, we derive four fundamental design principles for highly available distributed systems:

P1: Good availability does not necessarily imply good MTTF and MTTR; thus, distributed systems should monitor these aspects separately and use them accordingly.

P2: A machine's MTTF and MTTR can be predicted with reasonable accuracy, and can thus be utilized in a design.

P3: A design should not expect to be able to predict individual failure and recovery events (TTF and TTR) with reasonable accuracy based on the current uptime, downtime, MTTF, or MTTR.

P4: Large-scale correlated failures are common in real systems, and they significantly hurt the availability of traditional fault-tolerance mechanisms that assume independent failures. As a result, correlation should be taken into account in the design of highly available distributed systems.

Using case studies, we further show that our design principles may significantly influence availability design choices in systems such as CFS [7], Om [17], RAMBO [10], ESM [9], and Majority Voting [14]. We are also fully aware that the set of questions answered in this paper is exemplary rather than exhaustive; our hope is that this paper will motivate additional research in the area.

2 Findings and Implications

2.1 Methodology

Our study is based on traces from three large distributed systems: PlanetLab, DNS, and a collection of over 100 web servers. We call the three traces the PL trace, the DNS trace, and the WS trace, respectively. The DNS and web server traces are intended to be representative of public-access machines maintained by different administrative domains, while the PL trace potentially describes the behavior of a centrally administered distributed system that is used mainly for research purposes. There are also many other failure traces available, such as for P2P systems [3, 13] and for campus-wide networks [5]. In this paper, we intentionally focus on machine failure characteristics in non-P2P, wide-area distributed systems.

All three traces are probe traces rather than node up/down logs. The nature of these probe traces requires us to carefully address the effects of network failures that manifest themselves as node failures. Also, we are unable to detect short-duration failures or recoveries between probes. On the other hand, using probe traces enables us to study public commercial systems that typically do not publish up/down logs.

The PL trace [1] contains probes between all pairs of nodes (277 on average) in PlanetLab from March 2003 to June 2004. Each probe consists of 10 pings, and we say that a probe fails if and only if all 10 pings fail. We refer to a complete cycle of all-pair probes as a probe interval, which is
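As a small illustration of the probe-failure rule just described (a probe fails if and only if all 10 of its pings fail), the Python sketch below processes one probe interval. The trace representation assumed here (one list of boolean ping outcomes per probe) is hypothetical and not the authors' actual tooling.

    # Minimal sketch of the probe-failure rule described above.
    # Assumption: each probe is a list of boolean ping outcomes
    # (True = ping succeeded); this representation is hypothetical.

    from typing import List

    def probe_failed(ping_outcomes: List[bool]) -> bool:
        """A probe fails iff every one of its (nominally 10) pings failed."""
        return not any(ping_outcomes)

    def count_failed_probes(probe_interval: List[List[bool]]) -> int:
        """Count failed probes within one all-pairs probe interval."""
        return sum(1 for pings in probe_interval if probe_failed(pings))

    # Example: the first probe loses all 10 pings (counted as failed);
    # the second loses 9 of 10 pings (still counted as successful).
    interval = [[False] * 10, [False] * 9 + [True]]
    print(count_failed_probes(interval))  # -> 1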

