Beyond Availability: Towards a Deeper Understanding of Machine Failure Characteristics in Large Distributed Systems

Praveen Yalagandula, Suman Nath, Haifeng Yu, Phillip B. Gibbons, Srinivasan Seshan
The University of Texas at Austin, Carnegie Mellon University, Intel Research Pittsburgh

Abstract

Although many previous research efforts have investigated machine failure characteristics in distributed systems, availability research has reached a point where properties beyond these initial findings become important. In this paper, we analyze traces from three large distributed systems to answer several subtle questions regarding machine failure characteristics. Based on our findings, we derive a set of fundamental principles for designing highly available distributed systems. Using several case studies, we further show that our design principles can significantly influence the availability design choices in existing systems.

1 Introduction

A key challenge in designing large, long-running distributed systems is to mask the failures that arise among the system components. This challenge becomes more acute as the number of machines and the population of users increase, i.e., precisely when the system becomes more useful. In order to design systems resilient to the machine failure characteristics in real deployments, it is necessary to study these characteristics and develop design principles tailored to them.

In this paper, we analyze traces [1, 2, 12] from three large distributed systems (PlanetLab, the Domain Name System (DNS), and a collection of over 100 web servers) in order to characterize machine failures in these systems. Although many previous research efforts [2, 3, 5, 6, 13] have also investigated machine failure characteristics, our study focuses on important properties beyond these initial findings and suggests how these properties may strongly influence the design of large distributed systems. In particular, we start by addressing the following important but perhaps subtle questions not answered by previous studies:

- Is high machine availability synonymous with high machine MTTF (mean time to failure) and low machine MTTR (mean time to repair)? That is, in real deployments, does a machine that is up a high fraction of the time (high availability) tend to have both high MTTF and low MTTR?

- Given a machine in a real deployment, can we predict its MTTF and MTTR with reasonable accuracy based on its history? Moreover, can we predict a crash event or a recovery event with reasonable accuracy based on its history? In other words, can we also predict TTF (time to failure) and TTR (time to repair)?

- What is the level of correlation among machine failures in a large-scale distributed system? Is the level sufficiently high that we should take it into account in system design?

The answers to these questions can significantly influence the design of large distributed systems targeting high system availability. For example, both Total Recall [4] and CFS [7] determine the degree of replication assuming failure independence; if the correlation level is high, such designs need to be revisited. As another example, consider End System Multicast (ESM) [9], an overlay multicast system that utilizes well-provisioned infrastructure nodes, called waypoints, to construct better multicast trees. The failure of a waypoint causes a temporary interruption of the service, followed by the repair of the multicast tree with new waypoints. Clearly, the system availability of ESM is affected by the MTTF, rather than the availability, of the machines selected as waypoints. On the other hand, there are other systems (see Section 3) that care more about the MTTR of their machines. If high availability of a machine does not always imply good MTTF or MTTR, then a system should not simply favor the use of machines with high availability.
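To make this point concrete, consider the standard relation availability = MTTF / (MTTF + MTTR). The short Python sketch below (not from the paper; the machine profiles and numbers are hypothetical) shows two machines with identical availability but very different failure behavior, which is exactly why an MTTF-sensitive system and an MTTR-sensitive system would rank them differently.

    # Minimal sketch, assuming the standard relation
    #   availability = MTTF / (MTTF + MTTR).
    # The machines and numbers below are hypothetical illustrations.

    def availability(mttf_hours: float, mttr_hours: float) -> float:
        """Long-run fraction of time a machine is up."""
        return mttf_hours / (mttf_hours + mttr_hours)

    # Machine A: fails rarely but takes a long time to repair.
    a = availability(mttf_hours=1000.0, mttr_hours=10.0)

    # Machine B: fails 100x more often but recovers 100x faster.
    b = availability(mttf_hours=10.0, mttr_hours=0.1)

    print(a, b)  # both ~0.9901, i.e., the same ~99% availability
    # An MTTF-sensitive system (e.g., ESM waypoints) would prefer A, while
    # an MTTR-sensitive system would prefer B, despite equal availability.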
Based on the findings from our study of the aforementioned questions, we derive four fundamental design principles for highly available distributed systems:

P1: Good availability does not necessarily imply good MTTF and MTTR; thus, distributed systems should monitor these aspects separately and use them accordingly.

P2: A machine's MTTF and MTTR can be predicted with reasonable accuracy, and can thus be utilized in a design.

P3: A design should not expect to be able to predict individual failure and recovery events (TTF and TTR) with reasonable accuracy based on the current uptime, downtime, MTTF, or MTTR.

P4: Large-scale correlated failures are common in real systems, and they significantly hurt the availability of traditional fault-tolerance mechanisms that assume independent failures. As a result, correlation should be taken into account in the design of highly available distributed systems.

Using case studies, we further show that our design principles may significantly influence availability design choices in systems such as CFS [7], Om [17], RAMBO [10], ESM [9], and Majority Voting [14]. We are also fully aware that the set of questions answered in this paper is exemplary rather than exhaustive; our hope is that this paper will motivate additional research in the area.

2 Findings and Implications

2.1 Methodology

Our study is based on traces from three large distributed systems: PlanetLab, DNS, and a collection of over 100 web servers. We call the three traces the PL trace, the DNS trace, and the WS trace, respectively. The DNS and web server traces are intended to be representative of public-access machines maintained by different administrative domains, while the PL trace potentially describes the behavior of a centrally administered distributed system that is used mainly for research purposes. There are also many other failure traces available, such as for P2P systems [3, 13] and for campus-wide networks [5]. In this paper, we intentionally focus on machine failure characteristics in non-P2P, wide-area distributed systems.

All three traces are probe traces rather than node up/down logs. The nature of these probe traces requires us to carefully address the effects of network failures that manifest themselves as node failures. Also, we are unable to detect short-duration failures or recoveries between probes. On the other hand, using probe traces enables us to study public commercial systems that typically do not publish up/down logs.

The PL trace [1] contains probes between all pairs of nodes (277 on average) in PlanetLab from March 2003 to June 2004. Each probe consists of 10 pings, and we say that a probe fails if and only if all 10 pings fail. We refer to a complete cycle of all-pair probes as a probe interval, which is
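As a small illustration of the probe-failure rule just described (a probe fails if and only if all 10 of its pings fail), the Python sketch below processes one probe interval. The trace representation assumed here (one list of boolean ping outcomes per probe) is hypothetical and not the authors' actual tooling.

    # Minimal sketch of the probe-failure rule described above.
    # Assumption: each probe is a list of boolean ping outcomes
    # (True = ping succeeded); this representation is hypothetical.

    from typing import List

    def probe_failed(ping_outcomes: List[bool]) -> bool:
        """A probe fails iff every one of its (nominally 10) pings failed."""
        return not any(ping_outcomes)

    def count_failed_probes(probe_interval: List[List[bool]]) -> int:
        """Count failed probes within one all-pairs probe interval."""
        return sum(1 for pings in probe_interval if probe_failed(pings))

    # Example: the first probe loses all 10 pings (counted as failed);
    # the second loses 9 of 10 pings (still counted as successful).
    interval = [[False] * 10, [False] * 9 + [True]]
    print(count_failed_probes(interval))  # -> 1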

