Computer Science 425 Distributed Systems
CS 425 / CSE 424 / ECE 428, Fall 2010
Indranil Gupta (Indy)
September 2, 2010
Lecture 4
Reading: Sections 12.1 and part of 2.3.2
2010, I. Gupta

Your new datacenter
•You’ve been put in charge of a datacenter, and your manager has told you, “Oh no! We don’t have any failures in our datacenter!”
•Do you believe him/her?
•What would be your first responsibility?
•Build a failure detector
•What are some things that could go wrong if you didn’t do this?

To build a failure detector
•You have a few options
1. Hire 1000 people, each to monitor one machine in the datacenter and report to you when it fails.
2. Write a failure detector program (distributed) that automatically detects failures and reports to your workstation.
Which is preferable, and why?

Two Different System Models
Whenever someone gives you a distributed computing problem, the first question you want to ask is, “What is the model under which I need to solve the problem?”
Synchronous Distributed System
•Each message is received within bounded time
•Each step in a process takes lb < time < ub
•(Each local clock’s drift has a known bound)
Examples: Multiprocessor systems
Asynchronous Distributed System
•No bounds on message transmission delays
•No bounds on process execution
•(The drift of a clock is arbitrary)
Examples: Internet, wireless networks, datacenters, most real systems

Failure Model
•Process omission failure
–Crash-stop (fail-stop): a process halts and does not execute any further operations
–Crash-recovery: a process halts, but then recovers (reboots) after a while
•We will focus on crash-stop failures
–They are easy to detect in synchronous systems
–Not so easy in asynchronous systems

What’s a failure detector?
[Figure: two processes, pi and pj. pj suffers a crash-stop failure (pj is a failed process). pi, a non-faulty (alive) process, needs to know about pj’s failure.]
There are two main flavors of failure detectors…

I. Ping-Ack Protocol
[Figure: pi sends a ping to pj and pj returns an ack; pi needs to know about pj’s failure.]
- pi queries pj once every T time units
- pj replies
- if pj does not respond within another T time units of being sent the ping, pi detects pj as failed
If pj fails, then within T time units, pi will send it a ping message. pi will time out within another T time units.
Worst-case detection time = 2T
The waiting time ‘T’ can be parameterized.

II. Heartbeating Protocol
[Figure: pj sends heartbeats to pi; pi needs to know about pj’s failure.]
- pj maintains a sequence number
- pj sends pi a heartbeat with incremented seq. number after every T time units
- if pi has not received a new heartbeat for the past, say, 3*T time units since it received the last heartbeat, then pi detects pj as failed
If T >> round-trip time of messages, then worst-case detection time ~ 3*T (why?)
The ‘3’ can be changed to any positive number since it is a parameter

In a Synchronous System
•The Ping-ack and Heartbeat failure detectors are always correct
–Ping-ack: set waiting time ‘T’ to be > round-trip time upper bound
–Heartbeat: set waiting time ‘3*T’ to be > round-trip time upper bound
•The following property is guaranteed:
–If a process pj fails, then pi will detect its failure as long as pi itself is alive
–Its next ack/heartbeat will not be received (within the timeout), and thus pi will detect pj as having failed

Failure Detector Properties
•Completeness = every process failure is eventually detected (no misses)
•Accuracy = every detected failure corresponds to a crashed process (no mistakes)
•What is a protocol that is 100% complete?
•What is a protocol that is 100% accurate?
•Completeness and Accuracy
–Can both be guaranteed 100% in a synchronous distributed system
–Can never be guaranteed simultaneously in an asynchronous distributed system
Why?

Satisfying both Completeness and Accuracy in Asynchronous Systems
•Impossible because of arbitrary message delays, message losses
–If a heartbeat/ack is dropped (or several are dropped) from pj, then pj will be mistakenly detected as failed => inaccurate detection
–How large would the T waiting period in ping-ack, or the 3*T waiting period in heartbeating, need to be to obtain 100% accuracy?
–In asynchronous systems, delay/losses on a network link are impossible to distinguish from a faulty process
•Heartbeating: satisfies completeness but not accuracy (why?)
•Ping-Ack: satisfies completeness but not accuracy (why?)

Completeness or Accuracy? (in asynchronous system)
•Most failure detector implementations are willing to tolerate some inaccuracy, but require 100% completeness
•Plenty of distributed apps are designed assuming 100% completeness, e.g., p2p systems
–“Err on the side of caution”.
–Processes not “stuck” waiting for other processes
•But it’s ok to mistakenly detect once in a while, since the victim process need only rejoin as a new process
•Both Heartbeating and Ping-ack
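
Sketch: the heartbeating detector under a simulated clock. This is a minimal illustration, not the lecture's implementation. It assumes no real network: heartbeats are given as a list of (arrival time at pi, sequence number) pairs, and pi checks its timeout once per time step. The names HeartbeatDetector and first_detection_time, the step size, and the example arrival times are all made up for illustration. It shows completeness (a crashed pj is eventually suspected) and the inaccuracy caused by lost heartbeats (a live pj is suspected).

```python
from dataclasses import dataclass


@dataclass
class HeartbeatDetector:
    """pi's view of pj: suspect pj after `multiplier * period` of silence."""
    period: float            # T, the heartbeat interval
    multiplier: float = 3.0  # the '3' in 3*T; a tunable parameter
    last_seq: int = -1       # highest sequence number seen so far
    last_heard: float = 0.0  # time the last new heartbeat arrived

    def on_heartbeat(self, seq: int, now: float) -> None:
        # Only a heartbeat with a larger sequence number counts as new.
        if seq > self.last_seq:
            self.last_seq = seq
            self.last_heard = now

    def suspects(self, now: float) -> bool:
        # True once no new heartbeat has arrived for more than 3*T.
        return now - self.last_heard > self.multiplier * self.period


def first_detection_time(arrivals, period, horizon, step=1.0):
    """Return the first time pi suspects pj, or None if it never does.

    `arrivals` is a list of (arrival_time, seq) pairs seen by pi
    (hypothetical input format chosen for this sketch).
    """
    detector = HeartbeatDetector(period)
    pending = sorted(arrivals)
    now = 0.0
    while now <= horizon:
        # Deliver every heartbeat that has arrived by `now`.
        while pending and pending[0][0] <= now:
            arrival_time, seq = pending.pop(0)
            detector.on_heartbeat(seq, arrival_time)
        if detector.suspects(now):
            return now
        now += step
    return None


if __name__ == "__main__":
    T = 10
    # pj alive, no losses: heartbeat k arrives at 10*k + 1.
    healthy = [(10 * k + 1, k) for k in range(1, 11)]
    # pj crashes after sending its third heartbeat.
    crashed = [(11, 1), (21, 2), (31, 3)]
    # pj alive, but heartbeats 2..5 are lost on the network.
    lossy = [(11, 1)] + [(10 * k + 1, k) for k in range(6, 11)]
    for name, run in [("healthy", healthy), ("crashed", crashed), ("lossy", lossy)]:
        print(name, first_detection_time(run, period=T, horizon=100))
```

With T = 10, the crashed pj is suspected at time 62, just over 3*T after its last heartbeat arrived at time 31. In the lossy run pi suspects a live pj at time 42, which is exactly the kind of mistake an asynchronous network makes unavoidable.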
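
Sketch: where the 2T worst case for ping-ack comes from. This is a small worked calculation under simplifying assumptions that are not in the slides: pi sends pings at times 0, T, 2T, ..., message delays are negligible, and a ping sent at or after the crash goes unanswered. The function names are illustrative.

```python
import math


def ping_ack_detection_time(crash_time, period):
    """Time at which pi declares pj failed (assumptions in the lead-in)."""
    # First ping sent at or after the crash; pj never acks it.
    first_unanswered_ping = math.ceil(crash_time / period) * period
    # pi waits another T for the ack before declaring pj failed.
    return first_unanswered_ping + period


def detection_latency(crash_time, period):
    """Delay between pj's crash and pi's detection."""
    return ping_ack_detection_time(crash_time, period) - crash_time


if __name__ == "__main__":
    T = 10
    for crash in (20, 21, 25, 29):
        print(crash, detection_latency(crash, T))
```

A crash just after a ping has been answered (time 21 with T = 10) waits almost a full T for the next ping and then another T for the timeout, so the latency approaches but never reaches 2T. A crash right at a ping instant is detected after exactly T.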