U of I CS 425 - Lecture 4


Computer Science 425: Distributed Systems
CS 425 / CSE 424 / ECE 428, Fall 2010
Indranil Gupta (Indy)
September 2, 2010
Lecture 4
Reading: Sections 12.1 and part of 2.3.2
© 2010, I. Gupta

Slides in this lecture:
• Your new datacenter
• To build a failure detector
• Two Different System Models
• Failure Model
• What’s a failure detector?
• I. Ping-Ack Protocol
• II. Heartbeating Protocol
• In a Synchronous System
• Failure Detector Properties
• Satisfying both Completeness and Accuracy in Asynchronous Systems
• Completeness or Accuracy? (in asynchronous system)
• Failure Detection in a Distributed System
• Centralized Heartbeating
• Ring Heartbeating
• All-to-All Heartbeating
• Efficiency of Failure Detector: Metrics
• Accuracy metrics
• Other Types of Failures
• Processes and Channels
• Omission and Arbitrary Failures
• Summary
• Next Week

Your new datacenter
• You’ve been put in charge of a datacenter, and your manager has told you, “Oh no! We don’t have any failures in our datacenter!”
• Do you believe him/her?
• What would be your first responsibility?
• Build a failure detector
• What are some things that could go wrong if you didn’t do this?

To build a failure detector
• You have a few options
1. Hire 1000 people, each to monitor one machine in the datacenter and report to you when it fails.
2. Write a failure detector program (distributed) that automatically detects failures and reports to your workstation.
Which is preferable, and why?

Two Different System Models
Whenever someone gives you a distributed computing problem, the first question you want to ask is, “What is the model under which I need to solve the problem?”
Synchronous Distributed System
• Each message is received within bounded time
• Each step in a process takes lb < time < ub
• (Each local clock’s drift has a known bound)
Examples: Multiprocessor systems
Asynchronous Distributed System
• No bounds on message transmission delays
• No bounds on process execution
• (The drift of a clock is arbitrary)
Examples: Internet, wireless networks, datacenters, most real systems

Failure Model
• Process omission failure
  - Crash-stop (fail-stop): a process halts and does not execute any further operations
  - Crash-recovery: a process halts, but then recovers (reboots) after a while
• We will focus on crash-stop failures
  - They are easy to detect in synchronous systems
  - Not so easy in asynchronous systems

What’s a failure detector?
[Figure: two processes, pi and pj. pj suffers a crash-stop failure (pj is a failed process). pi, a non-faulty (alive) process, needs to know about pj’s failure.]
There are two main flavors of failure detectors…

I. Ping-Ack Protocol
[Figure: pi sends a ping to pj; pj replies with an ack. pi needs to know about pj’s failure.]
• pi queries pj once every T time units
• pj replies
• If pj does not respond within another T time units of being sent the ping, pi detects pj as failed
If pj fails, then within T time units, pi will send it a ping message. pi will time out within another T time units. Worst-case detection time = 2T.
The waiting time ‘T’ can be parameterized.

II. Heartbeating Protocol
[Figure: pj sends heartbeats to pi. pi needs to know about pj’s failure.]
• pj maintains a sequence number
• pj sends pi a heartbeat with an incremented sequence number after every T time units
• If pi has not received a new heartbeat for the past, say, 3*T time units since it received the last heartbeat, then pi detects pj as failed
If T >> round-trip time of messages, then worst-case detection time ~ 3*T (why?)
The ‘3’ can be changed to any positive number since it is a parameter.

In a Synchronous System
• The Ping-ack and Heartbeat failure detectors are always correct
  - Ping-ack: set waiting time ‘T’ to be > round-trip time upper bound
  - Heartbeat: set waiting time ‘3*T’ to be > round-trip time upper bound
• The following property is guaranteed:
  - If a process pj fails, then pi will detect its failure as long as pi itself is alive
  - Its next ack/heartbeat will not be received (within the timeout), and thus pi will detect pj as having failed

Failure Detector Properties
• Completeness = every process failure is eventually detected (no misses)
• Accuracy = every detected failure corresponds to a crashed process (no mistakes)
• What is a protocol that is 100% complete?
• What is a protocol that is 100% accurate?
• Completeness and Accuracy
  - Can both be guaranteed 100% in a synchronous distributed system
  - Can never be guaranteed simultaneously in an asynchronous distributed system
Why?

Satisfying both Completeness and Accuracy in Asynchronous Systems
• Impossible because of arbitrary message delays, message losses
  - If a heartbeat/ack is dropped (or several are dropped) from pj, then pj will be mistakenly detected as failed => inaccurate detection
  - How large would the T waiting period in ping-ack, or the 3*T waiting period in heartbeating, need to be to obtain 100% accuracy?
  - In asynchronous systems, delays/losses on a network link are impossible to distinguish from a faulty process
• Heartbeating: satisfies completeness but not accuracy (why?)
• Ping-Ack: satisfies completeness but not accuracy (why?)

Completeness or Accuracy? (in asynchronous system)
• Most failure detector implementations are willing to tolerate some inaccuracy, but require 100% completeness
• Plenty of distributed apps are designed assuming 100% completeness, e.g., p2p systems
  - “Err on the side of caution”.
  - Processes are not “stuck” waiting for other processes
• But it’s ok to mistakenly detect once in a while, since the victim process need only rejoin as a new process
• Both Heartbeating and Ping-ack
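
The 2T worst-case bound for the Ping-Ack protocol can be checked with a small simulation. This is only a sketch under stated assumptions, not part of the lecture: the function name ping_ack_detection_time and the one_way_delay parameter are hypothetical, pi is assumed to send its first ping at time 0, and the round trip is assumed to fit inside the timeout T.

```python
def ping_ack_detection_time(T, crash_time, one_way_delay=0):
    """Return the time at which pi declares pj failed (hypothetical helper).

    Assumptions (not from the slides): pi pings at times 0, T, 2T, ...;
    each message takes one_way_delay; pj crashes at crash_time and never
    recovers (crash-stop); 2 * one_way_delay < T so acks beat the timeout.
    """
    if 2 * one_way_delay >= T:
        raise ValueError("round trip must be shorter than the timeout T")
    ping_sent = 0
    while True:
        # pj answers only if it is still alive when the ping arrives.
        pj_alive_on_arrival = ping_sent + one_way_delay < crash_time
        if not pj_alive_on_arrival:
            # No ack will come back; pi times out T after sending the ping.
            return ping_sent + T
        ping_sent += T


if __name__ == "__main__":
    T = 10
    for crash_time in (0, 1, 9, 10, 11):
        detected = ping_ack_detection_time(T, crash_time)
        print(crash_time, detected, detected - crash_time)
```

With zero message delay, a crash just after a ping has been answered is detected almost 2T later, while a crash exactly at a ping time is detected T later, so the detection latency stays between T and 2T.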

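The heartbeating rule, and the way it trades accuracy for completeness, can be sketched the same way. The class below is an illustration under assumptions rather than the lecture's code: the name HeartbeatDetector, the injected now argument (used instead of a real clock so the example is deterministic), and the choice to clear a suspicion when a late heartbeat arrives are all hypothetical.

```python
class HeartbeatDetector:
    """pi's view of pj under the heartbeating protocol (hypothetical sketch)."""

    def __init__(self, T, multiplier=3, start_time=0):
        # pi suspects pj after multiplier * T time units without a new heartbeat.
        self.timeout = multiplier * T
        self.last_seq = -1
        self.last_heard = start_time

    def on_heartbeat(self, seq, now):
        # Only a heartbeat with a larger sequence number counts as new;
        # duplicates and reordered old heartbeats are ignored.
        if seq > self.last_seq:
            self.last_seq = seq
            self.last_heard = now

    def suspects(self, now):
        # True once more than the timeout has passed since the last new heartbeat.
        return now - self.last_heard > self.timeout


if __name__ == "__main__":
    detector = HeartbeatDetector(T=10)
    for seq, arrival in enumerate((10, 20, 30)):
        detector.on_heartbeat(seq, arrival)
    print(detector.suspects(60))   # still within 3*T of the last heartbeat
    print(detector.suspects(61))   # silence has exceeded 3*T
    # A heartbeat that was merely delayed by the network now arrives.
    detector.on_heartbeat(3, 75)
    print(detector.suspects(76))
```

If pj really has crashed, no new heartbeat ever arrives and the suspicion is eventually raised, which is the completeness property. In the run above the heartbeat with sequence number 3 was only delayed, so the suspicion at time 61 was a mistake: in an asynchronous system pi cannot tell a slow link from a crashed process, which is why accuracy is not guaranteed.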
