U of I CS 425 - Distributed Systems - D1917922

School: University of Illinois

Course: CS 425 - Distributed Systems

Slide 1
Acknowledgement
Administrative
Plan for Today
Two Different System Models
Failure Model
Network partition
What’s a failure detector?
Slide 9
Slide 10
I. Ping-Ack Protocol
II. Heart-beating Protocol
Failure Detector Properties
Completeness or Accuracy?
Slide 15
Failure Detection in a Distributed System
Slide 17
Centralized Heart-beating
Ring Heart-beating
All-to-All Heart-beating
Efficiency of Failure Detector: Metrics
Accuracy Metrics
More Accuracy Metrics
Processes and Channels
Other Failure Types
Slide 26
Omission and Arbitrary Failures
Timing Failures
Slide 29
Masking Failures
Summary
Next…

Lecture 11
Failure Detectors
(Sections 12.1 and part of 2.3.2)
Klara Nahrstedt
CS 425/ECE 428/CSE424 Distributed Systems (Fall 2009)

Acknowledgement
•The slides during this semester are based on ideas and material from the following sources:
–Slides prepared by Professors M. Harandi, J. Hou, I. Gupta, N. Vaidya, Y-Ch. Hu, S. Mitra.
–Slides from Professor S. Gosh’s course at University of Iowa.

Administrative
•HW 2 posted September 22, Tuesday
–Deadline: October 6 (Tuesday), 2pm (at the beginning of the class)

Plan for Today
•Failure detectors
•Properties – completeness & accuracy
•Two failure detector algorithms
–Heart-beating and Ping-Ack
•Distributed failure detection through heart-beating algorithms
–Centralized, Ring, All-to-all
•Accuracy metrics
•Other types of failures

Two Different System Models
Synchronous Distributed System
•Each message is received within bounded time
•Each step in a process takes lb < time < ub
•(Each local clock’s drift has a known bound)
Asynchronous Distributed System
•No bounds on process execution
•No bounds on message transmission delays
•(The drift of a clock is arbitrary)
The Internet is an asynchronous distributed system

Failure Model
Process omission failure
•Crash-stop (fail-stop) – a process halts and does not execute any further operations
•Crash-recovery – a process halts, but then recovers (reboots) after a while
Crash-stop failures can be detected in synchronous systems
Next: detecting crash-stop failures in asynchronous systems

Network partition
[Figure: a crashed router partitions the network]

What’s a failure detector?
[Figure: processes pi and pj; pj suffers a crash-stop failure; pi needs to know about pj’s failure]

I. Ping-Ack Protocol
[Figure: pi sends a ping to pj, pj returns an ack; pi needs to know about pj’s failure]
- pi queries pj once every T time units
- pj replies
- if pj does not respond within T time units, pi marks pj as failed
If pj fails, then within T time units pi will send it a ping message, and pi will time out within another T time units. Detection time = 2T

II. Heart-beating Protocol
[Figure: pj sends heartbeats to pi; pi needs to know about pj’s failure]
- pj maintains a sequence number
- pj sends pi a heartbeat with an incremented seq. number after every T’ (= T) time units
- if pi has not received a new heartbeat for the past T time units, pi declares pj as failed
If pj has sent x heartbeats until the time it fails, then pi will time out within (x+1)*T time units in the worst case, and will detect pj as failed.
In reality, detection time is also T time units (why?)

Failure Detector Properties
•Completeness = every process failure is eventually detected (no misses)
•Accuracy = every detected failure corresponds to a crashed process (no mistakes)
•Given a failure detector that satisfies both Completeness and Accuracy
–One can show that Consensus is achievable
–FLP => one cannot design a failure detector (for an asynchronous system) that guarantees both of the above properties

Completeness or Accuracy?
•Most failure detector implementations are willing to tolerate some inaccuracy, but require 100% completeness
•Plenty of distributed apps are designed assuming 100% completeness, e.g., p2p systems
–“Err on the side of caution”.
–Other processes need to make repairs whenever a failure happens
•Heart-beating – satisfies completeness but not accuracy (why?)
•Ping-Ack – satisfies completeness but not accuracy (why?)

Completeness or Accuracy?
•Both Heart-beating and Ping-Ack provide
–Probabilistic accuracy (for a process detected as failed, with some probability close to 1.0, it is true that it has actually crashed).
–That was for asynchronous systems
•Heart-beating and Ping-Ack can satisfy both completeness and accuracy in synchronous systems (why?)

Failure Detection in a Distributed System
•Difference from original failure detection is
–we want not one process (pi), but all processes in the system to know about the failure
–May need to combine failure detection with a dissemination protocol
–What’s an example of a dissemination protocol?
•A reliable multicast protocol!

Centralized Heart-beating
[Figure: each process pj sends heartbeats (pj, Heartbeat Seq. l++) to a single process pi]
Needs a separate dissemination component
Downside?

Ring Heart-beating
[Figure: processes arranged in a ring; each pj sends heartbeats (pj, Heartbeat Seq. l++) to its neighbor pi]
Needs a separate dissemination component
Downside?

All-to-All Heart-beating
[Figure: each process pj sends heartbeats (pj, Heartbeat Seq. l++) to every other process pi]
Does not need a separate dissemination component
Downside?

Efficiency of Failure Detector: Metrics
•Measuring Speed: Detection Time
–Time between a process crash and its detection
–Determines speed of failure detector
•Measuring Accuracy: depends on distributed application

Accuracy Metrics
•Tmr: Mistake recurrence time
–Time between two consecutive mistakes
•Tm: Mistake duration time
–Length of time for which a correct process is marked as failed (for crash-recovery model)
[Figure: timeline of pi’s view of pj (up, down, up) while pj is actually up, with Tm and Tmr marked]

More Accuracy Metrics
•Number of false failure detections per time unit (false positives)
–System reported failure, but actually the process was up
–Failure detector is inaccurate
•Number of not detected failures (false negatives)
–System did not report failure, but the process failed
–Failure detector is incomplete

Processes and Channels
[Figure: process p sends message m through its outgoing message buffer, across the communication channel, into the incoming message buffer of process q, which receives it]

Communication Omission Failures
•Send-omission: loss of messages between the sending process and the outgoing message buffer (both inclusive)
–What might cause this?
•Channel omission: loss of message in the communication channel.
–What might
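
The heart-beating protocol described in the slides can be sketched as a small Python class. This is a minimal illustration and not code from the lecture: the class name `HeartbeatDetector`, its method names, and the explicit `now` timestamps (used in place of a real clock so the behavior is deterministic) are all assumptions made for the example.

```python
class HeartbeatDetector:
    """pi's local view of one monitored process pj (hypothetical sketch)."""

    def __init__(self, timeout, start_time=0.0):
        self.timeout = timeout        # T: longest silence pi tolerates
        self.last_seq = 0             # highest heartbeat sequence number seen
        self.last_heard = start_time  # time the last fresh heartbeat arrived

    def on_heartbeat(self, seq, now):
        # Only a heartbeat with a larger sequence number counts as new;
        # stale or duplicated heartbeats do not reset the timer.
        if seq > self.last_seq:
            self.last_seq = seq
            self.last_heard = now

    def suspects(self, now):
        # pi declares pj failed if no new heartbeat arrived in the past T units.
        return now - self.last_heard > self.timeout


def demo():
    detector = HeartbeatDetector(timeout=1.0)
    # pj sends heartbeats 1, 2, 3 at times 1.0, 2.0, 3.0 and then goes silent.
    for seq, t in [(1, 1.0), (2, 2.0), (3, 3.0)]:
        detector.on_heartbeat(seq, t)
    before = detector.suspects(3.9)   # silence shorter than T
    after = detector.suspects(4.5)    # silence longer than T
    # A heartbeat that was only delayed arrives late and clears the suspicion.
    detector.on_heartbeat(4, 4.6)
    cleared = detector.suspects(4.7)
    return before, after, cleared
```

In `demo()`, pi does not suspect pj at time 3.9 but does at time 4.5. The late heartbeat at 4.6 then clears the suspicion, which illustrates why a timeout-based detector in an asynchronous system can be complete without being accurate: a slow message is indistinguishable from a crash until it arrives.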

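The accuracy metrics Tm and Tmr, and the false-positive rate, can be computed from a log of pi's mistakes. The sketch below assumes each mistake is recorded as a `(start, end)` interval during which a correct pj was marked as failed, and it measures Tmr from the start of one mistake to the start of the next. The slides say only "time between two consecutive mistakes", so that start-to-start convention, the function names, and the sample numbers are assumptions for illustration.

```python
def mistake_durations(mistakes):
    """Tm for each mistake: how long a correct process stayed marked as failed."""
    return [end - start for start, end in mistakes]


def mistake_recurrence_times(mistakes):
    """Tmr for each consecutive pair of mistakes, measured start to start."""
    return [later[0] - earlier[0] for earlier, later in zip(mistakes, mistakes[1:])]


def false_positive_rate(mistakes, observation_time):
    """Number of false failure detections per time unit."""
    return len(mistakes) / observation_time


# Hypothetical log: pi wrongly marked pj as failed three times in 100 time units.
sample_mistakes = [(10, 12), (30, 31), (70, 75)]

durations = mistake_durations(sample_mistakes)            # [2, 1, 5]
recurrences = mistake_recurrence_times(sample_mistakes)   # [20, 40]
rate = false_positive_rate(sample_mistakes, 100)          # 3 mistakes / 100 units
```

A detector tuned with a longer timeout would typically show fewer and shorter mistakes (larger Tmr, smaller false-positive rate) at the cost of a longer detection time, which is the trade-off the efficiency metrics are meant to expose.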