U of I CS 425 - Failure Detectors - D2555137


School name University of Illinois

Course CS 425 - Distributed Systems

Pages 25


Computer Science 425: Distributed Systems
Lecture 8: Failure Detectors
(Sections 12.1 and part of 2.3.2)
Indranil Gupta

Outline

Two Different System Models; Failure Model; What's a failure detector?; I. Ping-Ack Protocol; II. Heartbeating Protocol; Failure Detector Properties; Completeness or Accuracy?; Failure Detection in a Distributed System; Centralized Heartbeating; Ring Heartbeating; All-to-All Heartbeating; Efficiency of Failure Detector: Metrics; Accuracy metrics; More Accuracy metrics; Processes and Channels; Other Failure Types; Omission and Arbitrary Failures; Summary; Next…

Two Different System Models

Synchronous Distributed System
  - Each message is received within bounded time
  - Each step in a process takes lb < time < ub
  - (Each local clock's drift has a known bound)
Asynchronous Distributed System
  - No bounds on process execution
  - No bounds on message transmission delays
  - (The drift of a clock is arbitrary)
The Internet is an asynchronous distributed system.

Failure Model

Process omission failure
  - Crash-stop (fail-stop): a process halts and does not execute any further operations
  - Crash-recovery: a process halts, but then recovers (reboots) after a while
Crash-stop failures can be detected in synchronous systems.
Next: detecting crash-stop failures in asynchronous systems.

What's a failure detector?

[Figure: processes pi and pj. pj suffers a crash-stop failure (marked X); pi needs to know about pj's failure.]

I. Ping-Ack Protocol

[Figure: pi sends a ping to pj and pj answers with an ack; pi needs to know about pj's failure.]
  - pi queries pj once every T time units
  - pj replies
  - if pj does not respond within T time units, pi marks pj as failed
If pj fails, within T time units pi will send it a ping message, and will time out within another T time units. Worst-case detection time = 2T.

II. Heartbeating Protocol

[Figure: pj sends heartbeats to pi; pi needs to know about pj's failure.]
  - pj maintains a sequence number
  - pj sends pi a heartbeat with an incremented seq. number after every T' (= T) time units
  - if pi has not received a new heartbeat for the past T time units, pi declares pj as failed
If pj has sent x heartbeats until the time it fails, then pi will time out within (x+1)*T time units in the worst case, and will detect pj as failed.
In reality, detection time is also T time units (why?)

Failure Detector Properties

• Completeness = every process failure is eventually detected (no misses)
• Accuracy = every detected failure corresponds to a crashed process (no mistakes)
• Given a failure detector that satisfies both Completeness and Accuracy
  - One can show that Consensus is achievable
  - FLP => one cannot design a failure detector (for an asynchronous system) that guarantees both above properties

Completeness or Accuracy?

• Most failure detector implementations are willing to tolerate some inaccuracy, but require 100% Completeness
• Plenty of distributed apps designed assuming 100% completeness, e.g., p2p systems
  - "Err on the side of caution".
  - Other processes need to make repairs whenever a failure happens
• Heartbeating: satisfies completeness but not accuracy (why?)
• Ping-Ack: satisfies completeness but not accuracy (why?)

Completeness or Accuracy? (continued)

• Both Heartbeating and Ping-Ack provide
  - Probabilistic accuracy (for a process detected as failed, with some probability close to 1.0, it is true that it has actually crashed).
• That was for asynchronous systems
• Heartbeating and Ping-Ack can satisfy both completeness and accuracy in synchronous systems (why?)

Failure Detection in a Distributed System

• Difference from original failure detection is
  - we want not one process (pi), but all processes in system to know about failure
  - May need to combine failure detection with a dissemination protocol
  - What's an example of a dissemination protocol?
• A reliable multicast protocol!

Centralized Heartbeating

[Figure: pj, pi, and other processes; heartbeat message (pj, Heartbeat Seq. l++).]
Needs a separate dissemination component
Downside?

Ring Heartbeating

[Figure: pj, pi, and other processes; heartbeat message (pj, Heartbeat Seq. l++).]
Needs a separate dissemination component
Downside?

All-to-All Heartbeating

[Figure: pj, pi, and other processes; heartbeat message (pj, Heartbeat Seq. l++).]
Does not need a separate dissemination component
Downside?

Efficiency of Failure Detector: Metrics

• Measuring Speed: Detection Time
  - Time between a process crash and its detection
  - Determines speed of failure detector
• Measuring Accuracy: depends on distributed application

Accuracy metrics

• Tmr: Mistake recurrence time
  - Time between two consecutive mistakes
• Tm: Mistake duration time
  - Length of time for which correct process is marked as failed (for crash-recovery model)
[Figure: timeline of pj (up) against pi's view of pj (up / down), with Tm and Tmr marked.]

More Accuracy metrics

Other possible metrics:
• Number of false failure detections per time unit
• Fraction of failure detections that are false

Processes and Channels

[Figure: process p sends message m through its outgoing message buffer, over the communication channel, to the incoming message buffer of process q, which receives it.]
Communication omission failures
  - Send-omission: loss of messages between the sending process and the outgoing message buffer (both inclusive). What might cause this?
  - Channel omission: loss of message in the communication channel. What might cause this?
  - Receive-omission: loss of messages between the incoming message buffer and the receiving process (both inclusive). What might cause this?

Other Failure Types

Arbitrary failures
  - Arbitrary process failure: arbitrarily omits intended processing steps or takes unintended processing steps.
  - Arbitrary channel failures: messages may be corrupted, duplicated, delivered out of order, incur extremely large delays; or non-existent messages may be delivered.
Above two are Byzantine failures, e.g., due to hackers, man-in-the-middle attacks, viruses, worms, etc.
A variety of Byzantine fault-tolerant protocols have been designed in literature!

Omission and Arbitrary Failures

Class of failure | Affects | Description
Fail-stop or Crash-stop | Process | Process halts and remains halted. Other processes may detect this state.
Omission | Channel | A message inserted in an outgoing message buffer never arrives at the other end's incoming message buffer.
Send-omission | Process | A process completes a send, but the
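
As an illustration of the Heartbeating Protocol slide, here is a minimal Python sketch of pi's side of the protocol. The class name HeartbeatDetector, its fields, and the use of explicit `now` timestamps instead of a real clock are assumptions made for this example; they do not come from the lecture.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatDetector:
    """pi's view of the processes it monitors (hypothetical helper, not from the slides)."""
    timeout: float                                   # T: allowed silence before declaring failure
    last_seq: dict = field(default_factory=dict)     # process id -> highest seq. number seen
    last_heard: dict = field(default_factory=dict)   # process id -> arrival time of last new heartbeat

    def on_heartbeat(self, pid: str, seq: int, now: float) -> None:
        # Only a new (higher) sequence number counts as fresh evidence of life.
        if seq > self.last_seq.get(pid, -1):
            self.last_seq[pid] = seq
            self.last_heard[pid] = now

    def suspected(self, now: float) -> set:
        # Every process that has been silent for more than T time units is declared failed.
        return {pid for pid, t in self.last_heard.items() if now - t > self.timeout}


detector = HeartbeatDetector(timeout=1.0)

# pj sends heartbeats 0, 1, 2 at times 0, 1, 2 and then crashes.
for seq, t in enumerate([0.0, 1.0, 2.0]):
    detector.on_heartbeat("pj", seq, now=t)

early = detector.suspected(now=2.5)   # only 0.5 time units of silence so far
late = detector.suspected(now=3.5)    # 1.5 time units of silence, more than T
print(early, late)
```

In an asynchronous system the same code would also suspect a live pj whose heartbeat is merely delayed by more than T, which is one way to see why heartbeating is complete but not accurate.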
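
The 2T detection bound on the Ping-Ack Protocol slide can be checked with a small worked calculation. The sketch below assumes pi sends pings at times 0, T, 2T, ... and that message delay is negligible; the function name ping_ack_detection_time is hypothetical.

```python
import math


def ping_ack_detection_time(crash_time: float, period: float) -> float:
    """Time from pj's crash until pi marks it failed (assumes zero network delay)."""
    # The first ping sent at or after the crash is the first one to go unanswered.
    first_unanswered_ping = math.ceil(crash_time / period) * period
    # pi then waits one more period for the ack before marking pj as failed.
    return first_unanswered_ping + period - crash_time


T = 1.0
just_before_ping = ping_ack_detection_time(crash_time=4.0, period=T)
mid_period = ping_ack_detection_time(crash_time=4.5, period=T)
print(just_before_ping, mid_period)
```

Under these assumptions the detection time always lies between T and 2T: a crash just after a ping has been answered waits almost a full period for the next ping and then another period for the timeout.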
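
The accuracy metrics Tm and Tmr, and the false-detection rate from the More Accuracy metrics slide, can be computed from a log of mistakes. This is a sketch under stated assumptions: each mistake is recorded as a (start, end) interval during which a correct process was marked as failed, and Tmr is measured from the start of one mistake to the start of the next. The function name and the sample numbers are made up for illustration.

```python
from statistics import mean


def accuracy_metrics(mistakes, observation_time):
    """mistakes: (start, end) intervals in which a correct process was wrongly marked failed."""
    mistakes = sorted(mistakes)
    # Tm samples: how long each mistake lasted.
    durations = [end - start for start, end in mistakes]
    # Tmr samples: time from the start of one mistake to the start of the next.
    recurrences = [b[0] - a[0] for a, b in zip(mistakes, mistakes[1:])]
    return {
        "avg_Tm": mean(durations) if durations else None,
        "avg_Tmr": mean(recurrences) if recurrences else None,
        "false_detections_per_time_unit": len(mistakes) / observation_time,
    }


# Illustrative log: three false suspicions over 100 time units.
metrics = accuracy_metrics([(10, 12), (30, 31), (70, 74)], observation_time=100)
print(metrics)
```

A longer timeout usually makes mistakes rarer (larger Tmr) at the cost of slower detection, which is why the slides treat speed and accuracy as separate metrics.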
