SMU CSE 8343 - Fault Tolerance in Distributed Systems

Slide titles: CSE 8343 Presentation # 2; Outline; Terminology; Goals of Fault Tolerance; Fault Prevention Vs Fault Tolerance; Phases in Fault Tolerance; Causes of Faults; Fault Classification; FT in Distributed Systems; Fault Classes (Cont’d); FT Building Blocks; Byzantine agreement; IC Protocol With Ordinary Messages; ICA; Protocol with Signed Messages: Algorithm SM(m); Synchronized Clocks; Stable Storage; Implementation; Disk Shadowing; Redundant Arrays of Disk; Fail-Stop Processors; Reliable Broadcast; CheckPoint; CheckPoint cont…; Classification of Checkpoint Algorithms; Uncoordinated Checkpointing Algorithms; Coordinated Checkpointing Algorithm; Domino Effect; Rollback Recovery; Issues in Rollback Recovery; Recovery techniques; Logging implementation; Shadow Paging; Shadow Paging cont..; Goals in designing a Recovery Method

CSE 8343 Presentation # 2
Fault Tolerance in Distributed Systems
By Sajida Begum, Samina F Choudhry

Outline
• Terminology
• Goals of Fault Tolerance
• Fault Prevention Vs Fault Tolerance
• Phases in Fault Tolerance
• Causes of Faults
• Fault Classification
• Fault tolerance in Distributed Systems
• Recovering a Consistent State
• Checkpoint
• Rollback Recovery

Terminology
• Fault: physical defect
• Error: manifestation of a fault
• Failure: incorrect functioning of the system
• Fault Tolerance: provide the service despite the presence of faults in the system.
• Fault Tolerant System: masks the presence of faults in the system by using redundancy.

Goals of Fault Tolerance
• Dependability: trustworthiness of a computer system
• Attributes of dependability:
  – Reliability: used when momentary periods of incorrect operation are unacceptable
  – Availability
  – Safety: different from reliability
  – Security

Fault Prevention Vs Fault Tolerance
• Fault Avoidance
  – Assumes system failure will occasionally occur
  – No redundancy in the system to mask failures
  – Systems fail when the component(s) fail
  – Manual maintenance
• Fault Tolerance
  – Assumes fault prevention techniques will never be able to eliminate all possible faults
  – Redundancy
  – Fault detection
  – Recovery

Phases in Fault Tolerance
• Error Detection
• Damage confinement
• Error Recovery
  – Forward Recovery
  – Backward Recovery
• Fault Treatment & continued Service

Causes of Faults
• Physical defects
• Wear and tear
• External intervention
• User errors

Fault Classification
[Figure labels: System Failure; Incorrect Design; Unstable or marginal components; Unstable Environment; Permanent fault; Operator Mistake; Permanent Error; Transient Error; Intermittent Error]

FT in Distributed Systems
• Failures and Fault Classification
• Crash fault
  – Component halts or loses internal state
  – Will not go through the correct state transition
• Omission fault
  – Will not respond to some inputs
• Timing fault
  – Responds too slowly or too fast (performance fault)

Fault Classes (Cont’d)
• Byzantine fault
  – Behaves in an arbitrary way
• Incorrect computation fault
[Figure labels: Byzantine; Timing; Omission; Crash]

FT Building Blocks
• Byzantine agreement
• Synchronized clocks
• Stable storage
• Fail-stop processors
• Detection and diagnosis
• Reliable messaging

Byzantine agreement
[Figure: two scenarios with a transmitter and nodes i and j exchanging the values 1 and 0; in one, node j is faulty; in the other, the transmitter is faulty]

IC Protocol With Ordinary Messages
• Assumptions:
  – All messages are delivered correctly
  – The receiver knows the sender
  – The absence of a message can be detected
• The algorithm runs in several rounds

ICA
• Algorithm ICA(0)
  – The transmitter sends the value to the other (n-1) nodes
  – Each node uses the received value, or the default value if nothing is received
• ICA(m), m > 0:
  – The transmitter sends the value to the other nodes
  – Node i runs ICA(m-1) to send vi to the other (n-2) nodes
  – Node i uses the value majority(v1, v2, …, vn-1)

Protocol with Signed Messages
Algorithm SM(m)
• Initialize Vi = ∅ (the empty set)
• The transmitter sends the signed value to all other nodes
• For each i:
  – If node i receives message v:0 from the transmitter, it sets Vi to {v} and sends message v:0:i to every other node
  – If node i receives the message v:0:j1:j2:…:jk and v is not in Vi, it adds v to Vi; if k < m, it sends the message v:0:j1:j2:…:jk:i to every node other than j1, j2, …, jk
  – When no more messages arrive, the final value is choice(Vi)

Synchronized Clocks
• Internal synchronization
• External synchronization
• Drift of physical clocks
  – The values of all non-faulty processors' clocks must be approximately equal
  – The change to non-faulty processors' clocks during resynchronization should be minimal
• Deterministic and probabilistic clock synchronization

Stable Storage
• Operations:
  – write (address, data);
  – read (address), returns (status, data)
• Failures:
  – Transient failures
  – Bad sector
  – Controller failure
  – Disk failure

Implementation
• Using only one disk
  – Careful read: repeat the read until it returns status good
  – Careful write: a write followed by a careful read
  – Will not cover decay events and crashes during a write
• Partition the disk into ordered pairs of pages that are not decay related

Disk Shadowing
[Figure labels: CPU1; CPU2; Disk Controller; Disk Controller; Disk 1; Disk 2]

Redundant Arrays of Disk
• Files are "striped" across multiple spindles
• Redundancy yields high data availability
  – Capacity penalty to store it
  – Bandwidth penalty to update
• Techniques:
  – Mirroring/Shadowing (high capacity cost)
  – Horizontal Hamming Codes (overkill)
  – Parity & Reed-Solomon Codes

Fail-Stop Processors
Fail-Stop Behavior
• After a failure
  – Stops executing
  – Internal state, including the volatile memory, is lost
  – Any processor can detect the failure
• Impossible to implement with just one processor
• k-fail-stop implementation

Reliable Broadcast
• Reliable
• Atomic
• Causal
• Using message forwarding
• Using piggybacked acks

CheckPoint
• What is checkpointing?
  – A saved local state of a system is called a checkpoint.
  – The process of saving checkpoints on stable storage is called checkpointing.
• Need for checkpointing?
  – Checkpointing is used to bring a system to a consistent state after failures (Rollback Recovery).

CheckPoint cont…
• Simplifies the task of determining which actions of transactions need to be undone or redone when a failure occurs.
• A checkpoint record contains a list of active transactions.
• Steps:
  – Write a begin checkpoint record into the log.
  – Collect the checkpoint data into the
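
The recursive ICA(m) protocol from the ICA slide can be sketched as a small single-process simulation. This is a minimal sketch, not the presenters' implementation: the function names (ica, send, majority), the DEFAULT value, and the way a faulty node misbehaves (it sends the receiver's id modulo 2 instead of the real value) are all assumptions made for illustration. The slides do not state it, but the standard result for ordinary (unsigned) messages is that agreement needs more than 3m nodes to tolerate m faulty ones, so the usage below runs n = 4 with m = 1.

```python
from collections import Counter

DEFAULT = 0  # assumed default value used when votes tie


def majority(values):
    """Return the most common value, or DEFAULT when the top two tie."""
    counts = Counter(values).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return DEFAULT
    return counts[0][0]


def send(sender, receiver, value, faulty):
    """Deliver a value; a faulty sender lies (hypothetical behaviour)."""
    if sender in faulty:
        return receiver % 2  # different receivers may see different values
    return value


def ica(m, transmitter, value, receivers, faulty):
    """Return {receiver: value it adopts for this transmitter's message}."""
    # Step 1: the transmitter sends its value to every receiver.
    received = {r: send(transmitter, r, value, faulty) for r in receivers}
    if m == 0:
        return received  # ICA(0): use the received value directly

    # Step 2: each receiver i relays what it got to the others via ICA(m-1).
    relayed = {}
    for i in receivers:
        others = [r for r in receivers if r != i]
        relayed[i] = ica(m - 1, i, received[i], others, faulty)

    # Step 3: each receiver takes the majority of its own value and the
    # values relayed to it by every other receiver.
    decided = {}
    for i in receivers:
        votes = [received[i]] + [relayed[j][i] for j in receivers if j != i]
        decided[i] = majority(votes)
    return decided


if __name__ == "__main__":
    # Node 0 is the transmitter and sends 1; node 3 is faulty.
    print(ica(1, 0, 1, [1, 2, 3], faulty={3}))
    # Same run with a faulty transmitter instead.
    print(ica(1, 0, 1, [1, 2, 3], faulty={0}))
```

In the first run the non-faulty nodes 1 and 2 both adopt the transmitter's value; in the second, all three receivers end up with the same value even though the transmitter sent them inconsistent ones.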
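
The careful read and careful write operations from the stable storage Implementation slide can be sketched as follows. This is an illustrative sketch under stated assumptions: FlakyDisk is a hypothetical in-memory disk whose reads fail transiently on a fixed pattern (only every third read succeeds), and the max_attempts bound is an added safeguard that the slides do not mention. As the slide notes, this single-disk scheme does not cover decay events or crashes during a write.

```python
class FlakyDisk:
    """Hypothetical disk: read(address) returns (status, data)."""

    def __init__(self):
        self.pages = {}
        self.reads = 0  # number of read attempts so far

    def write(self, address, data):
        self.pages[address] = data

    def read(self, address):
        self.reads += 1
        # Simulated transient failure: only every third read succeeds.
        if self.reads % 3 != 0 or address not in self.pages:
            return ("bad", None)
        return ("good", self.pages[address])


def careful_read(disk, address, max_attempts=10):
    """Repeat the read until it returns status good (bounded here)."""
    for _ in range(max_attempts):
        status, data = disk.read(address)
        if status == "good":
            return status, data
    return "bad", None


def careful_write(disk, address, data, max_attempts=10):
    """Write, then verify with a careful read; retry on mismatch."""
    for _ in range(max_attempts):
        disk.write(address, data)
        status, stored = careful_read(disk, address)
        if status == "good" and stored == data:
            return True
    return False


if __name__ == "__main__":
    disk = FlakyDisk()
    print(careful_write(disk, 7, "log record"))
    print(careful_read(disk, 7))
```

Bounding the retries turns a persistent failure (for example a bad sector) into a reported error instead of an endless loop, which is one possible design choice rather than something the slides prescribe.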

