CSE 8343 Presentation 2 Fault Tolerance in Distributed Systems By Sajida Begum Samina F Choudhry 1 Outline Terminology Goals of Fault Tolerance Fault Prevention Vs Fault Tolernance Phases in Fault Tolerance Causes of Faults Fault Classification Fault tolerance in Distributed Systems Recovering a Consistent State Checkpoint Rollback Recovery 2 Terminology Fault Physical Defect Error Manifestation of fault Failure Incorrect functioning of the system Fault Tolerance Provide the service despite the presence of faults in the system Fault Tolerant System Mask the presence of faults in the system by using redundancy 3 Goals of Fault Tolerance Dependability Trustworthiness of a computer system Attributes of Dependability Reliability Used when momentary periods of incorrect operation is unacceptable Availability Safety different from reliability Security 4 Fault Prevention Vs Fault Tolernance Fault Avoidance Assumes system failure will occasionally occur No redundancy in the system to mask failures Systems fail when the component s fail Manual maintenance Fault Tolerance Assumes fault prevention techniques will never be able to eliminate all possible faults Redundancy Fault detection Recovery 5 Phases in Fault Tolerance Error Detection Damage confinement Error Recovery Forward Recovery Backward Recovery Fault Treatment continued Service 6 Causes of Faults Physical defects Wear and tear External intervention User errors 7 Fault Classification Permanent fault Permanent Error Incorrect Design Unstable or marginal components Unstable Environment Intermittent Error System Failure Transient Error Operator Mistake 8 FT in Distributed Systems Failures and Fault Classification Crash fault Component halts or loses internal state Will not go through correct state transition Omission fault Will not respond to some inputs Timing fault Makes it slower or faster performance fault 9 Fault Classes Cont d Byzantine fault Behaves in an arbitrary way Incorrect computation fault Crash Omi Timing Byzantine 10 FT Building Blocks Byzantine agreement Synchronized clocks Stable storage Fail stop processors Detection and diagnosis Reliable messaging 11 Byzantine agreement Transmitter 1 1 Node i 0 Node j Node j is faulty Transmitter 1 0 Node i 0 Node j Transmitter is faulty 12 IC Protocol With Ordinary Messages Assumptions All messages delivered correctly Receiver knows the sender Absence of a message can be detected Algorithm runs in various rounds 13 ICA Algorithm ICA 0 Transmitter sends the value to other n 1 nodes Each node uses the received value or the default value in case of no reception ICA m m 0 Transmitter sends value to other nodes Node I runs ICA m 1 to send Vi to other n 2 nodes Node I uses the value majority v1 v2 vn 1 14 Protocol with Signed Messages Algorithm SM m Initialize Vi 0 The transmitter sends the signed value to all other nodes For each I Receives message v 0 from transmitter sets Vi to v and sends message v 0 I to every other node If node I receives the message v 0 j1 j2 jk and v not in Vi add v to Vi if k m sends the message v 0 j1 j2 jk I to every node other than j1 j2 jk When the messages are over final value is choice Vi 15 Synchronized Clocks Internal synchronization External synchronization Drift of physical clocks Value of all the non faulty processors clocks must be approximately equal Change of the non faulty processors clocks during resynchronization should be minimal Deterministic and probabilistic clock synchronization 16 Stable Storage Operations write address data read address returns status data Failures Transient failures Bad sector Controller failure Disk failure 17 Implementation Using only one disk Careful read repeated read until it returns status good Careful write write followed by a careful read Will not cover decay events and crashes during write Partition disk into ordered pairs of pages that are not decay related 18 Disk Shadowing CPU2 CPU1 Disk Controller Disk 1 Disk 2 Disk Controller 19 Redundant Arrays of Disk Files are striped across multiple spindles Redundancy yields high data availability Capacity penalty to store it Bandwidth penalty to update Mirroring Shadowing high capacity cost Techniques Horizontal Hamming Codes overkill Parity Reed Solomon Codes 20 Fail Stop Processors Fail Stop Behavior After a failure Stops executing Internal state including the volatile memory lost Any processor can detect the failure Impossible to implement with just one processor k fail stop implementation 21 Reliable Broadcast Reliable Atomic Casual Using Message forwarding Using Piggybacked Acks 22 CheckPoint What is Checkpointing Saved local states of a system is called checkpoint Process of saving the checkpoints on a stable storage is called checkpointing Need for Checkpointing Checkpointing is used to bring a system to consistent state after failures Rollback Recovery 23 CheckPoint cont Simplifies the task of determining actions of transactions that need to be undone or redone when a failure occurs A checkpoint record contains a list of active transactions Steps Write a begin checkpoint record into the log Collect the checkpoint data into the stable storage Write an end checkpoint into the log 24 Classification of Checkpoint Algorithms Uncoordinated Checkpointing Asynchronous Each process may take a checkpoint when it is most convenient Coordinated Checkpointing Synchronous When a process takes checkpoints it asks all relevant processes to take checkpoints 25 Uncoordinated Checkpointing Algorithms Allows each process maximum autonomy in deciding when to take a checkpoint Disadvantages Possibility of Domino effect Process may take a useless checkpoint Forces each process to maintain multiple checkpoints 26 Coordinated Checkpointing Algorithm Advantages Not susceptible to domino effect Maintains only one permanent checkpoint Disadvantages Large latency involved in committing output Suffers from high overhead 27 Domino Effect An erroneous syntax rollbacks to previous checkpoint A process that received a message from the recovering process after the rollback point also needs to be rollback When p2 goes to C then p1 needs to go to B then p0 needs to go to A then p2 needs to go to Checkpoints processors P0 A P1 B C P2 28 Rollback Recovery Rollback Recovery If there is an error in process then all other dependent processes are rolled back to a consistent state and restarted 29 Issues in Rollback Recovery Minimize the extent of Rollback Minimize the number of processes rolling
View Full Document