Presentation 2, Group A1
Professor Mohamed Khalil
Anita Kanuganti, Hemanth Rao

Fault Tolerance in Distributed Systems

Outline
- Overview of Fault Tolerance
- Importance in Distributed Systems
- Types of Faults
- Measurement of Faults
- Failure Models
- Redundancy and Forms of Redundancy
- Software Fault Tolerance Techniques
- Reliable Communication
- Distributed Commit
- Failure Recovery
- Ongoing Research
- References

Fault Tolerance
- The ability of a system to respond gracefully to an unexpected hardware or software failure.
- There are many levels of fault tolerance, the lowest being the ability to continue operation in the event of a power failure.

Importance in Distributed Systems
- Computer systems are not very reliable:
  - OS crashes frequently (Windows), buggy software, unreliable hardware, SW/HW incompatibilities
- Growing popularity of the Internet and the World Wide Web
- Example: what if your TV or car broke down every day? Users don't want to restart a TV or fix it by opening it up.
- So we need to make our computer systems more reliable and dependable.

Types of Faults
- Nature: systematic, random
- Duration: transient, intermittent, permanent
- Extent: global, local

Measurement of Faults
- Fault removal coverage
- Fault detection coverage
- Fault tolerance coverage

Failure Models

Type of Failure     Description
Crash failure       Server halts, but is working correctly until it halts
Omission failure    Server fails to respond to incoming requests
Timing failure      Server's response lies outside the specified time interval
Response failure    Server's response is incorrect
Arbitrary failure   Server may produce arbitrary responses at arbitrary times

Redundancy and its Forms
- Redundancy performs the same computation n times, so if one instance fails, another keeps operating.
- Forms of redundancy:
  - Hardware redundancy
  - Software redundancy
  - Information redundancy
  - Temporal (time) redundancy

Software Fault Tolerance Techniques
- N-Version Programming: different implementations of the same program, in order to avoid identical design faults
- Block Recovery: duplication of various critical
  software modules

Reliable Communication
One-to-one communication
- Use reliable transport protocols (TCP) or handle reliability at the application layer
- Possibilities:
  1. Client unable to locate server
  2. Lost request messages
  3. Server crashes after receiving request
  4. Lost reply messages
  5. Client crashes after sending request

Reliable Communication (cont'd)
One-to-many communication
- Reliable multicast
  1. Lost messages: need to retransmit
- Possibilities:
  1. ACK-based schemes: sender can become a bottleneck
  2. NACK-based systems

Distributed Commit
- Atomic multicast: either all processes in a group perform an operation, or none do
- Problem of distributed commit: all-or-nothing operations in a group of processes
- Possible approaches: 2-phase commit and 3-phase commit

Distributed Commit (cont'd): 2-Phase and 3-Phase Commit
2-phase commit
- A coordinator process coordinates the operation
- Involves 2 phases:
  - Voting phase: processes vote on whether to commit
  - Decision phase: actually commit or abort
- Problem: if the coordinator crashes, the processes block
3-phase commit
- Variant of 2-phase commit that avoids blocking

Recovery
- The techniques thus far allow failure handling
- Recovery: the operations that must be performed after a failure to return to a correct state
- Techniques:
  1. Checkpointing
  2. Message logging

Checkpointing
- Periodically checkpoint state
- Upon a crash, roll back to a previous checkpoint with a consistent state
- Types:
  - Independent checkpointing
  - Coordinated checkpointing

Message Logging
- Checkpointing is expensive:
  1. All processes restart from the previous consistent cut
  2. Taking a snapshot is expensive
  3. All computations since the previous snapshot have to be redone
- Combine checkpointing (expensive) with message logging (cheap):
  1. Take infrequent checkpoints
  2. Log all messages between checkpoints to local stable storage
  3. To recover: simply replay the messages logged since the previous checkpoint, which avoids redoing the computation from that checkpoint

Ongoing Research
- Intelligent adaptive fault tolerance

Summary

References
- Fault Tolerance in Distributed Systems, by Pankaj Jalote
- Adaptive Fault Tolerance in Distributed
  Systems, by Roger Bharath Melanie Dumas and Mevlut Erdem Kurul

Questions
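
Supplementary sketch: 2-phase commit

The voting and decision phases from the Distributed Commit slides can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions: the `Participant` class, its `can_commit` flag, and the `two_phase_commit` function are hypothetical names introduced here, there is no real network or timeout handling, and coordinator crashes (the blocking problem noted in the slides) are not modeled.

```python
from enum import Enum


class Vote(Enum):
    COMMIT = "commit"
    ABORT = "abort"


class Participant:
    """Hypothetical participant; can_commit stands in for its local decision."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "INIT"

    def vote(self):
        # Voting phase: answer the coordinator's vote request.
        self.state = "READY" if self.can_commit else "ABORT"
        return Vote.COMMIT if self.can_commit else Vote.ABORT

    def decide(self, decision):
        # Decision phase: apply the coordinator's global decision.
        self.state = "COMMIT" if decision is Vote.COMMIT else "ABORT"


def two_phase_commit(participants):
    """Coordinator logic: commit only if every participant votes to commit."""
    votes = [p.vote() for p in participants]
    all_agree = all(v is Vote.COMMIT for v in votes)
    decision = Vote.COMMIT if all_agree else Vote.ABORT
    for p in participants:
        p.decide(decision)
    return decision
```

A single abort vote is enough to abort the whole group, which is the all-or-nothing property the slides describe.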
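
Supplementary sketch: checkpointing combined with message logging

The recovery recipe from the Message Logging slide (infrequent checkpoints, log messages in between, replay on recovery) can be sketched for a single process. The `LoggedProcess` class is a hypothetical illustration: its state is just a running total, an in-memory list stands in for local stable storage, and message handling is assumed to be deterministic so that replay reproduces the lost state.

```python
import copy


class LoggedProcess:
    """Hypothetical process whose state is a running total of received integers."""

    def __init__(self):
        self.state = {"total": 0}
        self._checkpoint = copy.deepcopy(self.state)
        self._log = []  # stands in for local stable storage

    def _apply(self, message):
        # Deterministic state transition, so replaying the log is safe.
        self.state["total"] += message

    def receive(self, message):
        # Log the message before applying it, so it survives a crash.
        self._log.append(message)
        self._apply(message)

    def checkpoint(self):
        # Infrequent, expensive snapshot; earlier log entries are no longer needed.
        self._checkpoint = copy.deepcopy(self.state)
        self._log.clear()

    def recover(self):
        # Roll back to the last checkpoint, then replay the logged messages.
        self.state = copy.deepcopy(self._checkpoint)
        for message in self._log:
            self._apply(message)
```

Usage: after `receive(1)`, `receive(2)`, `checkpoint()`, `receive(3)`, `receive(4)`, a crash that destroys `state` can be repaired with `recover()`, which restores the checkpointed total and replays the two logged messages.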
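
Supplementary sketch: N-version programming with majority voting

The N-Version Programming bullet describes running different implementations of the same program to avoid identical design faults. One common way to combine their outputs is a majority vote, which is an assumption here since the slides do not say how results are reconciled. The function `n_version_execute` and the three sample versions are hypothetical, and results are assumed to be hashable.

```python
from collections import Counter


def n_version_execute(versions, *args):
    """Run every version on the same input and return the majority result."""
    results = []
    for version in versions:
        try:
            results.append(version(*args))
        except Exception:
            continue  # a crashed version simply contributes no vote
    if not results:
        raise RuntimeError("all versions failed")
    value, count = Counter(results).most_common(1)[0]
    # Require a strict majority of all versions, not just of the survivors.
    if count <= len(versions) // 2:
        raise RuntimeError("no majority among versions")
    return value


# Three independently written "absolute value" versions; the third has a design fault.
def abs_builtin(x):
    return abs(x)


def abs_branch(x):
    return x if x >= 0 else -x


def abs_faulty(x):
    return x  # wrong for negative inputs


VERSIONS = [abs_builtin, abs_branch, abs_faulty]
```

With these sample versions, `n_version_execute(VERSIONS, -5)` outvotes the faulty implementation because the two correct versions agree.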