SMU CSE 8343 - Fault Tolerance in Distributed Systems

Presentation-2, Group-A1
Fault Tolerance in Distributed Systems
Professor: Mohamed Khalil
Anita Kanuganti, Hemanth Rao

Outline
--- Overview of Fault Tolerance
--- Importance in Distributed Systems
--- Types of Faults
--- Measurement of Faults
--- Failure Models
--- Redundancy and Forms of Redundancy
--- Software Fault Tolerance Techniques
--- Reliable Communication
--- Distributed Commit
--- Failure Recovery
--- On-going Research
--- References

Fault Tolerance
--- The ability of a system to respond gracefully to an unexpected hardware or software failure.
--- There are many levels of fault tolerance, the lowest being the ability to continue operation in the event of a power failure.

Importance in Distributed Systems
Computer systems are not very reliable:
--- OS crashes frequently (Windows), buggy software, unreliable hardware, SW/HW incompatibilities
--- Growing popularity of the Internet/World Wide Web
--- Example: what if your TV (or car) broke down every day? Users don’t want to restart the TV or fix it by opening it up.
So we need to make our computer systems more reliable and dependable.

Types of Faults
Nature
--- Systematic
--- Random
Duration
--- Transient
--- Intermittent
--- Permanent
Extent
--- Global
--- Local

Measurement of Faults
--- Fault Removal Coverage
--- Fault Detection Coverage
--- Fault Tolerance Coverage

Failure Models
Type of Failure      Description
Crash Failure        Server halts, but is working correctly until it halts
Omission Failure     Server fails to respond to incoming requests
Timing Failure       Server’s response lies outside the specified time interval
Response Failure     Server’s response is incorrect
Arbitrary Failure    Server may produce arbitrary responses at arbitrary times

Redundancy and its Forms
Redundancy performs the same computation ‘n’ times, so if one instance fails another can take over.
Forms of Redundancy
--- Hardware Redundancy
--- Software Redundancy
--- Information Redundancy
--- Temporal (time) Redundancy

Software Fault Tolerance Techniques
--- N-Version Programming: different implementations of the same program, in order to avoid identical design faults
--- Block Recovery: duplication of various critical software modules

Reliable Communication
One-to-one communication
--- Use a reliable transport protocol (TCP) or handle failures at the application layer
--- Possibilities:
    1. Client unable to locate server
    2. Lost request messages
    3. Server crashes after receiving request
    4. Lost reply messages
    5. Client crashes after sending request
One-to-many communication
--- Reliable multicast
    1. Lost messages need to be retransmitted
--- Possibilities:
    1. ACK-based schemes: sender can become a bottleneck
    2. NACK-based systems

Distributed Commit
--- Atomic multicast: all processes in a group perform an operation, or none of them do
--- Problem of distributed commit: all-or-nothing operations in a group of processes
--- Possible approaches: 2-phase commit and 3-phase commit

2-Phase and 3-Phase Commit
--- A coordinator process coordinates the operation
--- 2-phase commit involves 2 phases:
    1. Voting phase: processes vote on whether to commit
    2. Decision phase: actually commit or abort
--- Problem: if the coordinator crashes, the processes block
--- 3-phase commit: a variant of 2-phase commit that avoids blocking

Recovery
--- The techniques thus far allow failure handling
--- Recovery: the operations that must be performed after a failure to return the system to a correct state
--- Techniques:
    1. Checkpointing
    2. Message logging

Checkpointing
--- Periodically checkpoint state
--- Upon a crash, roll back to a previous checkpoint with a consistent state
--- Types: independent checkpointing, coordinated checkpointing

Message Logging
Checkpointing is expensive:
1. All processes restart from the previous consistent cut
2. Taking a snapshot is expensive
3. All computations from the previous snapshot have to be redone
Combine checkpointing (expensive) with message logging (cheap):
1. Take infrequent checkpoints
2. Log all messages between checkpoints to local stable storage
3. To recover: simply replay the messages logged since the previous checkpoint, avoiding recomputation from that checkpoint

On-going Research
--- Intelligent / adaptive fault tolerance

References
--- “Fault Tolerance in Distributed Systems” by Pankaj Jalote
--- “Adaptive Fault Tolerance in Distributed Systems” by Roger Bharath, Melanie Dumas and Mevlut Erdem Kurul

Questions?
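
Illustration: 2-phase commit. The voting and decision phases described in the Distributed Commit slides can be sketched roughly as follows. This is a minimal single-process sketch, not the presenters' code: the names Vote, Participant and two_phase_commit are hypothetical, messages are modelled as plain method calls, and coordinator crashes (the blocking problem the slides mention) are not modelled.

```python
from enum import Enum


class Vote(Enum):
    COMMIT = "commit"
    ABORT = "abort"


class Participant:
    """Hypothetical group member; can_commit determines how it votes."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def vote(self):
        # Voting phase: report whether this process is able to commit.
        self.state = "ready" if self.can_commit else "abort"
        return Vote.COMMIT if self.can_commit else Vote.ABORT

    def decide(self, decision):
        # Decision phase: apply the coordinator's global decision.
        self.state = decision.value  # "commit" or "abort"


def two_phase_commit(participants):
    """Coordinator: commit only if every participant votes to commit."""
    # Phase 1 (voting): collect one vote from each participant.
    votes = [p.vote() for p in participants]
    unanimous = all(v is Vote.COMMIT for v in votes)
    decision = Vote.COMMIT if unanimous else Vote.ABORT
    # Phase 2 (decision): tell every participant the global outcome.
    for p in participants:
        p.decide(decision)
    return decision
```

A single abort vote is enough to abort the whole group, which is what gives the all-or-nothing behaviour. In a real deployment each call would be a network message with a timeout, and that is where the blocking problem arises.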
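
Illustration: checkpointing combined with message logging. The recovery scheme from the Message Logging slide (take infrequent checkpoints, log messages in between, replay the log after a crash) might look roughly like this. It is a sketch under stated assumptions: the class LoggedProcess and its methods are hypothetical, an in-memory list stands in for local stable storage, and message handling is assumed to be deterministic so that replay reproduces the lost state.

```python
import copy


class LoggedProcess:
    """Hypothetical process that sums the integer messages it receives."""

    def __init__(self):
        self.state = {"total": 0}
        self.checkpoint = copy.deepcopy(self.state)
        self.log = []  # stand-in for local stable storage

    def handle(self, msg):
        # Deterministic handler: the same messages always yield the same state.
        self.state["total"] += msg

    def receive(self, msg):
        # Log the message before processing it, so it survives a crash.
        self.log.append(msg)
        self.handle(msg)

    def take_checkpoint(self):
        # Snapshot the state; earlier log entries are no longer needed.
        self.checkpoint = copy.deepcopy(self.state)
        self.log.clear()

    def crash(self):
        # Simulate losing all volatile state.
        self.state = None

    def recover(self):
        # Roll back to the last checkpoint, then replay the logged messages.
        self.state = copy.deepcopy(self.checkpoint)
        for msg in self.log:
            self.handle(msg)
```

For example, a process that receives 5, takes a checkpoint, receives 3 and 4, and then crashes can recover its state by restoring the checkpoint and replaying only the two logged messages.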

