UI CS 449 - Recovery Strategies
Course Cs 449-
Pages 26

Recovery StrategiesEvent Sequence–fault detection–fault location–system reconfiguration–system recovery »different for distributed and shared memory systems–continuation of operationTrivial Recovery–termination of program and re-execution»performance disadvantage, unrealistic© 2007 A.W. Krings1Recovery StrategiesDesirable Recovery–Forward Recovery»Would like to continue execution right from where the error was detected and ensure alternative mechanism to ensure correctness–Rollback Recovery»Rollback program to a previous correct state»If rollback is inevitable, prefer not to roll back too far»Need to frequently store system state that can be “rolled back” to–Checkpointing»Storing system state information at discrete points in the program»Q: What constitutes a system state? »A: Depends on system…© 2007 A.W. Krings2Recovery StrategiesForward Recovery–e.g. Recovery blocks»approach based on software redundancy»introduced by Brian Randell (1975)»basic idea is that a language construct is used that supports software redundancy assume acceptance test T and “try blocks” BiB1 is the primary try block, Bk is the (k-1)th alternative ensure T by B1 else by B2 … else by Bn else error© 2007 A.W. Krings3Recovery StrategiesRollback Recovery based on Checkpoints–We will first discuss this topic at the processor level»Processor Cache-Based Checkpoints»Virtual CheckpointsProcessor Cache-Based Checkpoints–Powerful for building a machine that can tolerate transient faults–Checkpoint defined by storing registers in safe area of main memory and writing back cache lines to main memory.»safe main memory area (stable storage) »=> e.g. battery back-up, redundant memory–Program now executes using active data in cache.»i.e. Write-through cache policy is not suitable–Cache miss forces a checkpoint© 2007 A.W. Krings4Recovery Strategies–Example»Active State: CPU, Registers, Cache»Checkpoint State: Main Memory, CPU Register Save Area»Cache employs copy-back policy, i.e. 
changes are made in cache only.»Now assume a parity error»Rollback by invalidating dirty cache line or pagereload register state from CPU Register Save AreaCPU RegistersMain MemoryCPU Register Save AreaCachecheckpoint stateactive statePradhan96 Fig 3.7© 2007 A.W. Krings5Recovery Strategies–What happens if failure occurs during copying the state?»then the checkpoint state has been partially updated»neither old nor new state are valid–Sequoia fault tolerant multiprocessor»duplex processors operating in lockstep, »duplexed main memory banks provide atomic checkpoint process»uses 2 banks and time stamps (Pradhan96 Fig 3.8)»sequence: TA1 - flush - TA2 - TB1 - flush - TB2»use partial order of time stamps to determine which bank is corruptTimestamp A1 flush dataTimestamp A2Timestamp B1 flush dataTimestamp B2© 2007 A.W. Krings6Recovery StrategiesConditionFailureActionTA1 = TA2 = TB1 = TB2NoneNoneTA1 > TA2 = TB1 = TB2Flush ACopy Bank-B to ATA1 = TA2 > TB1 = TB2BetweenCopy Bank-A to BTA1 = TA2 = TB1 > TB2Flush BCopy Bank-A to BPradhan96 table 3.4Note that only three time stamps are needed since TA2=TB1Interpretation of “=“: E.g. if both TA1 and TA2 are written (TA1 = TA2) then checkpoint has been successfully written.© 2007 A.W. Krings7Recovery StrategiesVirtual Checkpoints–Drawbacks of cache based checkpointing»checkpoint frequency depends on cache size »checkpoint frequency can be very high»resulting in high performance overhead–When checkpoint frequency is high => move strategy from processor cache into virtual memory.»thus include the state of memory into the checkpoint »use disk to store checkpoint»derivation of scheme where global and local checkpoint numbers V and v are used. »now, the active page will become the checkpoint when V is updated.© 2007 A.W. 
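
The Sequoia time-stamp table (Pradhan96 Table 3.4) earlier in this section can be read as a decision procedure. The following is a minimal sketch of that reading, not the Sequoia implementation: the function name `diagnose_banks` and the use of plain integers as time stamps are assumptions made for illustration.

```python
def diagnose_banks(ta1, ta2, tb1, tb2):
    """Map the four bank time stamps to (failure, action) per the table.

    Hypothetical helper: time stamps are modeled as integers, and a newer
    checkpoint carries a larger value.
    """
    if ta1 == ta2 == tb1 == tb2:
        return ("None", "None")
    if ta1 > ta2 and ta2 == tb1 == tb2:
        # Failure while flushing into Bank A: Bank B still holds the old checkpoint.
        return ("Flush A", "Copy Bank-B to A")
    if ta1 == ta2 and ta2 > tb1 and tb1 == tb2:
        # Bank A is complete, Bank B has not been started yet.
        return ("Between", "Copy Bank-A to B")
    if ta1 == ta2 == tb1 and tb1 > tb2:
        # Failure while flushing into Bank B: Bank A holds the new checkpoint.
        return ("Flush B", "Copy Bank-A to B")
    raise ValueError("time stamps do not match the TA1 - TA2 - TB1 - TB2 write order")


if __name__ == "__main__":
    print(diagnose_banks(6, 5, 5, 5))
    print(diagnose_banks(6, 6, 5, 5))
```

Any combination that does not fit one of the four rows is rejected rather than guessed at, since the table does not say how such a state should be handled.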
– So far only uni-processor has been considered
– “Processor Pair” approach
  » used in Tandem
  » primary process executes actively
  » backup process is inactive
  » primary periodically sends up-to-date messages containing its state to the backup process
  » same principle as traditional roll-back scheme, however instead of checkpoint, the state is saved in backup process
  » assumes Fail-Stop behavior of processors
  [Figure: the primary process sends its checkpointed state to the backup process; primary and backup exchange “I’m alive” messages.]

Rollback Recovery in Multiprocessors
– communication needs to be considered
  » effect of communication since last checkpoint
  » communication might have inherent delays
– different issues in different systems
– tightly coupled multiprocessors (shared memory)
– loosely coupled multiprocessors (no shared memory)

Tightly Coupled Multiprocessors
– use write-back strategy
  » possibility for multiple copies of the same data in different caches
  » need cache coherence protocol
    if different processors update a certain cached page, inconsistency may/will arise
  » write-through strategy would result in extremely large checkpoint frequency since every write would result in new checkpoint
– Cache Coherence Protocols
  » single-bus-based
    write-invalidate:
    – whenever write is to be performed, send out signal to invalidate all preexisting copies of the same piece of data in other caches before updating the local copy in the cache
    write-update:
    – better if writes are less frequent
    – deliver write-data to other users simultaneously whenever a write access to a shared cache block is generated
  » directory-based
    use table (called directory) to keep tabs on the state of each memory block
    various schemes differ in their table management
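
The write-invalidate protocol described above can be sketched in a few lines. This is a toy model under stated assumptions, not a description of any real machine: the `Bus` and `Cache` classes are hypothetical, each block is a single value that a write overwrites completely, and caches use write-back (changes stay in the cache until another processor needs them).

```python
class Bus:
    """Single shared bus connecting all caches to main memory (toy model)."""

    def __init__(self):
        self.memory = {}   # address -> value held in main memory
        self.caches = []   # every cache attached to this bus

    def invalidate(self, addr, writer):
        # Signal all other caches to drop their copy before the writer updates its own.
        for cache in self.caches:
            if cache is not writer:
                cache.lines.pop(addr, None)
                cache.dirty.discard(addr)

    def fetch(self, addr):
        # On a miss, a cache holding a dirty copy writes it back first.
        for cache in self.caches:
            if addr in cache.dirty:
                self.memory[addr] = cache.lines[addr]
                cache.dirty.discard(addr)
        return self.memory.get(addr, 0)


class Cache:
    """Write-back cache using write-invalidate coherence (toy model)."""

    def __init__(self, bus):
        self.bus = bus
        self.lines = {}     # address -> cached value (presence means valid)
        self.dirty = set()  # addresses modified but not yet written back
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = self.bus.fetch(addr)
        return self.lines[addr]

    def write(self, addr, value):
        self.bus.invalidate(addr, writer=self)
        self.lines[addr] = value
        self.dirty.add(addr)


if __name__ == "__main__":
    bus = Bus()
    a, b = Cache(bus), Cache(bus)
    a.write(0x10, 1)
    print(b.read(0x10))      # b misses and obtains a's value via write-back
    b.write(0x10, 2)         # a's copy is invalidated
    print(0x10 in a.lines)
```

A write-update variant would replace the `invalidate` call with a broadcast of the new value to every cache that holds the block.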
– The Problem of Rollback Propagation
  » with multiple processes data dependency might cause one rollback to trigger a domino effect,
  » i.e. the rollback of a process requires the rollback of another process ...
– Ahmed’s Solution
  » additional bus lines
    shared: set by processor to indicate sharing of a block on the bus
    establish rollback point: set by processor to indicate that a rollback point is being established
    – this causes all other processes to also establish checkpoints
    – causes problems as the number of processes increases
    rollback: set by processor to indicate that it is backing up to the prior
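
The rollback propagation problem (domino effect) described above can be illustrated with a small sketch. It assumes a hypothetical bookkeeping table `consumers`, where `consumers[p]` lists the processes that have used data produced by `p` since `p`'s last checkpoint; neither the table nor the function name comes from the slides.

```python
from collections import deque


def rollback_set(failed, consumers):
    """Return every process that must roll back when `failed` rolls back.

    consumers[p] is an assumed record of the processes that consumed data
    from p since p's last checkpoint. Rolling back p invalidates that data,
    so each consumer must roll back too, and so on transitively.
    """
    must_roll_back = {failed}
    queue = deque([failed])
    while queue:
        p = queue.popleft()
        for q in consumers.get(p, ()):
            if q not in must_roll_back:
                must_roll_back.add(q)
                queue.append(q)
    return must_roll_back


if __name__ == "__main__":
    deps = {"P1": ["P2"], "P2": ["P3"], "P4": ["P1"]}
    print(sorted(rollback_set("P1", deps)))
```

In this example a rollback of P1 drags in P2 and then P3, while P4 is unaffected because it only produced data and did not consume any.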

