Recovery Strategies

Event Sequence
– fault detection
– fault location
– system reconfiguration
– system recovery
  » different for distributed and shared memory systems
– continuation of operation

Trivial Recovery
– termination of program and re-execution
  » performance disadvantage, unrealistic

© 2007 A.W. Krings

Desirable Recovery
– Forward Recovery
  » Would like to continue execution right from where the error was detected and use an alternative mechanism to ensure correctness
– Rollback Recovery
  » Roll back the program to a previous correct state
  » If rollback is inevitable, prefer not to roll back too far
  » Need to frequently store system state that can be “rolled back” to
– Checkpointing
  » Storing system state information at discrete points in the program
  » Q: What constitutes a system state?
  » A: Depends on system…

Forward Recovery
– e.g. Recovery blocks
  » approach based on software redundancy
  » introduced by Brian Randell (1975)
  » basic idea is that a language construct is used that supports software redundancy
      assume acceptance test T and “try blocks” Bi
      B1 is the primary try block, Bk is the (k-1)th alternative
      ensure T by B1 else by B2 … else by Bn else error

Rollback Recovery based on Checkpoints
– We will first discuss this topic at the processor level
  » Processor Cache-Based Checkpoints
  » Virtual Checkpoints

Processor Cache-Based Checkpoints
– Powerful for building a machine that can tolerate transient faults
– Checkpoint defined by storing registers in safe area of main memory and writing back cache lines to main memory.
  » safe main memory area (stable storage)
  » => e.g. battery back-up, redundant memory
– Program now executes using active data in cache.
  » i.e. Write-through cache policy is not suitable
– Cache miss forces a checkpoint
– Example
  » Active State: CPU, Registers, Cache
  » Checkpoint State: Main Memory, CPU Register Save Area
  » Cache employs copy-back policy, i.e.
    changes are made in cache only.
  » Now assume a parity error
  » Rollback by invalidating dirty cache line or page
      reload register state from CPU Register Save Area
  [Figure (Pradhan96 Fig 3.7): active state = CPU Registers, Cache; checkpoint state = Main Memory, CPU Register Save Area]
– What happens if failure occurs during copying the state?
  » then the checkpoint state has been partially updated
  » neither the old nor the new state is valid
– Sequoia fault tolerant multiprocessor
  » duplex processors operating in lockstep
  » duplexed main memory banks provide atomic checkpoint process
  » uses 2 banks and time stamps (Pradhan96 Fig 3.8)
  » sequence: TA1 - flush - TA2 - TB1 - flush - TB2
  » use partial order of time stamps to determine which bank is corrupt
  [Figure: Timestamp A1, flush data, Timestamp A2; Timestamp B1, flush data, Timestamp B2]

  Condition                Failure    Action
  TA1 = TA2 = TB1 = TB2    None       None
  TA1 > TA2 = TB1 = TB2    Flush A    Copy Bank-B to A
  TA1 = TA2 > TB1 = TB2    Between    Copy Bank-A to B
  TA1 = TA2 = TB1 > TB2    Flush B    Copy Bank-A to B
  (Pradhan96 table 3.4)

  Note that only three time stamps are needed since TA2=TB1
  Interpretation of “=”: E.g. if both TA1 and TA2 are written (TA1 = TA2) then the checkpoint has been successfully written.

Virtual Checkpoints
– Drawbacks of cache-based checkpointing
  » checkpoint frequency depends on cache size
  » checkpoint frequency can be very high
  » resulting in high performance overhead
– When checkpoint frequency is high => move strategy from processor cache into virtual memory.
  » thus include the state of memory in the checkpoint
  » use disk to store checkpoint
  » derivation of scheme where global and local checkpoint numbers V and v are used.
  » now, the active page will become the checkpoint when V is updated.
– So far only the uni-processor case has been considered
– “Processor Pair” approach
  » used in Tandem
  » primary process executes actively
  » backup process is inactive
  » primary periodically sends up-to-date messages containing its state to the backup process
  » same principle as traditional roll-back scheme; however, instead of a checkpoint, the state is saved in the backup process
  » assumes Fail-Stop behavior of processors
  [Figure: Primary Process and Backup Process, labeled “checkpointed state” and “I’m alive message”]

Rollback Recovery in Multiprocessors
– communication needs to be considered
  » effect of communication since last checkpoint
  » communication might have inherent delays
– different issues in different systems
– tightly coupled multiprocessors (shared memory)
– loosely coupled multiprocessors (no shared memory)

Tightly Coupled Multiprocessors
– use write-back strategy
  » possibility for multiple copies of the same data in different caches
  » need cache coherence protocol
      if different processors update a certain cached page, inconsistency may/will arise
  » write-through strategy would result in extremely high checkpoint frequency since every write would result in a new checkpoint
– Cache Coherence Protocols
  » single-bus-based
      write-invalidate:
        – whenever a write is to be performed, send out a signal to invalidate all preexisting copies of the same piece of data in other caches before updating the local copy in the cache
      write-update:
        – better if writes are less frequent
        – deliver write-data to other users simultaneously whenever a write access to a shared cache block is generated
  » directory-based
      use table (called directory) to keep tabs on the state of each memory block
      various schemes differ in their table management
– The Problem of Rollback Propagation
  » with multiple processes, data dependency might cause one rollback to trigger a domino effect,
  » i.e. the rollback of a process requires the rollback of another process ...
– Ahmed’s Solution
  » additional bus lines
      shared: set by processor to indicate sharing of a block on the bus
      establish rollback point: set by processor to indicate that a rollback point is being established
        – this causes all other processes to also establish checkpoints
        – causes problems as the number of processes increases
      rollback: set by processor to indicate that it is backing up to the prior
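
The rollback propagation problem noted above (the rollback of one process requiring the rollback of another) can be illustrated with a small sketch. It is not taken from the slides: the function name `rollback_set` and the `consumers` bookkeeping (which processes have read data from which since their last checkpoints) are hypothetical, and the model is simplified to a single level of checkpoints, so it shows who is dragged into a rollback rather than how far back each process must go.

```python
from collections import deque

def rollback_set(failed, consumers):
    """Return all processes that must roll back when `failed` rolls back.

    `consumers[p]` lists the processes that have used data produced by p
    since their last checkpoint (hypothetical bookkeeping for illustration).
    """
    must_roll_back = {failed}
    pending = deque([failed])
    while pending:
        p = pending.popleft()
        # Anything that consumed data from a rolled-back process is now
        # based on state that no longer exists, so it must roll back too.
        for q in consumers.get(p, ()):
            if q not in must_roll_back:
                must_roll_back.add(q)
                pending.append(q)
    return must_roll_back

# P2 read from P1, P3 read from P2; P4 only produced data for P1.
deps = {"P1": ["P2"], "P2": ["P3"], "P3": [], "P4": ["P1"]}
print(sorted(rollback_set("P1", deps)))  # ['P1', 'P2', 'P3']
```

In this simplified model P4 is unaffected because it never consumed data from a rolled-back process; a fuller treatment would also have to account for messages P4 sent that the rolled-back processes no longer remember receiving.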
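
Returning to the recovery block construct from earlier (ensure T by B1 else by B2 … else by Bn else error), the control flow can be sketched as follows. This is only an illustration under stated assumptions: the names `recovery_block`, `make_acceptance_test`, `b1` and `b2` are hypothetical, the example task (sorting a list) is invented, and the sketch assumes that each try block starts from a saved copy of the input state, as in Randell's scheme.

```python
import copy

def recovery_block(state, acceptance_test, try_blocks):
    """Run B1, B2, ..., Bn in order until one passes acceptance test T."""
    for block in try_blocks:
        # Each try block works on its own copy, so a failed attempt
        # cannot corrupt the state seen by the next alternative.
        candidate = copy.deepcopy(state)
        try:
            result = block(candidate)
        except Exception:
            continue  # an exception counts as a failed try block
        if acceptance_test(result):
            return result
    raise RuntimeError("error: no try block passed the acceptance test")

def make_acceptance_test(original):
    # T: result is ordered and has lost no elements (a cheap sanity check).
    def accept(result):
        ordered = all(a <= b for a, b in zip(result, result[1:]))
        return ordered and len(result) == len(original)
    return accept

def b1(data):
    return data[:-1]      # deliberately faulty primary: drops an element

def b2(data):
    return sorted(data)   # simple, trusted alternative

data = [3, 1, 2]
print(recovery_block(data, make_acceptance_test(data), [b1, b2]))  # [1, 2, 3]
```

The acceptance test is deliberately cheaper than recomputing the answer; it only checks properties that any correct result must have.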