FSU COP 5611 - Lecture 26 OS Reliability


Operating System Reliability
Andy Wang
COP 5611 Advanced Operating Systems

Some Axioms
  - Some simple systems designed from scratch sometimes work
  - A complex system that works is invariably found to have evolved from a simple system that works
  - A complex system designed from scratch never works

Failure Mode Theorems
  - Complex systems usually operate in failure mode
  - A system should have safe behaviors when encountering failures
  - When a fail-safe system fails, it fails by failing to fail safe

Some Definitions
  - Failure of a system occurs when the system does not perform its services in the manner specified
  - Fault is an anomalous physical condition
  - Sometimes failures are subtle (e.g., performance fault)
  - Includes system specification, implementation mistakes
  - Error is the part of system state that differs from its intended value

Classification of Failures
  - Process failures
  - System failures
  - Secondary storage failures
  - Communication medium failures

Process Failures
  - Examples
  - Errors leading to failure
      - Computation results in incorrect outcome
      - System state deviates from specification
      - Process fails to progress
      - Deadlock, timeout, protection violation
      - Bad input, consistency violation
  - Ignoring malicious behavior

System Failures
  - Processor fails to execute
      - Software error, hardware error (CPU, bus, etc.)
  - Fail-stop behavior assumed
  - Failure types
      - Amnesia
      - Partial amnesia
      - Pause
      - Halting

Secondary Storage Failures
  - Stored data inaccessible
      - Parity error
      - Head crash
      - Contaminated medium
  - Reconstructable from archive, log (maybe)
  - Mirrored disks (independent failure mode)

Communication Medium Failures
  - Site can't communicate with another site
  - Causes
      - Switching node failure
      - Link failure
      - Hardware failure
      - Software failure
      - Congestion
      - Hardware, implementation failure
  - Network partitions can result

Recovery
  - Restart process, processor
  - Reclaim resources
  - Undo, finish incomplete transactions
  - Concurrency makes things harder

Forward Error Recovery
  - Goal: to restore system from erroneous state to error-free state
  - If nature of error is completely known
      - Remove error from state
      - Proceed with execution from error-free state
  - Rarely possible to do

Backward Error Recovery
  - When error source unknown
      - Restore state to previous error-free state, restart
  - Independent of fault, errors causing fault
  - Problems
      - Performance penalty
      - No guarantee fault will not reoccur
      - Possible unrecoverable component of state
  - Recovery point: state used to replace error

Backward Error Recovery
  - Basic approaches
      - Operation based
          - Logs
          - Update in place
          - Write-ahead log
      - State based

Update in Place
  - Every update to object also records the log
      - Name of object
      - Old and new states of object
  - Recoverable update operation implemented as
      - Do, undo, redo operations

Write-ahead Log
  - Update in place has a problem if crash occurs between update and log recorded to stable storage
  - Update object only after undo log recorded
  - Before committing updates, record both redo and undo logs
  - Expensive to write log to stable storage

State-Based Recovery
  - Save entire process state at recovery point
  - Recovery point called checkpoint
  - Rolling back: process restoring to checkpoint
  - Tradeoff: frequent checkpoints vs. completion delay
  - Shadow pages
      - Save unmodified page copy on stable storage
      - Update only volatile copy, discard on rollback

Concurrent Systems Recovery
  - Rollback issues
      - Orphan messages
      - Domino effect
      - Lost messages
      - Livelocks

Orphan Messages
  [Figure: timelines of processes X, Y, Z with recovery points x1, x2, x3, y1, y2, z1, z2 and message m]

Domino Effect
  - Suppose Y rolls back to y2
      - m is orphan message
      - Process X must roll back to x2
  - Suppose Z rolls back to z2
      - Y rolls back to y1
      - Forcing Z to roll back to z1

Lost Messages
  [Figure: timelines of processes X and Z with recovery points x1, z1, message m, and a failure]

Live Locks
  [Figure: timelines of processes X and Z with recovery points x1, z1 and repeated failure]

Concurrent Recovery
  - Coordination required at either
      - Time of establishing checkpoints
      - Beginning of recovery

Checkpoint Assumptions
  - Communication via messages
  - Unreliable FIFO channels
      - Higher-level end-to-end protocols assumed
      - Subsumes rollback-caused message loss
  - No network partitions from communication failures

Checkpoint Algorithm Concepts
  - Permanent and tentative checkpoints
      - Saved on stable storage
      - Permanent: part of known consistent global checkpoint
      - Tentative: until successful termination of checkpoint algorithm
  - Rolls back only to permanent checkpoints

Synchronous Checkpoint Algorithms
  - Two-phase commit
  - Problems
      - Message overhead for synchronizations
      - Synchronization delays
      - Costly when failures are rare

Asynchronous Checkpointing
  - Local checkpoints taken independently
  - Log all incoming messages on stable storage
      - Minimizes undone computation
      - Allows reprocessing of messages after rollback

Asynchronous Checkpointing Assumptions
  - Reliable FIFO communication channels
  - Infinite buffers
  - Event-driven computation
      - A process is idle until message received
      - Processes message and changes state
      - Sends zero or more messages
  - Can identify each event with monotonically increasing counter

Event-Driven Computation
  [Figure: timelines of processes X, Y, Z with events x1, x2, y1, y2, z1, z2]

Asynchronous Checkpointing
  - Basic idea
      - Save states, messages sent at each event
      - Volatile logging
      - Each processor notes number of messages sent to others and received from others
      - Use counters to determine orphan messages

Summary
  - Failures caused by errors
  - Can remove errors by forward, backward error recovery
  - Backward error recovery more costly, more general
  - Synchronous checkpoints helpful, costly
  - Asynchronous checkpoints messier, domino effects
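
The write-ahead rule from the Write-ahead Log slide (record the undo and redo information on stable storage before touching the object) can be illustrated with a small sketch. Everything here is an assumption made for illustration and not part of the lecture: the class name WriteAheadStore, the recover function, the JSON-lines log file standing in for stable storage, values that are never None, and transactions that do not interleave writes to the same key.

```python
import json
import os


class WriteAheadStore:
    """Toy key-value store (hypothetical); a JSON-lines file plays stable storage."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}  # volatile state, lost on a crash

    def _log(self, record):
        # Append the record and force it to disk before returning.
        with open(self.log_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def update(self, txn, key, new):
        old = self.data.get(key)  # None means the key did not exist
        # Write-ahead rule: log old (undo) and new (redo) values first ...
        self._log({"type": "update", "txn": txn, "key": key,
                   "old": old, "new": new})
        # ... and only then update the object in place.
        self.data[key] = new

    def commit(self, txn):
        self._log({"type": "commit", "txn": txn})


def recover(log_path):
    """Rebuild the state from the log after a crash."""
    with open(log_path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    committed = {r["txn"] for r in records if r["type"] == "commit"}

    data = {}
    # Redo pass: replay every logged update in log order.
    for r in records:
        if r["type"] == "update":
            data[r["key"]] = r["new"]
    # Undo pass: walk the log backward and restore the old value
    # for every update whose transaction never committed.
    for r in reversed(records):
        if r["type"] == "update" and r["txn"] not in committed:
            if r["old"] is None:
                data.pop(r["key"], None)
            else:
                data[r["key"]] = r["old"]
    return data
```

If transaction t1 sets a to 1 and commits, and t2 then sets a to 2 and b to 5 but the process crashes before t2 commits, recover returns only the committed state with a equal to 1. The fsync on every record is what makes the log expensive, which is the cost the slide points out.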
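
The idea of using sent and received counters to find orphan messages can be sketched as follows. This is one possible reading of the slide, not the lecture's exact algorithm. The function name find_recovery_line and the history layout (one snapshot of cumulative counters per event, with index 0 being the initial state where all counters are zero) are assumptions made for illustration.

```python
def find_recovery_line(histories, start):
    """Roll processes back until no process has received more messages
    from a peer than that peer, in its restored state, has sent.

    histories[p] is a list of per-event snapshots for process p, each
    {"sent": {peer: count}, "rcvd": {peer: count}} with cumulative counts.
    start[p] is the event index process p initially restores to.
    Returns the event index each process ends up at.
    """
    line = dict(start)
    changed = True
    while changed:  # repeat until a full pass makes no change
        changed = False
        for p in histories:
            for q in histories:
                if p == q:
                    continue
                sent_by_q = histories[q][line[q]]["sent"].get(p, 0)
                # p holds an orphan message from q if it has received
                # more messages than q's restored state has sent.
                while histories[p][line[p]]["rcvd"].get(q, 0) > sent_by_q:
                    line[p] -= 1  # index 0 has zero counters, so this stops
                    changed = True
    return line


# Hypothetical two-process history: X sends m1 to Y, Y replies with m2.
histories = {
    "X": [
        {"sent": {}, "rcvd": {}},                # x0: initial state
        {"sent": {"Y": 1}, "rcvd": {}},          # x1: X sends m1 to Y
        {"sent": {"Y": 1}, "rcvd": {"Y": 1}},    # x2: X receives m2 from Y
    ],
    "Y": [
        {"sent": {}, "rcvd": {}},                # y0: initial state
        {"sent": {}, "rcvd": {"X": 1}},          # y1: Y receives m1
        {"sent": {"X": 1}, "rcvd": {"X": 1}},    # y2: Y sends m2 to X
    ],
}
```

If X fails and restores to x1 while Y stays at y2, no counter is violated and both keep their states; m2 is then a lost message, which the message log has to replay. If X instead restores to x0, Y has received a message that X never sent, so Y is rolled back to y0. That cascade is a small instance of the domino effect.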

