UMass Amherst CS 677 - Today- Fault Tolerance (14 pages)


Today: Fault Tolerance





Pages:
14
School:
University of Massachusetts Amherst
Course:
CS 677 - Distributed and Operating Systems


Computer Science, CS677: Distributed OS, Lecture 16

Today: Fault Tolerance
  Failure models
  Agreement in presence of faults
    Two army problem
    Byzantine generals problem
  Reliable communication
  Distributed commit
    Two phase commit
    Three phase commit
  Failure recovery
    Checkpointing
    Message logging

Replica Management
  Replica server placement
    Web: geographically skewed request patterns
    Where to place a proxy?
    K-clusters algorithm
  Permanent replicas versus temporary
    Mirroring: all replicas mirror the same content
    Proxy server: on-demand replication
  Server-initiated versus client-initiated

Content Distribution
  Will come back to this in Chap 12
  CDN: network of proxy servers
  Caching
    Update versus invalidate
    Push versus pull-based approaches
    Stateful versus stateless
  Web caching: what semantics to provide?

Final Thoughts
  Replication and caching improve performance in distributed systems
  Consistency of replicated data is crucial
  Many consistency semantics (models) possible
    Need to pick appropriate model depending on the application
    Example: web caching. Weak consistency is OK since humans are tolerant to stale information (can reload browser)
    Implementation overheads and complexity grow if stronger guarantees are desired

Fault Tolerance
  Single machine systems
    Failures are all or nothing
    OS crash, disk failures
  Distributed systems: multiple independent nodes
    Partial failures are also possible (some nodes fail)
  Question: Can we automatically recover from partial failures?
    Important issue since probability of failure grows with number of independent components (nodes) in the system
    Prob(failure) = Prob(any one component fails) = 1 - P(no failure)

A Perspective
  Computing systems are not very reliable
    OS crashes frequently (Windows), buggy software, unreliable hardware, software hardware
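
The failure-probability relation on the Fault Tolerance slide, Prob(failure) = 1 - P(no failure), can be made concrete with a small sketch. It assumes, beyond what the slide states, that all n components fail independently and with the same probability p, so that P(no failure) = (1 - p)^n. The function name prob_system_failure and the sample values of p and n are illustrative only.

```python
def prob_system_failure(p: float, n: int) -> float:
    """Probability that at least one of n components fails.

    Assumption (not from the slides): components fail independently,
    each with the same probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    # P(no failure) = (1 - p)^n, so P(failure) = 1 - (1 - p)^n
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    # Hypothetical per-node failure probability of 1%
    for n in (1, 10, 100, 1000):
        print(f"n={n:5d}  P(failure)={prob_system_failure(0.01, n):.4f}")
```

Under these assumptions the result rises toward 1 as n grows, which is the slide's point: even when each node is individually reliable, a system with many nodes should expect partial failures.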


