Unformatted text preview:

CS 372: Operating Systems Mike Dahlin 1 Lecture #25: 2-phase commit ********************************* Review -- 1 min ********************************* Motivation Basic NW communication 3 problems  performance  reliability  security Case study: Distributed file systems ********************************* Outline - 1 min **********************************  General’s paradox  2-phase commit  Reliable message queues ********************************* Preview - 1 min ********************************* If time permits: security ********************************* Lecture - 20 min ********************************* 1. TBD Finish file systems 2. Reliability Lamport: “A distributed system is a system where I can’t get any work done if a machine I’ve never heard of crashes.”CS 372: Operating Systems Mike Dahlin 2 3. General’s paradox Want to be able to reliably coordinate activity on two different machines (e.g., both do the same thing at same time, exactly once semantics, atomically update state on two different machines, etc.) e.g., atomically move directory from file server A to file server B e.g., atomically move $100 from my account to Visa account Challenge:  messages can be lost  machines can crash Can I use messages and retries over an unreliable network to synchronize two machines so that they are guaranteed to do same op at same time? Remarkably, no. Even if all messages end up getting through. Even if no machines crash. General’s paradox: two generals on separate mountains. Can only communicate via messengers; the messengers can get lost or be captured Need to coordinate the attack; if they attack at different times, then they all die. If they attack at same time, they win. A B 11AM OK? OK. 11’s good for me so, 11 it is? Yeah, but what if you dont get this ackCS 372: Operating Systems Mike Dahlin 3 Even if all messages are delivered, can’t coordinate (B/c a chance that the last message doesn’t get through). Can’t simultaneously get two machines to agree to do something at same time No solution to this – one of the few things in CS that is just impossible. Proof: by induction 3.1 Network failures Since I cannot solve General’s Paradox, let me solve a related problem: at least once delivery For now, assume no machine failures. Just network failures. (1) communication interruption  lost message  lost reply  cut wire  … Simple solution: Request/acknowledge protocol Common case: 1) Sender sends message (msg, msgId) and sets timer 2) Receiver receives message and sends (ack, msgId) 3) Sender receives (ack, msgId) and clears timer If timer goes off, goto (1) How does this work? What does it guarantee?  What if msg 1 lost?  What if ack lost? Guarantees at least once semantics assuming no machines crash or otherwise discontinue protocol  Receiver guaranteed to recv message at least onceCS 372: Operating Systems Mike Dahlin 4  Receiver may recv message multiple times. Receiver MAY use sequence number to filter repeated transmissions so that each is acted upon just once  3.1.1 At least once delivery safety: If call at sender returns, message was processed by receiver at least once liveness: if sender repeatedly sends until call returns and network eventually repaired and operates correctly long enough for a send/receive to occur, then eventually message is processed by receiver (at least once) [[until call returns => no crash, no timeout/give up]] Example: NFS “idempotent” requests 3.1.2 Exactly once delivery Example: TCP/IP reliable stream safety: If call at sender returns, message was processed by receiver exactly once liveness: if sender repeatedly sends until call returns and network eventually repaired and operates correctly long enough for a send/receive to occur, then eventually message is processed by receiver (exactly once) [[note: implementation typically requires sender and receiver to maintain state; cannot lose state in crash...] 3.1.3 Limitation: What if a machine crashes? Do we still get “at least once” semantics if machines can crash? NFS/RPC: no. NFS Solution – blocking calls – don’t return until remote operation completes.CS 372: Operating Systems Mike Dahlin 5 Note: after a crash, operation may have happened zero, once, or ten times. Do we still get exactly once semantics if machine can crash? TCP: no TCP solution: (1) If sender or receiver crash or network partition causes either to give up, no guarantee of “at least once” – local send may complete before data received by remote machine  if send request not return, data may be received 0 or 1 time  if send request does return, data may be received 0 or 1 time (2) at most once semantic – if crash causes sender to reuse sequence numbers, no guarantee of at most once…s hacks to make it very unlikely – pick sequence numbers unlikely to overlap with prev attempts; don’t re-use port numbers until “pretty sure” both sides know connection is closed (two generals)  very unlikely that after receiver crashes, a resend will be accepted as a first send (~at most once semantics…) Don’t just worry about crashes. What about “giving up.” Suppose I try to send for 10 seconds and get no reply – should I report “failure” to the user? 1 minute? 10 hours? What are at least once/at most once semantics now? Bottom line: If machines can crash or give up (e.g., during a network partition), then messages can be received 0, 1, or N times  these things help  but still have corner cases to worry about  These corner cases sometimes OK (e.g., TCP/IP – if one side gives up, eventually tear down the connection and hand an error up to higher level – let the higher level protocol recover (or exit)  Sometimes they require recovery protocols (e.g., AFS callback recovery)CS 372: Operating Systems Mike Dahlin 6 Can we provide a more powerful abstraction? 4. Machine failures Several variations: ♦ user level bug causes address space to crash ♦ machine failure, kernel bug causes all AS on same machine to fail ♦ power outage causes all machines to fail Before, whole system would crash. Now: one machine can crash, while others stay up. Now, one machine can crash, while others stay up. If file server goes down, what do the other machines do? Example: simple send/ack


View Full Document

UT CS 372 - LECTURE NOTES

Documents in this Course
MapReduce

MapReduce

17 pages

Processes

Processes

19 pages

MapReduce

MapReduce

17 pages

Load more
Download LECTURE NOTES
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view LECTURE NOTES and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view LECTURE NOTES 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?