CORNELL CS 514 - Reliable Distributed System - D2347839

School name: Cornell University

Course: CS 514 - Intermediate Computer Systems

Pages: 59

This preview shows pages 1-4, 27-30, and 56-59 out of 59 pages.


1. Reliable Distributed Systems
2. Reliability and transactions
3. Types of reliability
4. Replicating a transactional server
5. Replication with 2PC
6. Observation
7. Replication and Availability
8. Usual responses?
9. Quorum example
10. Things to notice
11. Next issue?
12. Availability of 2PC
13. What can be done?
14. A quandary?
15. Other options
16. Server replication
17. Primary/backup
18. Slide 18
19. Slide 19
20. Issues?
21. Split brain: reminder
22. Slide 22
23. Slide 23
24. Implication?
25. Real systems
26. How does hardware help?
27. Reconciliation
28. Summary
29. Replication and High Availability
30. Steps to a solution
31. Non-blocking Commit
32. Definition of problem
33. Non-triviality
34. Typical protocol
35. Commit protocol illustrated
36. Slide 36
37. Slide 37
38. Failure issues
39. Failure model impacts costs!
40. Commit with simpler failure model
41. Slide 41
42. Example of a hard scenario
43. Slide 43
44. Slide 44
45. Skeen: Three-phase commit
46. Slide 46
47. Three phase commit protocol illustrated
48. Observations about 3PC
49. Assumptions about failures
50. Problems with 3PC
51. Situation in practical systems?
52. Process groups
53. Failure detection
54. Architecture
55. Slide 55
56. Slide 56
57. Issues
58. GMP design
59. Reading ahead?

Reliable Distributed Systems
- Fault Tolerance (Recoverability, High Availability)

Reliability and transactions
- Transactions are well matched to database model and recoverability goals
- Transactions don’t work well for non-database applications (general purpose O/S applications) or availability goals (systems that must keep running if applications fail)
- When building high availability systems, encounter replication issues

Types of reliability
- Recoverability
  - Server can restart without intervention in a sensible state
  - Transactions do give us this
- High availability
  - System remains operational during failure
  - Challenge is to replicate critical data needed for continued operation

Replicating a transactional server
- Two broad approaches
  - Just use distributed transactions to update multiple copies of each replicated data item
    - We already know how to do this, with 2PC
    - Each server has “equal status”
  - Somehow treat replication as a special situation
    - Leads to a primary server approach with a “warm standby”

Replication with 2PC
- Our goal will be “1-copy serializability”
  - Defined to mean that the multi-copy system behaves indistinguishably from a single-copy system
  - Considerable formal and theoretical work has been done on this
- As a practical matter
  - Replicate each data item
  - Transaction manager
    - Reads any single copy
    - Updates all copies

Observation
- Notice that transaction manager must know where the copies reside
- In fact there are two models
  - Static replication set: basically, the set is fixed, although some members may be down
  - Dynamic: the set changes while the system runs, but only has operational members listed within it
- Today stick to the static case

Replication and Availability
- A series of potential issues
  - How can we update an object during periods when one of its replicas may be inaccessible?
  - How can 2PC protocol be made fault-tolerant?
    - A topic we’ll study in more depth
    - But the bottom line is: we can’t!

Usual responses?
- Quorum methods:
  - Each replicated object has an update and a read quorum
  - Designed so Qu+Qr > # replicas and Qu+Qu > # replicas
  - Idea is that any read or update will overlap with the last update

Quorum example
- X is replicated at {a,b,c,d,e}
- Possible values?
  - Qu = 1, Qr = 5 (violates Qu+Qu > 5)
  - Qu = 2, Qr = 4 (same issue)
  - Qu = 3, Qr = 3
  - Qu = 4, Qr = 2
  - Qu = 5, Qr = 1 (violates availability)
- Probably prefer Qu=4, Qr=2

Things to notice
- Even reading a data item requires that multiple copies be accessed!
- This could be much slower than normal local access performance
- Also, notice that we won’t know if we succeeded in reaching the update quorum until we get responses
  - Implies that any quorum replication scheme needs a 2PC protocol to commit

Next issue?
- Now we know that we can solve the availability problem for reads and updates if we have enough copies
- What about for 2PC?
  - Need to tolerate crashes before or during runs of the protocol
  - A well-known problem

Availability of 2PC
- It is easy to see that 2PC is not able to guarantee availability
  - Suppose that manager talks to 3 processes
  - And suppose 1 process and manager fail
  - The other 2 are “stuck” and can’t terminate the protocol

What can be done?
- We’ll revisit this issue soon
- Basically,
  - Can extend to a 3PC protocol that will tolerate failures if we have a reliable way to detect them
  - But network problems can be indistinguishable from failures
  - Hence there is no commit protocol that can tolerate failures
- Anyhow, cost of 3PC is very high

A quandary?
- We set out to replicate data for increased availability
- And concluded that
  - Quorum scheme works for updates
  - But commit is required
  - And represents a vulnerability
- Other options?

Other options
- We mentioned primary-backup schemes
- These are a second way to solve the problem
- Based on the log at the data manager

Server replication
- Suppose the primary sends the log to the backup server
- It replays the log and applies committed transactions to its replicated state
- If primary crashes, the backup soon catches up and can take over

Primary/backup
[Figure labels: primary, backup, log]
Clients initially connected to primary, which keeps backup up to date. Backup tracks log.

Primary/backup
[Figure labels: primary, backup]
Primary crashes. Backup sees the channel break, applies committed updates. But it may have missed the last few updates!

Primary/backup
[Figure labels: primary, backup]
Clients detect the failure and reconnect to backup. But some clients may have “gone away”. Backup state could be slightly stale. New transactions might suffer from this.

Issues?
- Under what conditions should backup take over
  - Revisits the consistency problem seen earlier with clients and servers
  - Could end up with a “split brain”
- Also notice that still needs 2PC to ensure that primary and backup stay in same states!

Split brain: reminder
[Figure labels: primary, backup, log]
Clients initially connected to primary, which keeps backup up to date. Backup follows log.

Split brain: reminder
[Figure labels: primary, backup]
Transient problem causes some links to break but not all. Backup thinks it is now primary, primary thinks backup is down.

Split brain: reminder
[Figure labels: primary, backup]
Some clients still connected to primary, but one has switched to backup and one is completely disconnected from both.

Implication?
- A strict interpretation of ACID leads to conclusions that
  - There are no ACID replication schemes that provide high availability
  - Most real systems solve by weakening ACID

Real systems
- They use primary-backup with
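
The two quorum conditions from the "Usual responses?" and "Quorum example" slides (Qu+Qr > # replicas and Qu+Qu > # replicas) can be checked mechanically. The following is a minimal sketch, not part of the original slides; the function names `valid_quorums` and `minimal_quorums` are hypothetical and chosen only for illustration.

```python
def valid_quorums(n):
    """Return every (Qu, Qr) pair with Qu + Qr > n and Qu + Qu > n."""
    pairs = []
    for qu in range(1, n + 1):
        for qr in range(1, n + 1):
            # Reads must overlap the last update, and any two updates must overlap.
            if qu + qr > n and qu + qu > n:
                pairs.append((qu, qr))
    return pairs


def minimal_quorums(n):
    """For each admissible Qu, keep the smallest Qr that still overlaps it."""
    best = {}
    for qu, qr in valid_quorums(n):
        best[qu] = min(qr, best.get(qu, qr))
    return sorted(best.items())


# Five replicas, as in the slide where X is replicated at {a,b,c,d,e}.
print(minimal_quorums(5))
```

For five replicas this leaves the three combinations the slide does not rule out on overlap grounds: (3, 3), (4, 2) and (5, 1). The slide's further objection to Qu = 5 (an update needs every replica to be reachable) is an availability argument that this overlap check does not model.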
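
The "Availability of 2PC" slide says that when the manager and one process fail, the remaining processes are stuck. A small sketch of why, assuming the usual cooperative termination rule in which survivors compare their local states; the `State` enum and `try_terminate` function are illustrative names, not taken from the slides.

```python
from enum import Enum


class State(Enum):
    INIT = "init"            # has not voted yet
    PREPARED = "prepared"    # voted yes, waiting for the outcome
    COMMITTED = "committed"
    ABORTED = "aborted"


def try_terminate(survivor_states):
    """Return the outcome the survivors can safely adopt, or None if blocked."""
    if State.COMMITTED in survivor_states:
        return State.COMMITTED   # someone already learned the outcome
    if State.ABORTED in survivor_states:
        return State.ABORTED
    if State.INIT in survivor_states:
        # A survivor never voted yes, so the manager cannot have committed.
        return State.ABORTED
    # Everyone is prepared: the failed process may have seen commit or abort,
    # so neither outcome is safe to choose.
    return None


# Manager and one of three processes fail; the other two are both prepared.
print(try_terminate([State.PREPARED, State.PREPARED]))
```

The blocked case is exactly the one on the slide: both survivors voted yes, and only the failed manager (or the failed process) knows whether the decision was commit or abort.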
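
The primary/backup slides note that the backup replays the primary's log but may have missed the last few updates when the primary crashes. A toy sketch of that behaviour, under the assumption that the log is shipped in batches; the `Primary` and `Backup` classes and their methods are hypothetical and only illustrate the idea.

```python
class Backup:
    def __init__(self):
        self.state = {}

    def replay(self, log_records):
        """Apply committed updates received from the primary, in log order."""
        for key, value in log_records:
            self.state[key] = value


class Primary:
    def __init__(self):
        self.state = {}
        self.log = []       # committed updates, in commit order
        self.shipped = 0    # length of the log prefix already sent to the backup

    def commit(self, key, value):
        self.state[key] = value
        self.log.append((key, value))

    def ship(self, backup):
        """Send the not-yet-shipped suffix of the log to the backup."""
        backup.replay(self.log[self.shipped:])
        self.shipped = len(self.log)


primary, backup = Primary(), Backup()
primary.commit("x", 1)
primary.ship(backup)     # backup is now up to date
primary.commit("x", 2)   # primary crashes before shipping this record
print(primary.state, backup.state)
```

After the simulated crash the backup still holds the older value of `x`, which is the "slightly stale" state the slides warn about when clients reconnect to the backup.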
