CS514: Intermediate Course in Operating Systems
Professor Ken Birman
Krzys Ostrowski: TA

Contents: Virtual Synchrony · Why "virtual" synchrony? · A synchronous execution · Virtual Synchrony at a glance · Slide 6 · Chances to "weaken" ordering · Causally ordered updates · In general? · Slide 10 · Correlated failures · Programming with groups · Embedding groups into "tools" · Distributed algorithms · Slides 15-16 · Distributed algorithms: Summary · More tools: fault-tolerance · Publish / Subscribe · Scalability warnings! · Publish / Subscribe issue? · Other "toolkit" ideas · Other similar ideas · Existing toolkits: challenges · Preserving order · The tradeoff · Solution used in Horus · Other toolkit "issues" · Features of major virtual synchrony platforms · Slides 30-31 · Horus/JGroups/Ensemble protocol stacks · JGroups (part of JBoss) · Spread Toolkit · Summary?

Virtual Synchrony
- A powerful programming model, called virtual synchrony
- It offers:
  - Process groups with state transfer, automated fault detection, and membership reporting
  - Ordered reliable multicast, in several flavors
  - Extremely good performance

Why "virtual" synchrony?
- What would a synchronous execution look like?
- In what ways is a "virtual" synchrony execution not the same thing?

A synchronous execution
- [Figure: processes p, q, r, s, t, u exchanging multicasts in lock-step rounds]
- With true synchrony, executions run in genuine lock-step.

Virtual Synchrony at a glance
- With virtual synchrony, executions only look "lock step" to the application
- [Figure: processes p, q, r, s, t, u]

Virtual Synchrony at a glance
- [Figure: processes p, q, r, s, t, u]
- We use the weakest (hence fastest) form of communication possible

Chances to "weaken" ordering
- Suppose that any conflicting updates are synchronized using some form of locking
- The multicast sender will have mutual exclusion
- Hence, simply because we used locks, cbcast delivers conflicting updates in the order they were performed!
- If our system ever does see concurrent multicasts… they must not have conflicted. So it won't matter if cbcast delivers them in different orders at different recipients!

Causally ordered updates
- Each thread corresponds to a different lock
- In effect: red "events" never conflict with green ones!
- [Figure: processes p, r, s, t with numbered update events in two colors]

In general?
- Replace "safe" (dynamic uniformity) with a standard multicast when possible
- Replace abcast with cbcast
- Replace cbcast with fbcast
- Unless replies are needed, don't wait for replies to a multicast

Why "virtual" synchrony?
- The user sees what looks like a synchronous execution
  - Simplifies the developer's task
- But the actual execution is rather concurrent and asynchronous
  - Maximizes performance
  - Reduces the risk that lock-step execution will trigger correlated failures

Correlated failures
- Why do we claim that virtual synchrony makes these less likely?
- Recall that many programs are buggy
  - Often these are Heisenbugs (order sensitive)
- With lock-step execution, each group member sees group events in identical order
  - So all die in unison
- With virtual synchrony, orders differ
  - So an order-sensitive bug might only kill one group member!

Programming with groups
- Many systems just have one group
  - E.g. replicated bank servers
  - The cluster mimics one highly reliable server
- But we can also use groups at finer granularity
  - E.g. to replicate a shared data structure
- Now one process might belong to many groups
  - A further reason that different processes might see different inputs and event orders

Embedding groups into "tools"
- We can design a groups API: pg_join(), pg_leave(), cbcast()…
- But we can also use groups to build other, higher-level mechanisms:
  - Distributed algorithms, like snapshot
  - Fault-tolerant request execution
  - Publish-subscribe

Distributed algorithms
- Processes that might participate join an appropriate group
- Now the group view gives a simple leader election rule
  - Everyone sees the same members, in the same order, ranked by when they joined
  - The leader can be, e.g., the "oldest" process

Distributed algorithms
- A group can easily solve consensus
- The leader multicasts: "what's your input?"
- All reply: "Mine is 0." "Mine is 1."
- The initiator picks the most common value and multicasts that: the "decision value"
- If the leader fails, the new leader just restarts the algorithm
- Puzzle: Does FLP apply here?

Distributed algorithms
- A group can easily run a consistent snapshot algorithm
- Either use cbcast throughout the system, or build the algorithm over gbcast
- Two phases:
  - Start snapshot: a first cbcast
  - Finished: a second cbcast; collect process states and channel logs

Distributed algorithms: Summary
- Leader election
- Consensus and other forms of agreement, like voting
- Snapshots, hence deadlock detection, auditing, load balancing

More tools: fault-tolerance
- Suppose that we want to offer clients "fault-tolerant request execution"
- We can replace a traditional service with a group of members
- Each request is assigned to a primary (ideally, spread the work around) and a backup
- The primary sends a "cc" of its response to the backup
- The backup keeps a copy of the request and steps in only if the primary crashes before replying
- Sometimes called "coordinator/cohort," just to distinguish it from "primary/backup"

Publish / Subscribe
- The goal is to support a simple API:
  - Publish("topic", message)
  - Subscribe("topic", event_handler)
- We can just create a group for each topic
  - Publish multicasts to the group
  - Subscribers are the members

Scalability warnings!
- Many existing group communication systems don't scale especially well
  - E.g. JGroups, Ensemble, Spread
- Group sizes are limited to perhaps 50-75 members
- And individual processes are limited to joining perhaps 50-75 groups (Spread: see next slide)
- Overheads soar as these sizes increase
  - Each group runs protocols oblivious of the others, and this creates huge inefficiency

Publish / Subscribe issue?
- We could have thousands of topics!
  - Too many to map topics directly to groups
- Instead, map topics to a smaller set of groups
  - The Spread system calls these "lightweight" groups
  - The mapping will result in inaccuracies… so filter incoming messages, discarding any not actually destined to the receiving process
- Cornell's new QuickSilver system will instead directly support immense numbers of groups

Other "toolkit" ideas
- We could embed group communication into a framework in a "transparent" way
  - Example: the CORBA fault-tolerance specification does lock-step replication of deterministic components
- The client simply can't see failures
- But
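The "lightweight groups" idea from the Publish / Subscribe slides can be sketched as a small simulation: many topics are hashed onto a few underlying transport groups, and each receiver filters out messages for topics it never subscribed to. The names here (Broker, Process, topic_to_group) are illustrative, not any toolkit's real API, and the in-process Broker merely stands in for a real multicast layer.

```python
import hashlib
from collections import defaultdict

N_GROUPS = 8  # illustrative: far fewer transport groups than topics

def topic_to_group(topic: str) -> int:
    """Hash a topic name onto one of a small, fixed set of groups."""
    return hashlib.md5(topic.encode()).digest()[0] % N_GROUPS

class Process:
    """A receiver that joins coarse transport groups but filters by topic."""
    def __init__(self):
        self.subscriptions = set()  # topics this process actually wants
        self.delivered = []

    def on_multicast(self, topic, message):
        # The transport group is coarser than the topic, so discard any
        # message not actually destined to this receiving process.
        if topic in self.subscriptions:
            self.delivered.append((topic, message))

class Broker:
    """Stand-in for the group-communication layer: one member set per group."""
    def __init__(self):
        self.members = defaultdict(set)  # group id -> set of processes

    def subscribe(self, proc, topic):
        proc.subscriptions.add(topic)
        self.members[topic_to_group(topic)].add(proc)

    def publish(self, topic, message):
        # Multicast to every member of the underlying transport group;
        # receivers filter, so collisions only cost wasted deliveries.
        for proc in self.members[topic_to_group(topic)]:
            proc.on_multicast(topic, message)
```

Even if two topics collide in the same transport group, a subscriber to only one of them delivers only its own messages; the cost of the inaccuracy is extra network traffic, not incorrect delivery.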
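The leader-election and consensus slides can likewise be sketched as a toy simulation: the group view ranks members by join order, so everyone agrees that the "oldest" member leads; the leader gathers inputs and multicasts the most common value, and if it fails the next-oldest member simply restarts the algorithm. GroupView, Member, and run_consensus are hypothetical names for this sketch, not a real toolkit API.

```python
from collections import Counter

class GroupView:
    """Members listed in the order they joined; every process is delivered
    the same view, so all processes agree on the ranking."""
    def __init__(self, members):
        self.members = list(members)

    def leader(self):
        # Simple election rule: the "oldest" member is the leader.
        return self.members[0]

    def drop(self, failed):
        # Membership change: a new view without the failed member.
        return GroupView(m for m in self.members if m is not failed)

class Member:
    def __init__(self, name, proposal):
        self.name = name
        self.proposal = proposal  # this member's input to consensus

def run_consensus(view):
    """Leader multicasts "what's your input?", collects the replies, and
    multicasts the most common value as the "decision value"."""
    replies = [m.proposal for m in view.members]
    return Counter(replies).most_common(1)[0][0]
```

If the leader crashes mid-run, the surviving members receive a new view, agree on the new oldest member, and rerun `run_consensus` on the new view, which is exactly the restart rule on the slide.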