Slide titles:
CS514: Intermediate Course in Operating Systems
Real-world time-critical systems
A “system of systems”
Slide 4
Examples: Amazon
Why is this “time critical”?
Akamai
Military example
Air Traffic Control Example
Slide 10
Issues? Let’s focus on scaling
Tempest
Slide 13
Slide 14
How to solve such problems?
Too many choices!
How would Amazon answer?
Best technology for Amazon?
Consistency in Tempest
Tempest Collections
Slide 21
Two level implementation
Evaluation
Slide 24
Experiment
Performance
Delay to order pending updates
Recovery under load
Services characteristics
PetStore
Slide 31
Summary
What would an Air Traffic System want?
Replicated components
Choice we saw last time
More choices
How would we pick?
Picking between Paxos and Vsync
More practical questions
Challenges of request duplication
Then….
Raises a question
Generalized question
Slide 44

CS514: Intermediate Course in Operating Systems
Professor Ken Birman
Vivek Vishnumurthy: TA

Real-world time-critical systems
  - The challenge:
    - Suppose I need to build a rapidly responsive system
    - I want to handle large scale
    - I plan to use a modular architecture
  - Can this be done in a web services setting?

A “system of systems”
  - We use the term “system of systems” or SoS to capture this concept
  - Examples will help clarify the idea
  - Basic structure:
  [Diagram: FrontEnd; Back end ×3]

A “system of systems”
  - Or might interconnect systems at different data centers to give a reasonably integrated “picture”
  [Diagram: two groups, each FrontEnd; Back end ×3]

Examples: Amazon
  - Amazon would often use the front end to build a web page for a user
  - The back-end systems fill in content
    - Product popularity
    - Current inventory
    - Great deals on related products
    - Products other people who did a similar search ultimately purchased
    - …

Why is this “time critical”?
  - Amazon is graded by quick accurate response
    - Good grade: You buy the book
    - Bad grade: You use Google and shop elsewhere
  - For Amazon’s line of business, this SoS configuration is as critical as it gets!

Akamai
  - Corporate site controls a large number of satellite systems
  - Goal: Move content to be close to users who are likely to access that content
  - Time critical aspect: Akamai is paid by hosts seeking to ensure snappy load times for their web sites

Military example
  - Team comes under fire, calls for help
  - Commander needs to know
    - What resources are available?
    - What’s the terrain?
    - Where have enemy forces been seen?
    - Is there an evacuation option?
  - … and needs a fast response

Air Traffic Control Example
  - New radar ping detected
  - Track formation system should fit this to existing tracks (or create a new one)
  - Flight plan lookup should check for known aircraft that might match this track
  - Warnings system should check for proximity rules
  - Long term planner should schedule a landing slot

Air Traffic Control Example
  - Also see issues from controller to controller
    - When A hands off to B need to ensure continuous coverage
  - And when centers talk to each other
    - France has 5 ATC centers… Europe has hundreds…

Issues? Let’s focus on scaling
  - Scalability allows us to handle more load and also provides fault-tolerance
    - Each service becomes a replicated group of servers that cooperate
    - They replicate data by multicasting updates
    - And the reads are load-balanced
  - Issues are specific to time-criticality?

Castor 4/07

Tempest
  - Start with a standard web services application
  - Perhaps, builds web pages for air traffic controller
  [Diagram: WS front-end; Services ×3]

Tempest
  - We’ll scale it out by replicating the components
  - … and automate management, repair, adaptation even when faults occur
  [Diagram: multiple WS front-end boxes and multiple Services boxes]

Tempest
  - Then interconnect data centers
  [Diagram: several groups of WS front-end and Services boxes]

How to solve such problems?
  - Tools in our toolkit
    - UDP multicast – very fast, unreliable
    - RON – routes around problems, unreliable
    - BitTorrent – receivers cooperate to offload work from the sender
    - Virtual synchrony – strong consistency
    - Quorums – even stronger (but slower)
    - CASD or Ricochet: real-time multicast

Too many choices!
  - Need to ask
    - How strong does the consistency property need to be for the application of interest?
    - How harsh is the runtime environment?
    - How critical is timing?
    - Is the system “safe” if the primitive is unreliable?

How would Amazon answer?
  - To guarantee fast response, they bought lots of hardware… now they damn well expect speedups!
  - Selling a book that is actually out of stock isn’t a disaster
  - Fast matters more than “real time” of the provable, conservative kind

Best technology for Amazon?
  - Probably something like Ricochet would work best for them
    - Gets the update through FAST
    - Uses pro-active FEC to recover from likely patterns of loss
    - Background gossip mechanism repairs any losses not caught by FEC
  - How might inconsistency “look” to users?

Consistency in Tempest
  - Recall that transactional services offer strong data consistency model
    - each read operation returns the result of the latest write
  - Tempest implements a weaker model called sequential consistency
    - every replica sees the operations on the same data item in the same order
    - order may be different than the order updates were issued

Tempest Collections
  - Persistent service state = collection of objects
  - Each object (obj) is naturally represented by the tuple 〈Hist_obj, Pending_obj〉
    - Hist is the state of the object
      - current value or list of updates
    - Pending is the set of updates that cannot be applied yet
      - applied when ordering consistent across
  [Figure: A Tempest Service]
    Hist = A = sell(“IBM”, 108), B = sell(“IBM”, 163), C = buy(“IBM”, 32); Pending = { F = sell(“IBM”, 81), E = sell(“IBM”, 76) }  (Tempest collection, replica 1)
    Hist = A = sell(“IBM”, 108), B = sell(“IBM”, 163); Pending = { C = buy(“IBM”, 32), D = buy(“IBM”, 53), E = sell(“IBM”, 76) }  (Tempest collection, replica 2)
    A =
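
The 〈Hist, Pending〉 pair from the Tempest Collections slide can be made concrete with a small sketch. This is not Tempest's implementation: the class name `CollectionObject`, the methods `receive` and `apply_order`, and the idea that replicas are handed an `agreed_order` list are all assumptions made for illustration, since the slides shown here do not say how the order is agreed. The sketch only shows the sequential-consistency property itself: each replica moves updates from Pending to Hist in the same shared order, and never skips ahead past an update it has not yet received. The starting state copies the two replicas in the figure.

```python
from dataclasses import dataclass, field


@dataclass
class CollectionObject:
    """One replica's copy of an object: the pair (Hist, Pending)."""
    hist: list = field(default_factory=list)     # (id, update) pairs, in applied order
    pending: dict = field(default_factory=dict)  # id -> update, not yet ordered

    def receive(self, update_id, update):
        """Buffer an update until its position in the shared order is known."""
        applied = {uid for uid, _ in self.hist}
        if update_id not in applied and update_id not in self.pending:
            self.pending[update_id] = update

    def apply_order(self, agreed_order):
        """Apply pending updates following an order shared by all replicas.

        How the order is agreed is outside this sketch (hypothetical input).
        """
        applied = {uid for uid, _ in self.hist}
        for uid in agreed_order:
            if uid in applied:
                continue
            if uid not in self.pending:
                break  # a gap: applying later updates would break the shared order
            self.hist.append((uid, self.pending.pop(uid)))


# Updates taken from the figure on the slide.
U = {
    "A": ("sell", "IBM", 108), "B": ("sell", "IBM", 163),
    "C": ("buy", "IBM", 32), "D": ("buy", "IBM", 53),
    "E": ("sell", "IBM", 76), "F": ("sell", "IBM", 81),
}

r1 = CollectionObject(hist=[(k, U[k]) for k in "ABC"],
                      pending={k: U[k] for k in "FE"})
r2 = CollectionObject(hist=[(k, U[k]) for k in "AB"],
                      pending={k: U[k] for k in "CDE"})

order = ["C", "D", "E", "F"]  # hypothetical agreed order for the unapplied updates

r1.apply_order(order)  # replica 1 is missing D, so E and F stay pending
r2.apply_order(order)  # replica 2 applies C, D, E, then waits for F

r1.receive("D", U["D"])  # the missing updates eventually arrive
r2.receive("F", U["F"])
r1.apply_order(order)
r2.apply_order(order)

print([uid for uid, _ in r1.hist])
print([uid for uid, _ in r2.hist])
```

Stopping at the first gap is the design choice that matters here: a replica that applied E before D would be responsive sooner, but its Hist would no longer match the order seen by the other replicas.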