CS514: Intermediate Course in Operating Systems
Professor Ken Birman
Vivek Vishnumurthy: TA

Real-world time-critical systems
  The challenge: Suppose I need to build a rapidly responsive system
    I want to handle large scale
    I plan to use a modular architecture
  Can this be done in a web services setting?

A “system of systems”
  We use the term “system of systems” or SoS to capture this concept
  Examples will help clarify the idea
  Basic structure:
  [Figure: one front end connected to three back ends]

A “system of systems”
  Or might interconnect systems at different data centers to give a reasonably integrated “picture”
  [Figure: two data centers, each with one front end and three back ends]

Examples: Amazon
  Amazon would often use the front end to build a web page for a user
  The back-end systems fill in content
    Product popularity
    Current inventory
    Great deals on related products
    Products other people who did a similar search ultimately purchased…

Why is this “time critical”?
  Amazon is graded by quick accurate response
    Good grade: You buy the book
    Bad grade: You use Google and shop elsewhere
  For Amazon’s line of business, this SoS configuration is as critical as it gets!

Akamai
  Corporate site controls a large number of satellite systems
  Goal: Move content to be close to users who are likely to access that content
  Time critical aspect: Akamai is paid by hosts seeking to ensure snappy load times for their web sites

Military example
  Team comes under fire, calls for help
  Commander needs to know
    What resources are available?
    What’s the terrain?
    Where have enemy forces been seen?
    Is there an evacuation option?
  … and needs a fast response

Air Traffic Control Example
  New radar ping detected
  Track formation system should fit this to existing tracks (or create a new one)
  Flight plan lookup should check for known aircraft that might match this track
  Warnings system should check for proximity rules
  Long-term planner should schedule a landing slot

Air Traffic Control Example
  Also see issues from controller to controller
    When A hands off to B, need to ensure continuous coverage
  And when centers talk to each other
    France has 5 ATC centers… Europe has hundreds…

Issues?
  Let’s focus on scaling
  Scalability allows us to handle more load and also provides fault-tolerance
    Each service becomes a replicated group of servers that cooperate
    They replicate data by multicasting updates
    And the reads are load-balanced
  Issues specific to time-criticality?

Tempest
  Start with a standard web services application
    Perhaps, builds web pages for air traffic controller
  [Figure: a WS front-end connected to Services]
  Castor 4/07

Tempest
  We’ll scale it out by replicating the components
  … and automate management, repair, adaptation even when faults occur
  [Figure: replicated WS front-ends connected to replicated Services]

Tempest
  Then interconnect data centers
  [Figure: three data centers, each with replicated WS front-ends and replicated Services]

How to solve such problems?
Tools in our toolkit
  UDP multicast: very fast, unreliable
  RON: routes around problems, unreliable
  BitTorrent: receivers cooperate to offload work from the sender
  Virtual synchrony: strong consistency
  Quorums: even stronger (but slower)
  CASD or Ricochet: real-time multicast

Too many choices! Need to ask
  How strong does the consistency property need to be for the application of interest?
  How harsh is the runtime environment?
  How critical is timing?
  Is the system “safe” if the primitive is unreliable?

How would Amazon answer?
  To guarantee fast response, they bought lots of hardware
    … now they damn well expect speedups!
  Selling a book that is actually out of stock isn’t a disaster
  Fast matters more than “real time” of the provable, conservative kind

Best technology for Amazon?
  Probably something like Ricochet would work best for them
    Gets the update through FAST
    Uses pro-active FEC to recover from likely patterns of loss
    Background gossip mechanism repairs any losses not caught by FEC
  How might inconsistency “look” to users?

Consistency in Tempest
  Recall that transactional services offer a strong data consistency model
    each read operation returns the result of the latest write
  Tempest implements a weaker model called sequential consistency
    every replica sees the operations on the same data item in the same order
    order may be different than the order updates were issued

Tempest Collections
  Persistent service state = collection of objects
  Each object (obj) is naturally represented by the tuple 〈Hist_obj, Pending_obj〉
    Hist is the state of the object
      current value or list of updates
    Pending is the set of updates that cannot be applied yet
      applied when ordering consistent across …

A Tempest Service
  [Figure: three replicas, each holding a TempestCollection]
  Replica 1
    Hist = A = sell(“IBM”, 108), B = sell(“IBM”, 163), C = buy(“IBM”, 32)
    Pending = { F = sell(“IBM”, 81), E = sell(“IBM”, 76) }
  Replica 2
    Hist = A = sell(“IBM”, 108), B = sell(“IBM”, 163)
    Pending = { C = buy(“IBM”, 32), D = buy(“IBM”, 53), E = sell(“IBM”, 76) }
  Replica 3
    Hist = A = sell(“IBM”, 108), B = sell(“IBM”, 163), C = buy(“IBM”, 32)
    Pending = { G = buy(“IBM”, 10) }

Two level implementation
  To do a read, load-balance on some randomly picked component
    Access the persistent state of the collection
  To do a write
    Multicast the …
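
Sketch: Hist and Pending in code
  The 〈Hist, Pending〉 tuple from the Tempest Collections slide can be illustrated with a small Python sketch. This is not Tempest's implementation. It assumes a hypothetical sequence number (seq) on every update as a stand-in for whatever mechanism the replicas actually use to agree on an order, and the names Update, ObjectState, receive, deliver and names are invented for this example. An update sits in Pending until all of its predecessors are in Hist, so every replica applies updates in the same order even when they arrive in different orders.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Update:
    seq: int      # ASSUMPTION: agreed position in the global order, starting at 0
    name: str     # label used on the slide, e.g. "A"
    op: str       # "buy" or "sell"
    symbol: str
    amount: int


@dataclass
class ObjectState:
    hist: list = field(default_factory=list)     # updates already applied, in agreed order
    pending: dict = field(default_factory=dict)  # seq -> update that cannot be applied yet

    def receive(self, update):
        """Buffer an update, then apply every update whose predecessors are all in hist."""
        if update.seq < len(self.hist) or update.seq in self.pending:
            return  # duplicate delivery, ignore it
        self.pending[update.seq] = update
        # len(self.hist) is the seq of the next update this replica may apply.
        while len(self.hist) in self.pending:
            self.hist.append(self.pending.pop(len(self.hist)))


def deliver(arrival_order, replica=None):
    """Feed updates to a replica in the order they happen to arrive."""
    replica = replica if replica is not None else ObjectState()
    for update in arrival_order:
        replica.receive(update)
    return replica


def names(updates):
    return [u.name for u in updates]


A = Update(0, "A", "sell", "IBM", 108)
B = Update(1, "B", "sell", "IBM", 163)
C = Update(2, "C", "buy", "IBM", 32)
D = Update(3, "D", "buy", "IBM", 53)
E = Update(4, "E", "sell", "IBM", 76)

replica1 = deliver([A, B, C, E])   # E arrives before D, so E must wait
replica2 = deliver([B, E, A])      # out-of-order arrival, C and D still missing

print(names(replica1.hist), sorted(replica1.pending))
print(names(replica2.hist), sorted(replica2.pending))

deliver([D, C], replica2)          # late arrivals unblock the rest
print(names(replica2.hist), sorted(replica2.pending))
```

  In this sketch the Hist of one replica is always a prefix of the Hist of any replica that is further ahead, which is the "same order at every replica" property the Consistency in Tempest slide describes. Reads served from a lagging replica may return older state, which is one way inconsistency could "look" to users.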