Scalable Applications and Real Time Response
Ashish Motivala
CS 614
April 17th 2001

Using Group Communication Technology to Implement a Reliable and Scalable Distributed IN Coprocessor; Roy Friedman and Ken Birman; TINA 1996.
Manageability, availability and performance in Porcupine: a highly scalable, cluster-based mail service; Yasushi Saito, Brian N. Bershad and Henry M. Levy; Proceedings of the 17th ACM Symposium on Operating Systems Principles, 1999, pages 1–15.

Real-time
Two categories of real-time
  – When an action needs to be predictably fast, i.e. critical applications.
  – When an action must be taken before a time limit passes.
More often than not real-time doesn’t mean “as fast as possible” but means “slow and steady”.

Real problems need real-time
Air Traffic Control, Free Flight
  – when planes are at various locations.
Medical Monitoring, Remote Tele-surgery
  – doctors talk about how patients responded after drug was given, or change therapy after some amount of time.
Process control software, Robot actions
  – a process controller runs factory floors by coordinating machine tools activities.

More real-time problems
Video and multi-media systems
  – synchronous communication protocols that coordinate video, voice, and other data sources
Telecommunications systems
  – guarantee real-time response despite failures, for example when switching telephone calls

Predictability
If this is our goal…
  – Any well-behaved mechanism may be adequate
  – But we should be careful about uncommon disruptive cases
    • For example, cost of failure handling is often overlooked
    • Risk is that an infrequent scenario will be very costly when it occurs

Predictability: Examples
Probabilistic multicast protocol
  – Very predictable if our desired latencies are larger than the expected convergence
  – Much less so if we seek latencies that bring us close to the expected latency of the protocol itself

Back to the paper
Telephone networks need a mixture of properties
  – Real-time response
  – High performance
  – Stable behavior even when failures and recoveries occur
Can we use our tools to solve such a problem?

Role of coprocessor
A simple database
  – Switch does a query
    • How should I route a call to 1800-327-2777 from 607-266-8141?
    • Reply: use output line 6
  – Time limit of 100ms on transaction
Call ID, call conferencing, automatic transferring, voice menus, etc
Update database

IN coprocessor
[Figure labels: IN coprocessor; SS7 switch (x4)]

IN coprocessor
[Figure labels: IN coprocessor; SS7 switch (x4); coprocessor (x4)]

Present coprocessor
Right now, people use hardware fault-tolerant machines for this
  – E.g. Stratus “pair and a spare”
  – Mimics one computer but tolerates hardware failures
  – Performance an issue?

Goals for coprocessor
Requirements
  – Scalability: ability to use a cluster of machines for the same task, with better performance when we use more nodes
  – Fault-tolerance: a crash or recovery shouldn’t disrupt the system
  – Real-time response: must satisfy the 100ms limit at all times
Downtime: any period when a series of requests might all be rejected
Desired: 7 to 9 nines availability

SS7 experiment
Horus runs the “800 number database” on a cluster of processors next to the switch
Provide replication management tools
Provide failure detection and automatic configuration

IN coprocessor example
[Figure labels: SS7 switch; EA (x2); annotations below]
Query Element (QE) processors do the number lookup (in-memory database).
Goals: scalable memory without loss of processing performance as number of nodes is increased
Switch itself asks for help when remote number call is sensed
External adaptor (EA) processors run the query protocol
Primary backup scheme adapted (using small Horus process groups) to provide fault-tolerance with real-time guarantees

Options?
A simple scheme:
  – Organize nodes as groups of 2 processes
  – Use virtual synchrony multicast
    • For query
    • For response
    • Also for updates and membership tracking

IN coprocessor example
[Figure labels, repeated on each step slide: SS7 switch; EA (x2); “Think” (x2) on Step 3]
Step 1: Switch sees incoming request
Step 2: Switch waits while EA procs. multicast request to group of query elements (“partitioned” database)
Step 3: The query elements do the query in duplicate
Step 4: They reply to the group of EA processes
Step 5: EA processes reply to switch, which routes call

Results!!
Terrible performance!
  – Solution has 2 Horus multicasts on each critical path
  – Experience: about 600 queries per second but no more
Also: slow to handle failures
  – Freezes for as long as 6 seconds
Performance doesn’t improve much with scale either

Next try
Consider taking Horus off the critical path
Idea is to continue using Horus
  – It manages groups
  – And we use it for updates to the database and for partitioning the QE set
But no multicasts on critical path
  – Instead use a hand-coded scheme
Use Sender Ordering (or FIFO) instead of Total Ordering

Hand-coded scheme
Queue up a set of requests from an EA to a QE
Periodically (15 ms), sweep the set into a message and send as a batch
Process queries also as a batch
Send the batch of replies back to EA

Clever twists
Split into a
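The hand-coded scheme listed above (queue requests from an EA to a QE, sweep the queue every 15 ms, process and reply in batches) can be illustrated with a small sketch. This is not the Horus-based implementation from the paper: the class names `ExternalAdaptor` and `QueryElement`, the request-id bookkeeping, and the in-process call standing in for a network message are all assumptions made for illustration. The timer is also left out; `sweep()` stands for what would run once per 15 ms period.

```python
from collections import deque

SWEEP_INTERVAL_MS = 15  # sweep period named on the slide


class QueryElement:
    """Hypothetical QE: answers a whole batch of lookups from an in-memory table."""

    def __init__(self, table):
        self.table = table

    def process_batch(self, batch):
        # Each request is (request_id, number). The reply keeps the id so the
        # EA can match answers to the calls that asked for them.
        return [(req_id, self.table.get(number)) for req_id, number in batch]


class ExternalAdaptor:
    """Hypothetical EA: queues requests and ships them to one QE in batches."""

    def __init__(self, qe):
        self.qe = qe
        self.pending = deque()
        self.next_id = 0

    def submit(self, number):
        # Queue the request instead of sending it immediately.
        req_id = self.next_id
        self.next_id += 1
        self.pending.append((req_id, number))
        return req_id

    def sweep(self):
        # Meant to run once per SWEEP_INTERVAL_MS: everything queued since the
        # last sweep goes out as one batch and comes back as one batch.
        if not self.pending:
            return {}
        batch = list(self.pending)
        self.pending.clear()
        replies = self.qe.process_batch(batch)  # stands in for one message each way
        return dict(replies)


qe = QueryElement({"1800-327-2777": "output line 6"})
ea = ExternalAdaptor(qe)
first = ea.submit("1800-327-2777")
second = ea.submit("1800-000-0000")  # made-up number with no table entry
replies = ea.sweep()
```

In this sketch a request waits in the queue for at most one sweep period before it is sent, so batching adds up to 15 ms of delay per request in exchange for fewer, larger messages. Whether that trade fits inside the 100ms limit depends on network and processing time, which the sketch does not model.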