Princeton COS 592 - Operating System Failures


Florin Sultan, Aniruddha Bohra, Stephen Smaldone, and Yufei Pan, Rutgers University
Pascal Gallard, IRISA/INRIA
Iulian Neamtiu, University of Maryland, College Park
Liviu Iftode, Rutgers University

Current Internet service architectures lack support for salvaging stateful client sessions when the underlying operating system fails due to hangs, crashes, deadlocks, or panics. The Backdoors (BD) system is designed to detect such failures and recover service sessions in clusters of Internet servers by extracting lightweight state associated with client service sessions from server memory. The BD architecture combines hardware and software mechanisms to enable accurate monitoring and remote healing actions, even in the presence of failures that render a system unavailable.

Critical Internet services such as e-commerce, online auctions, and banking run on complex, multi-tier architectures built with commodity (off-the-shelf) machines and operating systems. These stateful services are sensitive to server failures: active client sessions on these servers are lost, although the state associated with them might still be intact in a failed machine’s memory.

We developed a recovery approach that exploits hardware and software redundancy in Internet service installations to reuse active clients’ session state after OS failures (http://discolab.rutgers.edu/bda). Our lightweight, application-independent system provides both failure detection and recovery, for use with complex, multi-tier Internet services. The core of the system is the novel Backdoors (BD) architecture [1], which uses commodity programmable network interface cards (NICs) with specialized firmware and OS extensions to provide remote access to lightweight application and OS state in a machine’s memory without relying on its OS or processors. Using BD, machines in an Internet server cluster can cooperatively observe each other’s health, detect failures, and take over client sessions from failed nodes.
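The takeover idea described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the BD implementation: the record layout, the `NodeMemory` class, and the `remote_read` and `take_over` helpers are all hypothetical, and an in-process byte buffer stands in for memory reached through a programmable NIC.

```python
import struct

# Hypothetical fixed-size session record: session id, client IPv4, bytes served.
RECORD = struct.Struct("<IIQ")
HEADER = struct.Struct("<I")  # number of records stored in the region


class NodeMemory:
    """Stands in for a server's memory as seen through the NIC (assumption)."""

    def __init__(self, size=4096):
        self.buf = bytearray(size)

    def export_sessions(self, sessions):
        """Run while the OS is healthy: publish session state at offset 0."""
        HEADER.pack_into(self.buf, 0, len(sessions))
        for i, rec in enumerate(sessions):
            RECORD.pack_into(self.buf, HEADER.size + i * RECORD.size, *rec)


def remote_read(memory, offset, length):
    """Simulated remote read: copies bytes without calling any node code."""
    return bytes(memory.buf[offset:offset + length])


def take_over(failed_memory):
    """Run by a healthy peer: pull session records out of the failed node."""
    (count,) = HEADER.unpack(remote_read(failed_memory, 0, HEADER.size))
    raw = remote_read(failed_memory, HEADER.size, count * RECORD.size)
    return [RECORD.unpack_from(raw, i * RECORD.size) for i in range(count)]


node_a = NodeMemory()
node_a.export_sessions([(1, 0x0A000001, 5120), (2, 0x0A000002, 0)])
# Suppose node_a's OS hangs here; its memory contents are assumed intact.
recovered = take_over(node_a)
print(recovered)
```

The point of the sketch is that `take_over` only reads bytes at a known location. Nothing on the failed node has to execute for the peer to obtain the session records.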
In this article, we describe the BD architecture and our OS extensions for monitoring and recovery of service sessions. We have implemented a prototype in the FreeBSD 4.8 kernel, using Myrinet Lanai-XP programmable NICs (www.myri.com). The results from our experiments with the Rice University Bidding System (Rubis; http://rubis.objectweb.org), a cluster-based multi-tier Internet auction service modeled after eBay, indicate that our approach is nonintrusive and effective. Indeed, the system can recover all service sessions from failed nodes in both the front-end and middle tier within 25 milliseconds.

Recovery-Oriented Computing
Recovering Internet Service Sessions from Operating System Failures
IEEE Internet Computing, March/April 2005, page 24. Published by the IEEE Computer Society. 1089-7801/05/$20.00 © 2005 IEEE.

Motivation and Approach

Today’s Internet services employ servers organized in clustered multi-tier architectures in which multiple nodes perform processing for a given client session:

• front-end nodes handle HTTP requests,
• mid-tier nodes implement application logic, and
• back-end nodes run database servers.

Machines in all tiers run commodity, general-purpose OSs, which typically cannot tolerate failures caused by OS bugs or misconfiguration.

An OS failure renders an entire system unusable because applications depend on core OS services for memory allocation, process management, and I/O. For noncritical or stateless services, rebooting would be sufficient for recovery. Moreover, if the service is idempotent (generating the same outcome in response to multiple copies of a given request), clients can recover by simply reissuing their requests.
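The idempotence property mentioned above can be made concrete with a minimal sketch. The two auction operations below are hypothetical and are not taken from Rubis. The example models a request that was already applied once before a failure and is then reissued by the client.

```python
def set_bid(auction, item, amount):
    """Idempotent: repeating the request leaves the same state."""
    auction[item] = amount
    return auction[item]


def raise_bid(auction, item, delta):
    """Not idempotent: every repeated copy changes the state again."""
    auction[item] = auction.get(item, 0) + delta
    return auction[item]


def send_copies(handler, auction, item, value, copies=2):
    """Apply the same request several times, as a reissuing client would."""
    result = None
    for _ in range(copies):
        result = handler(auction, item, value)
    return result


safe = send_copies(set_bid, {}, "lamp", 50)
unsafe = send_copies(raise_bid, {"lamp": 40}, "lamp", 10)
print(safe, unsafe)
```

With `set_bid`, the duplicate copy is harmless. With `raise_bid`, the duplicate applies the increment twice, which is why blind reissue is only safe for idempotent services.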
Yet, the reboot approach presents at least two problems:

• reboots are destructive to currently executing transactions, forcing the clients to reissue them, and
• reboots are disruptive, incurring downtime for both the service provider and clients.

While most applications and their clients can tolerate the side effects of a reboot, such an approach can be unacceptable for critical, transaction-oriented services. Depending on a server’s load-balancing and admission policies, clients are not guaranteed readmission to resume their sessions. In addition, the service might provide guarantees that include uninterrupted delivery, at least to the extent the network permits.

We designed the BD architecture to support remote healing [1, 2], using remote memory access to detect failures and perform automated recovery actions. BD relies on a specialized network interface that allows external access to a computer’s resources (memory, I/O devices, and so on) without involving its processors or OS. This allows the recovery actions to be deferred until after the failure has occurred, adds only negligible overhead during normal (failure-free) operation, and provides fast recovery. (For an overview, see “The Backdoors Architecture” sidebar.)

Remote Recovery with Backdoors

Remote recovery with BD addresses system-hang failures, in which a server cannot execute useful work because the OS is unresponsive. Such failures are caused by faulty OSs, for example due to drivers that leave interrupts disabled, deadlocks, or misplaced panics.
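One plausible way to detect a system hang from outside the machine is to sample a progress counter and raise an alarm when it stops advancing. The sketch below assumes such a counter exists and is readable remotely. The text above does not say which signal BD actually monitors, so the `HangDetector` class and its `missed_limit` parameter are hypothetical.

```python
class HangDetector:
    """Declares a hang once a remotely read counter stops advancing.

    Assumption: the monitored OS bumps a counter at a known memory location
    while it is making progress, and a monitor samples it periodically.
    """

    def __init__(self, missed_limit=3):
        self.missed_limit = missed_limit  # stale samples tolerated before alarm
        self.last_value = None
        self.missed = 0

    def observe(self, counter_value):
        """Feed one sample; return True when the node is considered hung."""
        if counter_value != self.last_value:
            self.last_value = counter_value
            self.missed = 0
        else:
            self.missed += 1
        return self.missed >= self.missed_limit


detector = HangDetector(missed_limit=3)
samples = [10, 11, 12, 12, 12, 12]  # counter freezes at 12 when the OS hangs
verdicts = [detector.observe(v) for v in samples]
print(verdicts)
```

Requiring several consecutive stale samples trades detection latency for fewer false alarms on a node that is merely slow. The right threshold depends on the sampling interval, which is not specified here.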
Our goal is to reliably detect ...

The Backdoors Architecture

According to the Telecom Glossary 2000 ANSI standard, a backdoor is “a hidden software or hardware mechanism, usually created for testing and troubleshooting.” [1] The Backdoors architecture goes a step further by providing an alternate path into a system to enable automated remote healing (recovery or repair) operations.

To implement BD, we use intelligent (programmable) network interface cards (I-NICs) with remote direct memory access (RDMA) capability, which allows a machine to access another’s memory for reading and writing without involving its processor(s). RDMA makes BD nonintrusive to system activity during normal operation, and robust to OS failures.

Figure A shows how our model partitions a computer between front-door components, which are under the OS’s control, run OS or application code, and interact with the outside world, and backdoor components, which are involved in monitoring and recovery operations. Our crucial assumptions are that the backdoor hardware remains available after OS failures and memory contents remain valid and accessible over the system bus.

Reference
1. Telecom Glossary 2000 T1.523-2001, Am. Nat’l Standard Institute, 2001;

