8th IEEE International Symposium on Network Computing and Applications, Cambridge, MA, July 2009.

Maintaining Network QoS Across NIC Device Driver Failures Using Virtualization

Michael Le, Andrew Gallagher, Yuval Tamir          Yoshio Turner
Concurrent Systems Laboratory                      HP Laboratories
UCLA Computer Science Department                   Palo Alto, CA
{mvle,ajcg,tamir}@cs.ucla.edu                      [email protected]

Abstract—Device driver failures have been shown to be a major cause of system failures. Network services stress NIC device drivers, increasing the probability of NIC driver bugs being manifested as server failures. System virtualization is increasingly used for server consolidation and management. The isolated driver domain (IDD) architecture used by several virtual machine monitors, such as Xen, forms a natural foundation for making systems resilient to NIC driver failures. In order to realize this potential, recovery must be fast enough to maintain QoS for network services across NIC driver failures. We show that the standard Xen configuration, enhanced with simple detection and recovery mechanisms, cannot provide such QoS. However, with NIC drivers isolated in two virtual machines, in a primary/warm-spare configuration, the system can recover from an overwhelming majority of NIC driver failures in under 10ms.

I. Introduction

A significant fraction of bugs in operating systems are found in device drivers [3]. Thus, errors in drivers are a major cause of system failures [18]. A faulty device driver can cause the entire system to crash, hang, or exhibit arbitrary incorrect behavior. In order to improve the reliability of systems, drivers must be isolated, limiting their ability to corrupt other parts of the system [18]. Furthermore, the system must be able to detect erroneous driver behavior and recover by restoring a working driver. In current systems, since a faulty driver can corrupt the entire system, recovery is likely to require a complete system reboot as well as recovery of the application state.
For many applications, such as most network services, lengthy service interruption is unacceptable.

The resiliency of systems to driver failures can be improved by isolating drivers in light-weight domains [18] and by user-level drivers [9]. Isolating the device drivers from the kernel prevents buggy drivers from harming the kernel and crashing the system. Device driver recovery is done by restarting and re-attaching the device driver to the running kernel. These approaches require that the kernel and device drivers be modified, or that non-standard device drivers be used.

System virtualization [16] is now widely used in data centers to provide workload isolation and flexible management of consolidated servers [2]. Several virtual machine monitors (VMMs) use an isolated driver domain (IDD) architecture to virtualize I/O devices [4, 12, 14]. With the IDD architecture, unmodified commodity device drivers (e.g., NIC drivers) run in a different virtual machine (VM) from applications. The IDD architecture does not eliminate the ability of a malicious device driver to prevent correct execution; for example, a NIC driver can drop all packets. However, the IDD architecture has the potential to prevent most non-malicious device driver failures from corrupting other VMs [4]. Since system virtualization is commonly used in data centers for other reasons, there is strong motivation to utilize the IDD architecture for resiliency to driver failures without resorting to special or modified drivers.

Unfortunately, without additional mechanisms, virtualization utilizing the IDD architecture is not sufficient to allow applications to continue uninterrupted across driver failures. On the contrary, with virtualization, the effects of failed drivers are worse, since a single device driver failure can impact many VMs sharing the device. When drivers reside in a privileged VM, such as Dom0 in the Xen VMM, the entire virtualized system, including all the application VMs, must be restarted if the drivers crash the privileged VM.
Even in a configuration where device drivers reside in separate non-privileged VMs [4], failure of a driver VM causes all VMs sharing the device exported by the driver VM to stop working.

Table I. Impact of fault injection in NIC device driver.

  System configuration    #injections    % application failure
  Linux                   1987           66.0%
  Xen-base                2574           66.1%
  Xen-IDD                 2809           63.4%

To illustrate the point above regarding the IDD architecture, Table I shows the results from fault injection into a NIC device driver (see Sections III and IV for details). The application is a simple user-level "ping" program between a separate physical system and the target system. Results are shown for a target system that is Linux without virtualization, Linux in an application VM on a standard Xen configuration (Xen-base) where the driver is in the privileged VM, and Linux in an application VM on a Xen configuration with a separate driver VM (Xen-IDD). In all three cases a similar fraction of injected faults caused the application to fail.

The focus of this paper is on achieving resiliency to NIC driver failures in virtualized systems using the IDD architecture. With the Xen VMM, we describe and evaluate several mechanisms that provide detection and recovery from NIC driver failure. Our evaluation is based on injecting faults in the driver code and measuring network service interruption. We present a driver failure detection mechanism that is capable of detecting hangs as well as crashes. Our results show that simply rebooting the VM with the NIC driver when driver failure is detected results in recovery delays of multiple seconds, and thus cannot provide transparent recovery for network services.

We present a fast recovery mechanism based on maintaining two VMs with the NIC driver: the primary and a "warm spare." When driver failure is detected, recovery is performed by replacing the primary with the spare.
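A detection mechanism of the kind described above must distinguish two failure modes: the driver VM disappearing entirely (a crash) and the driver VM staying up while the driver stops making progress (a hang). The following is a minimal Python sketch of that idea, not the paper's actual implementation (which lives in the Xen VMM and privileged-VM scripts); all class, method, and parameter names here are illustrative assumptions.

```python
class DriverVMWatchdog:
    """Toy watchdog for a driver VM. The driver VM is assumed to
    advance a heartbeat counter periodically; a monitor in the
    privileged VM polls it and classifies the VM's state."""

    def __init__(self, timeout=0.05):
        self.timeout = timeout      # max silence (seconds) before declaring failure
        self.last_count = None      # last observed heartbeat counter value
        self.last_change = None     # time the counter last advanced
        self.alive = True           # whether the VM itself still exists

    def observe(self, count, now, vm_running=True):
        """Record one polling sample of the driver VM."""
        self.alive = vm_running
        if count != self.last_count:
            self.last_count = count
            self.last_change = now

    def status(self, now):
        """Classify the driver VM as ok, hung, or crashed."""
        if not self.alive:
            return "crashed"        # VM gone: trigger recovery immediately
        if self.last_change is not None and now - self.last_change > self.timeout:
            return "hung"           # VM up, but driver makes no progress
        return "ok"

# Example: heartbeats arrive, then the counter stops advancing.
w = DriverVMWatchdog(timeout=0.05)
w.observe(1, now=0.00)
w.observe(2, now=0.02)
assert w.status(now=0.04) == "ok"
w.observe(2, now=0.06)              # same counter value: no progress
assert w.status(now=0.10) == "hung"
```

A hang is reported only after the timeout elapses with no counter movement, so the timeout value directly bounds how quickly recovery can begin.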
While previous schemes reported recovery times on the order of a second [17] or hundreds of milliseconds [4], in the great majority of cases our scheme recovers in less than 10ms. Such fast recovery allows the system to meet QoS requirements for many network services. This is accomplished using unmodified application VMs, minimal modifications to the Xen VMM and driver VM kernel, and user-level scripts in the privileged VM. The scheme incurs no performance overhead and insignificant memory overhead during normal operation. While our evaluation is based mainly on the Intel Pro 100 100Mb NIC, we have also validated our mechanism using another 100Mb NIC and a 1Gb NIC. With 100Mb NICs, no device driver modifications were required.
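The primary/warm-spare scheme is fast because the spare driver VM is already booted, with the driver loaded, before any failure occurs; recovery only has to redirect the device and the application VMs' virtual interfaces to it, and a replacement spare can be prepared afterwards, off the critical path. The following Python sketch models that control flow under stated assumptions: the real system would invoke Xen toolstack operations where this model only appends to a log, and every name here is hypothetical rather than taken from the paper.

```python
class WarmSpareManager:
    """Toy model of primary/warm-spare driver-VM recovery.
    Actions that would be privileged-VM operations in the real
    system (attaching the NIC, reconnecting backends, destroying
    and booting VMs) are stubbed out as log entries."""

    def __init__(self):
        self.primary = "driverVM-0"
        self.spare = "driverVM-1"   # pre-booted, driver loaded, NIC not attached
        self.next_id = 2
        self.log = []

    def recover(self):
        """Handle a detected failure of the primary driver VM."""
        failed = self.primary
        # Critical path: promote the warm spare by giving it the NIC
        # and reconnecting the application VMs' virtual interfaces.
        self.log.append(f"attach NIC to {self.spare}")
        self.log.append(f"reconnect app VM vifs to {self.spare}")
        self.primary, self.spare = self.spare, None
        # Off the critical path: discard the failed VM and boot a
        # fresh spare so the next failure can also be survived.
        self.log.append(f"destroy {failed}")
        self.spare = f"driverVM-{self.next_id}"
        self.next_id += 1
        self.log.append(f"boot new spare {self.spare}")
        return self.primary

m = WarmSpareManager()
assert m.recover() == "driverVM-1"   # spare becomes the new primary
assert m.spare == "driverVM-2"       # replacement spare is prepared
```

Only the first two log entries sit on the recovery critical path, which is what makes sub-10ms recovery plausible in this design: the expensive work (booting a VM) happens before and after the failure, not during it.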