Hypervisor-based Fault-tolerance

Thomas C. Bressoud
Isis Distributed Systems
55 Fairbanks Blvd.
Marlborough, MA 01752

Fred B. Schneider
Computer Science Department
Cornell University
Ithaca, New York 14853

Abstract

Protocols to implement a fault-tolerant computing system are described. These protocols augment the hypervisor of a virtual-machine manager and coordinate a primary virtual machine with its backup. The result is a fault-tolerant computing system. No modification to hardware, operating system, or application programs is required. A prototype system was constructed for HP's PA-RISC instruction-set architecture. The prototype was able to run programs about a factor of 2 slower than a bare machine would.

This material is based on work supported in part by the Office of Naval Research under contract N00014-91-J-1219, ARPA/NSF Grant No. CCR-9014363, NASA/ARPA grant NAG-2-893, and AFOSR grant F49620-94-1-0198. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not reflect the views of these agencies.

Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.
SIGOPS '95 12/95 CO, USA
(c) 1995 ACM 0-89791-715-4/95/0012 $3.50

1 Introduction

One popular scheme for implementing fault tolerance involves replicating a computation on processors that fail independently. Replicas are coordinated so that they execute the same sequence of instructions and produce the same results. This paper describes a novel implementation of that scheme. We interpose a software layer between the hardware and the operating system. The result is a fault-tolerant computing system whose implementation did not require modifications to hardware, to the operating system, nor to any application software.

The benefits of our approach concern engineering and time-to-market costs. We are driven by two observations. First, for a given instruction-set architecture, a manufacturer typically will build a series of realizations, where cost/performance improves over the series. Second, implementing replica coordination is subtle, whether done by hardware or software. Given these observations, we note the following three problems.

(1) When replica coordination is implemented in hardware, a design cost is incurred for each new realization of the architecture. Because designing replica-coordination hardware takes time, support for fault tolerance necessarily lags behind the cost/performance curve.

(2) Adding replica coordination to an existing operating system is bound to be difficult, since mature operating systems are invariably complicated. In addition, modifications must be devised for every operating system supported by a given platform.

(3) If replica coordination is left to the application programmer, then the same problems must be solved by the programmers of every application. Moreover, all of these programmers must be acquainted with the nuances of replica coordination. Or, all of the programmers must be constrained to use a given interface (e.g., causal group broadcasts) or abstraction (e.g., transactions).

These difficulties caused us to explore alternatives to the hardware, the operating system, and the application programs as the place for implementing replica coordination in a computing system.

A hypervisor is a software layer that implements virtual machines having the same instruction-set architecture as the hardware on which the hypervisor executes. Because the virtual machine's instruction-set architecture is indistinguishable from the bare hardware, software run on a virtual machine cannot tell whether a hypervisor is present. Perhaps the best known hypervisor is CP-67 [MS70], developed by IBM Corp. for the 360/67 and later evolved into VM/370 [IBM72] for System 370 mainframes. Hypervisors for other machines have also been constructed [PK75] [K82]. An excellent survey of research on virtual machines appears in [G74].

There are a variety of reasons for using a hypervisor. A hypervisor allows multiple operating systems, or multiple versions of the same operating system, to coexist on a single hardware processor. Even when virtual machines all execute the same operating system, a hypervisor provides an isolation that simplifies protection and sharing [PK74] [K82]. Our research is not concerned with the virtues and costs of hypervisors, though. We are concerned with the virtues and costs of augmenting a hypervisor to support replica coordination and, in that manner, support fault tolerance.

Use of a hypervisor to implement replica coordination is attractive, at least in theory, since it addresses the three problems described above. Replica coordination implemented in a hypervisor instantly becomes available to all hardware realizations of the given instruction-set architecture, including realizations that did not exist when the hypervisor was written. This addresses problem (1). For problem (2), we observe that implementing replica coordination in a hypervisor means that a single implementation will suffice for every operating system that executes on that instruction-set architecture. Finally, problem (3) is addressed because implementing replica coordination in a hypervisor frees the application programmer from this task without imposing an interface or abstraction.

The question, then, is whether hypervisor-based replica coordination is practical. What is the performance penalty? This paper addresses these issues by describing the protocols and performance of a prototype implementation of hypervisor-based fault tolerance. The prototype executes programs about a factor of 2 slower than a bare machine would.

Our protocols use a single backup and implement a 1-fault-tolerant virtual machine; generalization to t-fault-tolerant virtual machines is straightforward. The protocols cause the backup virtual machine to execute exactly the same sequence of instructions as the primary virtual machine, where each instruction executed by the backup has the same effect as when it is executed by the primary. (The environment for a virtual machine comprises ... accessible to that virtual machine.) The protocols also ensure that the environment does not see an anomalous sequence of I/O requests if the
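
As a rough illustration of the replica-coordination idea stated in the introduction (replicas execute the same sequence of instructions and so produce the same results), the following is a minimal Python sketch. It is not the paper's protocol: `ToyMachine`, `run_replicated`, and the `add`/`mul` instruction set are hypothetical names invented for this example, and the sketch ignores I/O, interrupts, and failure detection entirely.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMachine:
    """Hypothetical deterministic machine: state depends only on the inputs applied."""
    state: int = 0
    executed: list = field(default_factory=list)

    def step(self, op: str, value: int) -> None:
        # Each instruction has the same effect on any replica that executes it.
        if op == "add":
            self.state += value
        elif op == "mul":
            self.state *= value
        else:
            raise ValueError(f"unknown op: {op}")
        self.executed.append((op, value))


def run_replicated(instructions, primary_fails_after=None):
    """Feed one instruction stream to a primary and a backup replica.

    If primary_fails_after is an index, the primary stops before executing
    that instruction (a simulated crash). Returns the state of the replica
    that is still running at the end.
    """
    primary, backup = ToyMachine(), ToyMachine()
    primary_alive = True
    for i, (op, value) in enumerate(instructions):
        if primary_fails_after is not None and i >= primary_fails_after:
            primary_alive = False  # simulated failure of the primary
        if primary_alive:
            primary.step(op, value)
        backup.step(op, value)  # the backup sees the identical sequence
    survivor = primary if primary_alive else backup
    return survivor.state


program = [("add", 2), ("mul", 5), ("add", 1)]
print(run_replicated(program))                         # 11
print(run_replicated(program, primary_fails_after=1))  # 11
```

Because both replicas are deterministic and receive the same sequence, the surviving replica ends in the same state whether or not the primary fails. With a single backup, this toy setup tolerates one failure, in the spirit of the 1-fault-tolerant virtual machine mentioned above.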


MIT 6.824 - Hypervisor-based Fault-tolerance
