Berkeley COMPSCI 294 - Measuring System and Software Reliability using an Automated Data Collection Process


Measuring System and Software Reliability using an Automated Data Collection Process

Brendan Murphy
Digital Equipment Scotland Ltd
Mosshill Industrial Estate
Ayr KA6 6BE
Scotland
[email protected]

Ted Gent
Digital Equipment Corporation
129 Parker Street
Maynard MA 01754-2198
USA
[email protected]

Summary

The customer's computing environment is undergoing a revolution away from a server or timeshare centric model to a client/server or distributed model. The factors which impact the behaviour of this environment can no longer be identified solely through traditional methods of data collection. Digital Equipment Corporation has developed an automated data collection process that collects on-system data logging information from customer sites and has yielded consistent, quantitative, high integrity information. This information has been used to proactively focus on direct product and process improvements. This paper describes the on-system data logging process and analysis methodology used by Digital to measure system, product and operating system reliability, provides examples of the application of the techniques, and offers insight into the causes of failures.

Key Words: System Reliability, Software Reliability, Automated Data Collection, Customer Survey, Event Logging, Operating System Reliability.

1. Introduction

Digital has been monitoring systems in the field, using a variety of data collection techniques, for over 15 years. During that period significant changes have occurred in the reliability profile of systems and in the operating and development environments. Examples of changes in the reliability profile of VAX systems are described in this paper. In addition, increasing competitive pressures are reducing development cycle times and development budgets, which makes it imperative to focus on product quality improvements in areas with the greatest impact on customer satisfaction.
It is essential that a data collection and analysis system provide information contributing to design direction and trade-offs. A number of traditional methods are available for providing performance feedback, and the strengths and weaknesses of each method are examined in this paper. Due to the limitations of these methodologies, Digital has developed an on-line data capture process which provides data to product design, manufacturing and services organisations to continuously improve the reliability of Digital products and systems. This paper provides a detailed description of the on-line data capture process and the techniques applied to analyse the data. Digital has successfully used this process for a number of years, resulting in a substantial amount of behavioural information being available for analysis. This paper provides examples of some of the information captured through this process and describes how the process measures the reliability of systems and versions of operating systems.

2. Changes in System Reliability

In the 1970s and early 1980s, the majority of customer systems were stand-alone servers driving non-intelligent terminals. The reliability of the hardware and of the operating systems were the significant factors impacting the performance of these systems. Customer sites were mainly homogeneous, with systems managed by MIS departments specialising in particular product sets. Significant changes have occurred both to the environment into which the systems are configured and to the reliability profiles of individual products. Digital has been monitoring these changes and measuring their impact on its product sets.
This section describes these changes using, as an example, information collected from VAX systems on customer sites.

Measuring System and Software Reliability using an Automated Data Collection Process, Quality & Reliability, Brendan Murphy & Ted Gent, DPP (Digital Product Performance)

2.1 Changes in Product Reliability

During the late 1970s and early 1980s, the reliability of hardware and the operating system were the major contributors to system outages. In 1985 these factors accounted for 70% of all system crashes occurring on Digital's VAX systems on customer sites. Other factors causing system crashes were not measured at that time, as they were not viewed as being significant (Figure 1 estimates the impact of other factors causing system crashes in 1985).

Over the last 10 years, the reliability of both hardware and operating systems has dramatically increased. Improvements to hardware reliability are primarily due to the use of large scale integration and the wide scale adoption of Computer Aided Design processes. Improvements in the reliability of operating systems are due to a shift in focus from adding pure functionality to balancing any added functionality with reliability and recoverability attributes.

Customers and product suppliers continue to view the hardware failure rate as an important measure of system reliability, in spite of the changes in the profile of system crashes. This may be due to (a) hardware failures having a significant impact on the customer operation, (b) the lack of an acceptable industry standard to measure reliability in terms of the rate of system interruptions, or (c) hardware failures resulting in a significant cost to the service providers through both part replacement and the cost of the service engineers visiting the customer site.

Figure 1: Cause of System Crashes

Root cause analysis of system crashes performed on VAX systems has identified the increasing impact that system management problems have on the system crash rate.
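The cause breakdown discussed in this section amounts to tallying classified crash records and expressing each category as a share of the total. The following is a minimal sketch of that arithmetic only. The function name, record layout, category labels and sample records are all hypothetical illustrations; they are not Digital's data, tooling or classification scheme.

```python
from collections import Counter


def crash_cause_shares(records):
    """Return each cause's share of all crashes as a fraction between 0 and 1.

    `records` is an iterable of (system_id, cause) pairs. This layout is an
    assumption made for illustration, not a format described in the paper.
    """
    counts = Counter(cause for _system_id, cause in records)
    total = sum(counts.values())
    if total == 0:
        return {}  # no crashes recorded, so there are no shares to report
    return {cause: n / total for cause, n in counts.items()}


# Made-up records for illustration only: (system id, classified root cause).
sample = [
    ("node-a", "hardware"),
    ("node-a", "operating_system"),
    ("node-b", "system_management"),
    ("node-c", "system_management"),
]

shares = crash_cause_shares(sample)
for cause, share in sorted(shares.items()):
    print(f"{cause}: {share:.0%}")
```

Reporting shares rather than raw counts makes profiles from different years or different installed bases comparable, although a share says nothing about the absolute crash rate.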
In 1993, over 50% of system crashes on VAX systems were due to system management problems. The crash types classified within this category are:

(a) Crashes resulting from system management actions. Examples are the incorrect setting of system parameters, the incorrect installation of applications, the incorrect configuration of systems, etc.

(b) Multiple crashes resulting from one problem. Crashes increasingly occur after a disruption to the system. The system manager may ignore the initial crash and only address the problem as a result of subsequent crashes. Multiple crashes also

