Berkeley COMPSCI 294 - Failure Trends in a Large Disk Drive Population


Appears in the Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST’07), February 2007

Failure Trends in a Large Disk Drive Population

Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz André Barroso
Google Inc.
1600 Amphitheatre Pkwy
Mountain View, CA 94043
{edpin,wolf,luiz}@google.com

Abstract

It is estimated that over 90% of all new information produced in the world is being stored on magnetic media, most of it on hard disk drives. Despite their importance, there is relatively little published work on the failure patterns of disk drives, and the key factors that affect their lifetime. Most available data are either based on extrapolation from accelerated aging experiments or from relatively modest sized field studies. Moreover, larger population studies rarely have the infrastructure in place to collect health signals from components in operation, which is critical information for detailed failure analysis.

We present data collected from detailed observations of a large disk drive population in a production Internet services deployment. The population observed is many times larger than that of previous studies. In addition to presenting failure statistics, we analyze the correlation between failures and several parameters generally believed to impact longevity.

Our analysis identifies several parameters from the drive’s self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.

1 Introduction

The tremendous advances in low-cost, high-capacity magnetic disk drives have been among the key factors helping establish a modern society that is deeply reliant on information technology.
High-volume, consumer-grade disk drives have become such a successful product that their deployments range from home computers and appliances to large-scale server farms. In 2002, for example, it was estimated that over 90% of all new information produced was stored on magnetic media, most of it being hard disk drives [12]. It is therefore critical to improve our understanding of how robust these components are and what main factors are associated with failures. Such understanding can be particularly useful for guiding the design of storage systems as well as devising deployment and maintenance strategies.

Despite the importance of the subject, there are very few published studies on failure characteristics of disk drives. Most of the available information comes from the disk manufacturers themselves [2]. Their data are typically based on extrapolation from accelerated life test data of small populations or from returned unit databases. Accelerated life tests, although useful in providing insight into how some environmental factors can affect disk drive lifetime, have been known to be poor predictors of actual failure rates as seen by customers in the field [7]. Statistics from returned units are typically based on much larger populations, but since there is little or no visibility into the deployment characteristics, the analysis lacks valuable insight into what actually happened to the drive during operation.
In addition, since units are typically returned during the warranty period (often three years or less), manufacturers’ databases may not be as helpful for the study of long-term effects.

A few recent studies have shed some light on field failure behavior of disk drives [6, 7, 9, 16, 17, 19, 20]. However, these studies have either reported on relatively modest populations or did not monitor the disks closely enough during deployment to provide insights into the factors that might be associated with failures.

Disk drives are generally very reliable but they are also very complex components. This combination means that although they fail rarely, when they do fail, the possible causes of failure can be numerous. As a result, detailed studies of very large populations are the only way to collect enough failure statistics to enable meaningful conclusions. In this paper we present one such study by examining the population of hard drives under deployment within Google’s computing infrastructure.

We have built an infrastructure that collects vital information about all Google’s systems every few minutes, and a repository that stores these data in time-series format (essentially forever) for further analysis. The information collected includes environmental factors (such as temperatures), activity levels and many of the Self-Monitoring Analysis and Reporting Technology (SMART) parameters that are believed to be good indicators of disk drive health. We mine through these data and attempt to find evidence that corroborates or contradicts many of the commonly held beliefs about how various factors can affect disk drive lifetime.

Our paper is unique in that it is based on data from a disk population size that is typically only available from vendor warranty databases, but has the depth of deployment visibility and detailed lifetime follow-up that only an end-user study can provide.
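
As a rough illustration of the time-series repository described above, the following Python sketch stores periodic attribute-value samples per drive and answers time-window queries. It is not the System Health implementation, which the paper does not show at this level; the names `Sample`, `TimeSeriesStore`, `drive-0001` and `temperature_c`, and the five-minute sampling interval, are assumptions made purely for the example.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One observation of one attribute (hypothetical record layout)."""
    timestamp: int  # seconds since an arbitrary epoch
    value: float


class TimeSeriesStore:
    """Toy in-memory store: one time-ordered list per (drive, attribute)."""

    def __init__(self):
        self._series = defaultdict(list)

    def record(self, drive_id, attribute, timestamp, value):
        series = self._series[(drive_id, attribute)]
        # Assumption: samples for a given series arrive in time order.
        if series and timestamp < series[-1].timestamp:
            raise ValueError("out-of-order sample")
        series.append(Sample(timestamp, value))

    def query(self, drive_id, attribute, start, end):
        """Return all samples with start <= timestamp <= end."""
        series = self._series.get((drive_id, attribute), [])
        times = [s.timestamp for s in series]
        lo = bisect_left(times, start)
        hi = bisect_right(times, end)
        return series[lo:hi]


# Record three temperature readings taken 300 seconds apart
# (the interval is an assumption), then read back a window.
store = TimeSeriesStore()
for i, temp in enumerate([31.0, 32.5, 33.0]):
    store.record("drive-0001", "temperature_c", i * 300, temp)

window = store.query("drive-0001", "temperature_c", 300, 600)
print([s.value for s in window])  # [32.5, 33.0]
```

Keying each series by (drive, attribute) keeps every signal independently queryable, which is the property an analysis job would need to relate, say, temperature history to a later failure.
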
Our key findings are:

• Contrary to previously reported results, we found very little correlation between failure rates and either elevated temperature or activity levels.
• Some SMART parameters (scan errors, reallocation counts, offline reallocation counts, and probational counts) have a large impact on failure probability.
• Given the lack of occurrence of predictive SMART signals on a large fraction of failed drives, it is unlikely that an accurate predictive failure model can be built based on these signals alone.

2 Background

In this section we describe the infrastructure that was used to gather and process the data used in this study, the types of disk drives included in the analysis, and information on how they are deployed.

2.1 The System Health Infrastructure

The System Health infrastructure is a large distributed software system that collects and stores hundreds of attribute-value pairs from all of Google’s servers, and provides the interface for arbitrary analysis jobs to process that data.

The architecture of the System Health infrastructure is shown in Figure 1. It consists of a data collection layer, a distributed repository and an analysis framework. The collection layer is responsible for getting information from each of thousands of

