Workload and Failure Characterization

Workload and Failure Characterization on a Large-Scale Federated Testbed

Brent N. Chun, Intel Research [email protected]
Vahdat, University of California, San [email protected]

A number of federated distributed computational and communication infrastructures have emerged, including the Grid, PlanetLab, and Content Distribution Networks. In these environments, mutually distrustful autonomous domains pool resources together for their mutual benefit, for instance to gain access to unique computational resources, multiple vantage points on the network, or more computation than is available locally. Key challenges for such federated infrastructures include resource allocation, scheduling, and constructing highly available services in the face of faulty end hosts and unpredictable network behavior. Developing appropriate mechanisms and policies requires an understanding of the usage characteristics and operating environment of the target environment. In this paper, we present a detailed characterization of the actual use of the PlanetLab network testbed. PlanetLab consists of 240 nodes spread across 100 autonomous domains with over 500 active users. Using a variety of measurement tools, we present a three-month study on the network, CPU, memory and disk usage of individual PlanetLab nodes and sites. On the consumer side, we further characterize the consumption of individual users. Next, we present results on the availability and reliability of system nodes and the network interconnecting them. Finally, we discuss the implications of our measurements for emerging federated environments.

1. INTRODUCTION

A number of forces have contributed to the recent popularity of federated, distributed computation and communication infrastructures. The vision of a computational grid [13] promises the ability to leverage statistical multiplexing and unique hardware resources across the network to carry out computations larger than might be possible within any single site or administrative domain.
Next, the advent of large-scale distributed systems such as distributed hash tables, peer-to-peer file sharing, and network measurement and characterization has led a number of researchers to build personal testbeds consisting of available machines at various points in the network. These testbeds facilitate distributed system development and evaluation, as well as network measurement and characterization. Given the commonality in the requirements of this community (a set of available machines at a diversity of sites across the network), a variety of shared testbeds have been developed over the years. PlanetLab [18] is the latest, largest and perhaps most advanced of these testbeds. Finally, on the production side, companies deploying content distribution networks and shared hosting environments are considering techniques to pool their resources across trust boundaries to more cost-effectively deliver high levels of performance and availability to their customers.

These emerging federated testbeds are growing to significant size and to significant geographic and administrative diversity. For instance, as of October 2003, the PlanetLab infrastructure consisted of 240 nodes at over 100 distinct administrative domains in 19 countries, with 500 active users. As the size and reach of a federated infrastructure grows, important challenges include resource discovery, scheduling, resource allocation policies, and reliability among mutually distrustful users and sites. Clearly, the appropriate mechanisms and policies depend on the exact usage characteristics of the system under consideration. For instance, if resources were never constrained, very simple resource allocation policies would be appropriate.

While such federated infrastructures are growing in popularity and importance, there is currently little understanding of how the resources are actually used. Thus, the goal of this work is to characterize the aggregate, per-site, and per-user resource consumption characteristics of the PlanetLab testbed.
In addition, since system resources are under the control and administration of a wide variety of authorities, we also measured the availability of the testbed, with an eye toward the mechanisms appropriate for supporting reliable large-scale distributed systems.

We instrumented the PlanetLab infrastructure to capture a broad range of per-node characteristics and present the results of our study over a three-month period, from July to October 2003. Our high-level findings include: i) the system transitions from periods of light load to periods of heavy contention, ii) a small number of users account for the majority of system activity, iii) active distributed services typically remain active for less than 5 minutes, though some services remain active for weeks, iv) when averaged across a day and the entire infrastructure, the majority of services consume less than 1% of a single machine's resources in aggregate, v) most nodes demonstrate high levels of availability and low mean times to repair; however, the tail is long, with 10% of nodes demonstrating extremely low levels of reliability, vi) node failures can be correlated significantly beyond the level predicted by correlation of node failures at a single site.

Of course, we cannot claim that the specifics of our measurements are representative of how such federated infrastructures may be used in general. However, we discuss the implications of our measurements for emerging PlanetLab infrastructure services in Section 6. Further, we believe that the general trends displayed by our testbed are likely to reflect at least some of the characteristics of emerging federated distributed systems.

2. PLANETLAB OVERVIEW

Our study examines PlanetLab, an open, global network testbed for developing, deploying and accessing widely distributed network services. The goal of PlanetLab is to grow to 1000 geographically distributed nodes situated in a variety of diverse locations on the Internet (e.g., colocation centers and edge sites).
PlanetLab targets services that require broad geographic coverage for reasons including leveraging multiple vantage points on the network, providing physical proximity to data sources and sinks, providing multiple independent failure domains, and spanning multiple administrative and political boundaries.

In October 2003, the testbed consisted of 240 nodes at over 100 sites in 19 countries. It has been in production use since July 2002, currently supports over 120 active research

