FAB Building Distributed Enterprise Disk Arrays from Commodity Components Yasushi Saito Svend Fr lund Alistair Veitch Arif Merchant Susan Spence Hewlett Packard Laboratories firstname lastname hp com ABSTRACT This paper describes the design implementation and evaluation of a Federated Array of Bricks FAB a distributed disk array that provides the reliability of traditional enterprise arrays with lower cost and better scalability FAB is built from a collection of bricks small storage appliances containing commodity disks CPU NVRAM and network interface cards FAB deploys a new majority votingbased algorithm to replicate or erasure code logical blocks across bricks and a reconfiguration algorithm to move data in the background when bricks are added or decommissioned We argue that voting is practical and necessary for reliable high throughput storage systems such as FAB We have implemented a FAB prototype on a 22 node Linux cluster This prototype sustains 85MB second of throughput for a database workload and 270MB second for a bulk read workload In addition it can outperform traditional masterslave replication through performance decoupling and can handle brick failures and recoveries smoothly without disturbing client requests Categories and Subject Descriptors D 4 5 Software Operating systems Reliability C 5 5 Computer system implementation Servers H 3 4 Information storage and retrieval Systems and software Distributed systems General Terms Algorithms Management Performance Reliability Keywords Storage disk array replication erasure coding voting consensus 1 INTRODUCTION A Federated Array of Bricks FAB is a distributed disk array that provides reliable accesses to logical volumes using only commodity hardware It solves the two problems scalability and cost associated with traditional monolithic disk arrays Traditional disk arrays drive collections of disks using centralized controllers They achieve reliability via highly customized Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page To copy otherwise to republish to post on servers or to redistribute to lists requires prior specific permission and or a fee ASPLOS 04 October 7 13 2004 Boston Massachusetts USA Copyright 2004 ACM 1 58113 804 0 04 0010 5 00 redundant and hot swappable hardware components They do not scale well because there is a high up front cost for even a minimally configured array and a single system can only grow to a limited size These limitations force manufacturers to develop multiple products for different system scales which multiplies the engineering efforts required These issues coupled with relatively low manufacturing volumes drive up their cost high end arrays retail for many millions of dollars at least 20 times more than the price of consumer class systems with equivalent capacity FAB consists of a collection of bricks small rack mounted computers built from commodity disks CPU and NVRAM connected by standard networks such as Ethernet Bricks autonomously distribute data and functionality across the system to present a highly available set of logical volumes to clients through standard diskaccess interfaces such as iSCSI 32 FAB can scale incrementally starting from just a few bricks and adding more bricks as demand grows up to several hundred bricks It is also cheaper than traditional arrays due to the economies of scale inherent in high volume production a brick with 12 SATA disks and 1GB of NVRAM can be built for less than 2000 with a total system cost of about 20 to 80 of traditional arrays even with three way replication Commodity hardware is of course far less reliable than its enterprise counterparts Using the reliability figures reported in 4 3 we expect the mean time between failures of a typical network switch to be 4 years and that of a typical brick to be 4 to 30 years depending on the quality of disks and the internal disk organization e g RAID 5 is more reliable than RAID 0 FAB inevitably faces frequent changes to the system including brick failures or additions and network partitioning The FAB project tries to achieve two goals in such environments First FAB should provide continuous service masking failures transparently and ensuring stable performance over diverse workloads Second it should ensure high reliability comparable to that of today s high end disk arrays 10 000 mean years before the first data loss tolerating the failures of disks CPUs or networks The key idea behind FAB to achieve these goals is replication and erasure coding by voting Acting on behalf of a client a read or write request coordinator communicates with a subset quorum of bricks that store the data Voting allows FAB to tolerate failed bricks and network partitioning safely without blocking It also enables performance decoupling 24 tolerating overloaded bricks by simply ignoring them as long as others are responsive This is especially effective in systems like FAB in which brick response times fluctuate due to the randomness inherent in disk head mechanisms Voting based replication is not new but it has seen little use in high throughput systems because of concerns about inefficiency as reading data must involve multiple remote nodes 36 In this paper we show that voting is indeed practical and often necessary for reliable high throughput storage systems Specifically our contributions are New replication and erasure coding algorithms We present asynchronous voting based algorithms that ensure strictly linearizable accesses 17 2 to replicated or erasure coded data They can handle any non Byzantine failures including brick failures network partitioning and slow bricks Existing algorithms 5 27 in contrast not only lack erasure coding support but also could break consistency when a brick that coordinates a request crashes in the middle A new dynamic quorum reconfiguration algorithm FAB can adjust quorum configurations dynamically while allowing I O requests from clients to proceed unimpeded It improves reliability by allowing the system to tolerate more failures than in a system with fixed quorum voting and by adding a new brick after another brick is decommissioned Efficient implementation and evaluation of FAB We present several techniques that improve the efficiency of these algorithms and
View Full Document
Unlocking...