Building a Database on S3

Matthias Brantner, Daniela Florescu, David Graf, Donald Kossmann, Tim Kraska

28msec Inc. (firstname.lastname@28msec.com)
Oracle (dana.florescu@oracle.com)
Systems Group, ETH Zurich (firstname.lastname@inf.ethz.ch)

ABSTRACT

There has been a great deal of hype about Amazon's simple storage service (S3). S3 provides infinite scalability and high availability at low cost. Currently, S3 is used mostly to store multi-media documents (videos, photos, audio) which are shared by a community of people and rarely updated. The purpose of this paper is to demonstrate the opportunities and limitations of using S3 as a storage system for general-purpose database applications which involve small objects and frequent updates. Read, write, and commit protocols are presented. Furthermore, the cost, performance, and consistency properties of such a storage system are studied.

Categories and Subject Descriptors
H.2.2 [Database Management]: Physical Design; H.2.4 [Database Management]: Systems - Concurrency, Distributed databases

General Terms
Algorithms, Design, Experimentation, Performance

Keywords
Cloud Computing, Database, AWS, Concurrency, Eventual Consistency, Storage System, Cost Trade-Off, Performance, SQS, S3, EC2, SimpleDB

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGMOD'08, June 9-12, 2008, Vancouver, BC, Canada.
Copyright 2008 ACM 978-1-60558-102-6/08/06 ...$5.00.

1. INTRODUCTION

The Web has made it easy to provide and consume content of any form. Building a Web page, starting a blog, and making both searchable for the public have become a commodity. Arguably, the next wave is to make it easy to provide services on the Web. Services such as Flickr, YouTube, SecondLife, or Myspace lead the way. The ultimate goal, however, is to make it easy for everybody to provide such services, not just the big guys. Unfortunately, this is not yet possible.

Clearly, there are non-technical issues that make it difficult to start a new service on the Web. Having the right business idea and effective marketing are at least as difficult on the Web as in the real world. There are, however, also technical difficulties. One of the most crucial problems is the cost to operate a service on the Web, ideally with 24×7 availability and acceptable latency. To run a large-scale service like YouTube, several data centers all around the world are needed. But even running a small service with a few friends involves a hosted server and a database which both need to be administrated. Running a service becomes particularly challenging and expensive if the service is successful: success on the Web can kill!

In order to overcome these issues, utility computing (aka cloud computing) has been proposed as a new way to operate services on the Internet [17]. The goal of utility computing is to provide the basic ingredients such as storage, CPUs, and network bandwidth as a commodity by specialized utility providers at low unit cost. Users of these utility services do not need to worry about scalability because the storage provided is virtually infinite. In addition, utility computing provides full availability; that is, users can read and write data at any time without ever being blocked; the response times are (virtually) constant and do not depend on the number of concurrent users, the size of the database, or any other system parameter. Furthermore, users do not need to worry about backups. If components fail, it is the responsibility of the utility provider to replace them and make the data available using replicas in the meantime.

Another important reason to build new services based on utility computing is that service providers only pay for what they get; i.e., pay-by-use. No investments are needed upfront and the cost grows linearly and predictably with the usage. Depending on the business model, it is even possible for the service provider to pass the cost for storage, computing, and networking to the end customers because the utility provider meters the usage.

The most prominent utility service today is AWS (Amazon Web Services) with its simple storage service, S3, as the most popular representative. Today, AWS and in particular S3 are most successful for multi-media objects. Smugmug (www.smugmug.com), for instance, is implemented on top of S3 [6]. Furthermore, S3 is popular as a backup device. For instance, there already exist products to backup data from a MySQL database to S3 [16]. In summary, S3 is already a viable option as a storage medium for large objects which are rarely updated. The purpose of this work is to explore whether S3 (and related utility computing services) are also attractive for other kinds of data (i.e., small objects) and as a general-purpose store for Web-based applications.

While the advantages of storage systems like S3 are compelling, there are also important disadvantages. First, S3 is slow as compared to an ordinary locally attached disk drive. Second, storage systems like S3 were designed to be cheap and highly available (see above), thereby sacrificing consistency [7]. In S3, for instance, it might take an undetermined amount of time before an update to an object becomes visible to all clients. Furthermore, updates are not necessarily applied in the same order as they were initiated. The only guarantee that S3 gives is that updates will eventually become visible to all clients and that the changes persist. This property is called eventual consistency [19]. If an application has additional consistency requirements, then such additional consistency guarantees must be implemented on top of S3 as part of the application.

The purpose of this paper is to explore how Web-based database applications (at any scale) can be implemented on top of utility services like S3. The paper presents various protocols in order to store, read, and update objects and indexes using S3. The ultimate goal is to preserve the scalability and availability of a distributed system like S3 and achieve the same level of consistency as a database system (i.e., ACID transactions). Unfortunately, it is not possible to have it all because of Brewer's famous CAP theorem [10]. Given the choice, this work follows the distributed systems approach, thereby preserving scalability and availability and maximizing the level of consistency.
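
To make the notion of eventual consistency described in the introduction more concrete, the following Python sketch simulates a replicated store in which a write is acknowledged immediately but reaches each replica only later and in arbitrary order. It is a toy model, not the S3 API: the class name EventuallyConsistentStore, the number of replicas, and the version counter that lets replicas converge on the most recent write are all assumptions introduced here for illustration and say nothing about how S3 is implemented internally.

```python
import random


class EventuallyConsistentStore:
    """Toy replicated key-value store (hypothetical; not the S3 API)."""

    def __init__(self, num_replicas=3, seed=0):
        # Each replica maps key -> (version, value).
        self.replicas = [{} for _ in range(num_replicas)]
        # Updates accepted by the store but not yet applied to a replica.
        self.pending = []
        self.version = 0
        self.rng = random.Random(seed)

    def put(self, key, value):
        # The write returns at once; delivery to each replica is deferred.
        self.version += 1
        for i in range(len(self.replicas)):
            self.pending.append((i, key, self.version, value))

    def get(self, key):
        # A read is served by an arbitrary replica, which may be stale.
        entry = self.rng.choice(self.replicas).get(key)
        return None if entry is None else entry[1]

    def propagate(self, steps=1):
        # Deliver up to `steps` pending updates in arbitrary order.
        for _ in range(min(steps, len(self.pending))):
            idx = self.rng.randrange(len(self.pending))
            i, key, version, value = self.pending.pop(idx)
            current = self.replicas[i].get(key)
            # Assumption of this sketch: the newest version wins, so
            # replicas converge even if updates arrive out of order.
            if current is None or current[0] < version:
                self.replicas[i][key] = (version, value)

    def settle(self):
        # Deliver every pending update; afterwards all replicas agree.
        self.propagate(len(self.pending))


store = EventuallyConsistentStore()
store.put("page", "v1")
stale = store.get("page")      # nothing has been delivered yet
store.put("page", "v2")
store.settle()
fresh = store.get("page")      # every replica now holds the last write
```

Between put and settle, a reader may observe no value, the old value, or the new value depending on which replica answers; only after all updates have been delivered do all readers see the same state. Any stronger guarantee would have to be layered on top by the application, as the introduction notes.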