UW-Madison CS 736 - Improving Content Addressable Storage for Databases

CS 736 Conference on Reliable Awesome Projects (no acronyms please), 2008

Improving Content Addressable Storage for Databases

Vandhana Selvaprakash and Brandon M. Smith
Department of Computer Sciences
University of Wisconsin, Madison, WI 53706
E-mail: {vandhana,bmsmith}@cs.wisc.edu

Abstract

Recent trends have seen database clients use content addressable storage systems (CASs) for near-line storage. While CASs have many attractive properties, such as storage space savings, data integrity, and low network bandwidth requirements, they are not well suited for databases, mainly because of the rigid structure of databases and the way databases intersperse metadata with data. In this paper, we evaluate where current CAS techniques fail for databases and identify properties of database systems that can be leveraged for potential improvements to CAS techniques specific to databases. We propose five ways in which CAS systems can be made “database aware,” and evaluate the potential strengths and weaknesses of each approach. We find that our techniques improve memory savings, but at the expense of coupling the solution too closely to particular database vendors.

1 Introduction

CAS divides files into chunks (other names include segments [3], blocks [1], atomics [7], or fragments [6]). Each chunk is indexed by a descriptor, or fingerprint, that uniquely identifies its contents. Files are hence content addressable, rather than location addressable. Identical chunks have the same fingerprint, which ensures that duplicate data is stored only once. This technique substantially reduces the amount of storage space required, provided that duplicate information exists in the file system. Virtually all the systems we study use the cryptographic SHA-1 hash function [5] to calculate the fingerprint for each chunk. One of the most desirable properties of this hash function is that it is very efficient to compute. Another desirable property is its strong collision resistance.
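The fingerprint-indexed store described above can be sketched in a few lines of Python. SHA-1 via hashlib matches the hash function the surveyed systems use; the class and method names are purely illustrative.

```python
import hashlib

class ChunkStore:
    """Toy content-addressable store: each chunk is indexed by its
    SHA-1 fingerprint, so identical chunks are stored only once."""

    def __init__(self):
        self.chunks = {}  # hex fingerprint -> chunk bytes

    def put(self, chunk: bytes) -> str:
        fp = hashlib.sha1(chunk).hexdigest()
        # setdefault: a duplicate chunk costs no additional storage.
        self.chunks.setdefault(fp, chunk)
        return fp

    def get(self, fp: str) -> bytes:
        return self.chunks[fp]

store = ChunkStore()
a = store.put(b"database page 17")
b = store.put(b"database page 17")   # identical content
assert a == b and len(store.chunks) == 1
```

Storing the same content twice returns the same fingerprint and occupies a single entry, which is the space-savings property the paper relies on.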
With extremely high probability,¹ each fingerprint will uniquely describe the contents of its corresponding chunk.

¹ Assuming random hash values and a uniform distribution, a fingerprint collision for an exabyte (10^18 bytes) of data divided into 8-kilobyte chunks will occur with probability less than 10^-20, which is many orders of magnitude less than hardware error rates [1].

A three-way distinction can be made between chunking techniques. Data can be chunked based on: (1) whole-file content, (2) fixed-size chunks, or (3) variable-size, content-based chunks. In (1), the chunk boundaries are simply the beginning and end of each file being stored. Techniques based on (2) determine chunk boundaries by fixed offsets from the beginning of each file. Techniques based on (3) are more complicated and typically involve discovering anchor points throughout each file based on its contents; this is usually accomplished using some variant of the Rabin fingerprinting technique described by Udi Manber in [4]. Although (1) and (2) are simple, in most situations they effectively make files stored on the system immutable: a small edit shifts the content at every subsequent offset, so almost every fingerprint changes. Variable-size chunking techniques do not suffer from this problem.

2 Motivation

Content addressable storage systems (CASs) are primarily used to store data that changes rarely and is accessed infrequently (write once, read many). Databases increasingly use CASs for storing backup information and snapshots. Unlike traditional CAS data (business data, MRI images), which is loosely coupled with metadata (if any), databases are not well suited for CAS. The problem with database storage structures is that metadata is (a) substantial, and (b) tightly interspersed with data. It is therefore difficult to exploit data redundancy for space savings. Another limitation of current CAS techniques is that they only exploit identical files, blocks, or chunks. Typically, many pages within database storage structures are similar, but very few are identical.
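The boundary-shift problem with fixed-size chunking, and the identical-chunks-only limitation above, can be made concrete with a short sketch (the data and 16-byte chunk size are illustrative): inserting a single byte at the front of a file shifts every chunk boundary, so a CAS that matches only identical chunks finds nothing to deduplicate.

```python
import hashlib

def fixed_chunks(data: bytes, size: int):
    """Split data at fixed offsets (technique 2) and fingerprint each chunk."""
    return [hashlib.sha1(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]

original = bytes(range(256))   # 256 distinct bytes -> 16 chunks
edited = b"!" + original       # insert one byte at the front

before = fixed_chunks(original, 16)
after = fixed_chunks(edited, 16)

# Every boundary shifted by one byte, so no fingerprints survive the
# edit and the store gains no space savings from the 256-byte overlap.
print(len(set(before) & set(after)))  # 0
```

The two files share 256 of 257 bytes, yet share zero fingerprints; this is exactly the kind of "similar but not identical" data that database pages produce.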
Databases make up a significant portion of storage system clients. Simply put, they are too important to ignore. Given the increasing use of CASs by database systems, we aim to improve CAS techniques specifically for the database domain.

Table 1: Target chunk size vs. actual average chunk size

Data type                               Target (bytes)   Actual avg. (bytes)
1 GB Music Data                         128              175
(mp3, wma, wav, m4a)                    256              303
                                        512              559
                                        1024             1070
                                        2048             2094
                                        4096             4137
                                        8192             8220
                                        16384            16413
                                        32768            32848
330 MB Sparse Oracle Tablespace         8192             2133
55 MB Sparse PostgreSQL Tablespace      8192             1544
3560 MB Mix of Files                    4096             1100
(pdf, txt, doc, mpg, jpg, etc.)         8192             1849
                                        16384            2062

Our approach to this problem is as follows. We implement several key features of state-of-the-art CASs and evaluate these features on different kinds of data, including music files, text data, and database data (Oracle [10] and PostgreSQL [9]). We determine quantitatively that recent variable-size chunking techniques perform poorly on database data. We then study the low-level storage structure of databases (mainly Oracle and PostgreSQL) and identify properties that can be leveraged for potential improvements to CAS techniques. We propose five ways in which CAS systems can be made “database aware,” and evaluate the potential strengths and weaknesses of each approach. We find that our techniques improve memory savings, but at the expense of coupling the solution too closely to particular database vendors.

The rest of the paper is organized as follows. Section 3 evaluates current state-of-the-art CAS techniques; Section 4 discusses database semantics; Section 5 discusses five approaches to improving CAS techniques for databases; Section 6 discusses related work; finally, we conclude.
3 Evaluation of State-of-the-Art CAS Techniques

3.1 Implementation

To evaluate state-of-the-art CASs [3], we implement the following:

• Variable-size chunking based on Rabin fingerprinting
• An efficient SHA-1 hash (chunk fingerprint) index storage structure based on a binary search tree
• A summary vector to further optimize the speed of chunk lookup

We implement a variable-size chunking technique based on a widely cited paper by Udi Manber [4]. The technique uses Rabin fingerprints to discover anchor points within each file. A sliding 48-byte² window scans through each file. At each new byte, the fingerprint is updated and the
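The sliding-window scheme just described can be sketched as follows. A simple polynomial rolling hash stands in for true Rabin fingerprints, and the modulus, base, and 13-bit boundary mask (giving an expected chunk size of ~8 KB) are illustrative choices, not the paper's exact parameters.

```python
import random

def variable_chunks(data: bytes, window: int = 48, mask: int = 0x1FFF,
                    base: int = 257, mod: int = (1 << 31) - 1):
    """Content-based chunking: a rolling hash of the last `window` bytes
    is updated at each new byte; when its low bits are all zero, the
    position is declared an anchor point (chunk boundary)."""
    chunks, start, h = [], 0, 0
    pow_w = pow(base, window, mod)  # weight of the byte leaving the window
    for i, byte in enumerate(data):
        h = (h * base + byte) % mod
        if i >= window:
            h = (h - data[i - window] * pow_w) % mod
        # A 13-bit mask makes a boundary fire about once per 8 KB.
        if i >= window and (h & mask) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

random.seed(736)
data = bytes(random.randrange(256) for _ in range(100_000))
chunks = variable_chunks(data)
assert b"".join(chunks) == data  # chunking is lossless
```

Because each boundary depends only on the 48 bytes behind it, an insertion or deletion perturbs the chunking only locally: chunks past the first post-edit anchor are unchanged, which is what lets a variable-size CAS deduplicate edited files where fixed-size chunking cannot.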

