DOC PREVIEW
MultiMap: Preserving disk locality for multidimensional datasets

This preview shows page 1-2-3 out of 10 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 10 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 10 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 10 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 10 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

MultiMap: Preserving disk locality for multidimensional datasetsMinglong Shao , Steven W. Schlosser†, Stratos Papadomanolakis , Jiri Schindler‡Anastassia Ailamaki , Gregory R. GangerCarnegie Mellon University†Intel Research Pittsburgh‡Network Appliance, IncAbstractMultiMap is an algorithm for mapping multidimensionaldatasets so as to preserve the data’s spatial locality ondisks. Without revealing disk-specific details to applica-tions, MultiMap exploits modern disk characteristics to pro-vide full streaming bandwidth for one (primary) dimensionand maximally efficient non-sequential access (i.e., mini-mal seek and no rotational latency) for the other dimen-sions. This is in contrast to existing approaches, whicheither severely penalize non-primary dimensions or fail toprovide full streaming bandwidth for any dimension. Exper-imental evaluation of a prototype implementation demon-strates MultiMap’s superior performance for range andbeam queries. On a verage, MultiMap reduces total I/O timeby over 50% when compared to traditional linearized lay-outs and by over 30% when compared to spa ce-filling curveapproaches such as Z-ordering and Hilbert curves. Forscans of the primary dimension, MultiMap and traditionallinearized layouts provide almost two ord ers of magnitudehigher throughput than space-filling curve approaches.1 IntroductionApplications accessing multidimen sional datasets are in -creasingly common in modern database systems. The basicrelational model used by conventional database systems or-ganizes information with tables o r relations, which are 2-Dstructures. Spatial databases directly manage multidimen-sional data for applications such as geographic informationsystems, medical image databases, multimedia databases,etc. An increasing number of applications that process mul-tidimensional data run on spatial databases, such as sci-entific computing applications (e.g., earthquake simulationand oil/gas exploration) and business support systems usingonline analytical processing (OLAP) techniques.Existing mapping algorithms based on the simple linearabstraction of storage devices offered by standard interfacessuch as SCSI are insufficient for workloads that access out-of-core multidimensional datasets. To illustrate the prob-lem, consider mapping a relational database table onto thelinear address space of a single disk d rive or a logical vol-ume consisting of multiple disks. A naive approach requiresmaking a choice between storing the table in row-major orcolumn-major order, trading off access performance alongthe two dimensions. While accessing the table in the pri-mary order is efficient, with requests to sequential diskblocks, access in the other order is inefficient: accesses atregular strides incur short seeks and variable rotational la-tencies, resulting in near-random-access performance. Sim-ilarly, range queries are inefficient if they extend beyond asingle dimension. The problem is more serious for higherdimensional datasets: sequentiality can only b e preservedfor a single dimension and all other dimensions will be, es-sentially, scattered across the disk.The shortcomings of non-sequential disk drive accesseshave motivated a healthy body of research on mapping algo-rithms using space-filling cu rves, such as Z-ordering [15],Hilbert curves [11], and Gray-coded curves [7]. These ap-proaches traverse the multidimen sional dataset and imposea total order on the dataset when storing data on disks. Theycan help preserve locality for multidimension al datasets, butthey do not allow accesses along any one dimension to takeadvantage of streaming bandwidth, the best performance adisk drive can deliver. This is a high price to pay, since theperformance difference betw een streaming bandwidth andnon-sequential accesses is at least two orders of magnitude.Recent work [22] describes a new generalized model ofdisks, called the adjacency model, for exposing multiple ef-ficient access paths to fetch non-contiguous disk blocks.With this new model, it becomes feasib le to create datamapping algorithms that map multiple data dimensions tophysical disk access paths so as to optimize access to morethan one dimension.This paper describes MultiMap, a data mapping algo-rithm that preserves spatial locality of multidim ensionaldatasets by taking advantage of the adjacency model. Mul-tiMap maps neighboring blocks in the dataset into specificdisk blocks on nearby tracks, called adjacent blocks,suchIEEE 23rd International Conference on Data Engineering (ICDE 2007) Istanbul, Turkey, April 2007.that they can be accessed for eq ual positioning cost andwithout any rotational latency. We describe a general al-gorithm for MultiMap and evaluate MultiMap on a proto-type implementation that uses a logical volume of real diskdrives with 3-D and 4-D datasets. The results show that,on average, MultiMap reduces total I/O time by over 50%when compared to the naive mapping and by over 30%when compared to space-fillin g curve approaches.The remainder of the paper is organized as follows. Sec-tion 2 d escribes related work. Section 3 outlines the charac-teristics of modern disk technology that enable MultiMap.Section 4 introduces MultiMap in detail. Section 5 evalu-ates the performance of MultiMap, and section 6 concludes.2 Related workOrganizing multidimensional data for efficient accesshas become increasingly important in both scientific com-puting and business support systems, where dataset sizes areterabytes or more. Queries on these datasets involve dataaccesses on d ifferent dimensions with various access pat-terns [10, 25, 29]. Data storage and management for mas-sive multidimensional data have two primary tasks: dataindexing, for quick location of the needed data, and dataplacement, which arranges data on storage devices for effi-cient retrieval. There is a large body of previous work onthe two closely related but separate topics as they apply tomultidimensional datasets. Our work focuses on data place-ment which happens after indexing.Under the assumption that disks are one-dimensional de-vices, various data placement methods have been proposedin th e literature, such as the Naive mapping, described in theprevious section, and spacing-filling cu rve mappings utiliz-ing Z-ordering [15], Hilbert [11], or Gray-coded curve [7].Their goal is to order multidimen sional data such that spa-tial locality can be preserved as much as possible withinthe 1-D disk abstraction. The


MultiMap: Preserving disk locality for multidimensional datasets

Download MultiMap: Preserving disk locality for multidimensional datasets
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view MultiMap: Preserving disk locality for multidimensional datasets and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view MultiMap: Preserving disk locality for multidimensional datasets 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?