MultiMap: Preserving disk locality for multidimensional datasets

Home> Academic Documents> MultiMap: Preserving disk locality for multidimensional datasets

DOC PREVIEW

This preview shows page 1-2-3 out of 10 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 10 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 10 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 10 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 10 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

MultiMap: Preserving disk locality for multidimensional datasetsMinglong Shao , Steven W. Schlosser†, Stratos Papadomanolakis , Jiri Schindler‡Anastassia Ailamaki , Gregory R. GangerCarnegie Mellon University†Intel Research Pittsburgh‡Network Appliance, IncAbstractMultiMap is an algorithm for mapping multidimensionaldatasets so as to preserve the data’s spatial locality ondisks. Without revealing disk-speciﬁc details to applica-tions, MultiMap exploits modern disk characteristics to pro-vide full streaming bandwidth for one (primary) dimensionand maximally efﬁcient non-sequential access (i.e., mini-mal seek and no rotational latency) for the other dimen-sions. This is in contrast to existing approaches, whicheither severely penalize non-primary dimensions or fail toprovide full streaming bandwidth for any dimension. Exper-imental evaluation of a prototype implementation demon-strates MultiMap’s superior performance for range andbeam queries. On a verage, MultiMap reduces total I/O timeby over 50% when compared to traditional linearized lay-outs and by over 30% when compared to spa ce-ﬁlling curveapproaches such as Z-ordering and Hilbert curves. Forscans of the primary dimension, MultiMap and traditionallinearized layouts provide almost two ord ers of magnitudehigher throughput than space-ﬁlling curve approaches.1 IntroductionApplications accessing multidimen sional datasets are in -creasingly common in modern database systems. The basicrelational model used by conventional database systems or-ganizes information with tables o r relations, which are 2-Dstructures. Spatial databases directly manage multidimen-sional data for applications such as geographic informationsystems, medical image databases, multimedia databases,etc. An increasing number of applications that process mul-tidimensional data run on spatial databases, such as sci-entiﬁc computing applications (e.g., earthquake simulationand oil/gas exploration) and business support systems usingonline analytical processing (OLAP) techniques.Existing mapping algorithms based on the simple linearabstraction of storage devices offered by standard interfacessuch as SCSI are insufﬁcient for workloads that access out-of-core multidimensional datasets. To illustrate the prob-lem, consider mapping a relational database table onto thelinear address space of a single disk d rive or a logical vol-ume consisting of multiple disks. A naive approach requiresmaking a choice between storing the table in row-major orcolumn-major order, trading off access performance alongthe two dimensions. While accessing the table in the pri-mary order is efﬁcient, with requests to sequential diskblocks, access in the other order is inefﬁcient: accesses atregular strides incur short seeks and variable rotational la-tencies, resulting in near-random-access performance. Sim-ilarly, range queries are inefﬁcient if they extend beyond asingle dimension. The problem is more serious for higherdimensional datasets: sequentiality can only b e preservedfor a single dimension and all other dimensions will be, es-sentially, scattered across the disk.The shortcomings of non-sequential disk drive accesseshave motivated a healthy body of research on mapping algo-rithms using space-ﬁlling cu rves, such as Z-ordering [15],Hilbert curves [11], and Gray-coded curves [7]. These ap-proaches traverse the multidimen sional dataset and imposea total order on the dataset when storing data on disks. Theycan help preserve locality for multidimension al datasets, butthey do not allow accesses along any one dimension to takeadvantage of streaming bandwidth, the best performance adisk drive can deliver. This is a high price to pay, since theperformance difference betw een streaming bandwidth andnon-sequential accesses is at least two orders of magnitude.Recent work [22] describes a new generalized model ofdisks, called the adjacency model, for exposing multiple ef-ﬁcient access paths to fetch non-contiguous disk blocks.With this new model, it becomes feasib le to create datamapping algorithms that map multiple data dimensions tophysical disk access paths so as to optimize access to morethan one dimension.This paper describes MultiMap, a data mapping algo-rithm that preserves spatial locality of multidim ensionaldatasets by taking advantage of the adjacency model. Mul-tiMap maps neighboring blocks in the dataset into speciﬁcdisk blocks on nearby tracks, called adjacent blocks,suchIEEE 23rd International Conference on Data Engineering (ICDE 2007) Istanbul, Turkey, April 2007.that they can be accessed for eq ual positioning cost andwithout any rotational latency. We describe a general al-gorithm for MultiMap and evaluate MultiMap on a proto-type implementation that uses a logical volume of real diskdrives with 3-D and 4-D datasets. The results show that,on average, MultiMap reduces total I/O time by over 50%when compared to the naive mapping and by over 30%when compared to space-ﬁllin g curve approaches.The remainder of the paper is organized as follows. Sec-tion 2 d escribes related work. Section 3 outlines the charac-teristics of modern disk technology that enable MultiMap.Section 4 introduces MultiMap in detail. Section 5 evalu-ates the performance of MultiMap, and section 6 concludes.2 Related workOrganizing multidimensional data for efﬁcient accesshas become increasingly important in both scientiﬁc com-puting and business support systems, where dataset sizes areterabytes or more. Queries on these datasets involve dataaccesses on d ifferent dimensions with various access pat-terns [10, 25, 29]. Data storage and management for mas-sive multidimensional data have two primary tasks: dataindexing, for quick location of the needed data, and dataplacement, which arranges data on storage devices for efﬁ-cient retrieval. There is a large body of previous work onthe two closely related but separate topics as they apply tomultidimensional datasets. Our work focuses on data place-ment which happens after indexing.Under the assumption that disks are one-dimensional de-vices, various data placement methods have been proposedin th e literature, such as the Naive mapping, described in theprevious section, and spacing-ﬁlling cu rve mappings utiliz-ing Z-ordering [15], Hilbert [11], or Gray-coded curve [7].Their goal is to order multidimen sional data such that spa-tial locality can be preserved as much as possible withinthe 1-D disk abstraction. The


School:
Email:
New Password:
Confirm Password:

This preview shows page 1-2-3 out of 10 pages.

Please select your school