U of M CSCI 8715 - Selectivity Estimation in Spatial Databases

Unformatted text preview:

Selectivity Estimation in Spatial Databases Swarup Acharya Viswanath Poosala Sridhar Ramaswamy* Information Sciences Research Center Bell Laboratories, Lucent Technologies 600, Mountain Avenue, Murray Hill, NJ, USA {swarup,poosala}@research.bell-labs.com, {sridhar}@epiphany.com Abstract Selectivity estimation of queries is an important and well- studied problem in relational database systems. In this paper, we examine selectivity estimation in the context of Geographic Information Systems, which manage spatial data such as points, lines, poly-lines and polygons. In particular, we focus on point and range queries over two-dimensional rectangular data. We propose several techniques based on using spatial indices, histograms, binary space partitionings (BSPs), and the novel notion of spatial skew. Our techniques carefully partition the input rectangles into subsets and approximate each partition accurately. We present a detailed experimental study comparing the proposed techniques and the best known sampling and parametric techniques. We evaluate them using synthetic as well as real-life TIGER datasets. Based on our experiments, we identify a BSP based partitioning that we call Min-Skew which consistently provides the most accurate selectivity estimates for spatial queries. The Min- Skew partitioning can be constructed efficiently, occupies very little space, and provides accurate selectivity estimates over a broad range of spatial queries. 1 Introduction Geographic Information Systems (GISs) have gener- ated enormous interest in the commercial and research database communities over the last decade. GISs typi- cally store and manage spatial data such as points, lines, poly-lines, polygons, and surfaces and hence are of- ten referred to as spatial databases. Several commer- cial database systems that manage spatial data are now available. These include ESRI’s ARC/INFO [ARC93], InterGraph’s MGE [Int97], MapInfo [Map981 and In- formix [Ube94]. Most other leading database vendors Work done while the author was at Bell Labs. Current affiliation is Epiphany Inc., 2300 Geng Road, Suite 200, Palo Alto CA 94303. Permission to make digital or hard copies ofall or part ofthis work fi,l Personal or classroom USC is granted without fee plovidcd that copies are wt made or distributed for profit or commercial ndvanttage arid th:,r copies bear this notice and the full citation 011 the first page. ‘f. copv ()tl~cr\~ise, to republish, to post on servers or to rcdistrihutc to lists. rwircs prior specific permission and/or a fee, SIGMOD ‘W Philadciphi~ PA Copyright ACM 1999 I-581 13-084-8/99/05...$~.00 either offer some kind of support for spatial data or are in the process of providing such support. GISs have also been the focus of much research, mostly towards efficient manipulation and access[Sam89a, Sam89bl and more recently, towards research prototypes[DeW94, GRSS97]. As in relational database systems, there are many modules of a spatial database system that require ac- curate estimates of query result sizes. Such estimates are used in a variety of ways. For example, query opti- mizers use query result size estimates to determine the most efficient way to execute queries [SAC+79]. Esti- mates are also used by database systems to give users feedback about the running times of their queries before the queries are actually executed. Since it is impractical to run the entire query to compute the result sizes, most commercial systems use some form of statistics to ap- proximate the underlying data and estimate result sizes based on these statistics. A variety of techniques have been proposed in the literature to compute estimates of query result sizes in relational databases. The most common ones use histogrums[Koo80, Poo97], sampZes[LNS90], or are based on parametric techniques that model the data via a standard mathematical distribution[CR94, BF95]. Of the various techniques, histograms, in particular, have proved very popular in database systems because they can be computed efficiently, use very little space (on the order of a few hundred bytes per relation), and do not require that the input distribution be known in advance.a They work by partitioning the input into a small number of subsets called “buckets”, and by using approximations for each bucket to model the distribution of tuples within. Query result estimates are then obtained by processing the query against the buckets and the approximations used therein. The problem of selectivity estimation for spatial data is very different from the relational selectivity aHowever, one can construct distributions that cannot be approxi- mated well using histograms in a small amount of space. 13estimation problems that have been studied extensively in the database literature. Most previous works have focused on approximating the distributions of single numerical attributes. Even the ones studying multi- dimensional data [PI97, MPS99] have concentrated on approximating theji-equencies of points in space. Many of these techniques are aimed at data with highly skewed frequencies. However, they do not perform as well when the frequency domain is relatively uniform but the value domain (i.e., placement of points in space) is skewed. Spatial selectivity estimation differs in two impor- tant aspects from traditional selectivity estimation: (i) The individual spatial entities may differ in shape and size; (ii) The distribution of frequencies over the in- put domain does not vary dramatically in spatial data, whereas the values are spread non-uniformly in space. (This corresponds to the fact that not many spatial en- tities cover the same point.) Hence, the problem of ap- proximating spatial data requires techniques for accu- rately approximating the value domain of a distribution. Even in the context of uni-dimensional point data, this has been known to be one of the most difficult prob- lems in selectivity estimation [HNSS95, GG82, Poo97]. To the


View Full Document

U of M CSCI 8715 - Selectivity Estimation in Spatial Databases

Documents in this Course
Load more
Download Selectivity Estimation in Spatial Databases
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Selectivity Estimation in Spatial Databases and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Selectivity Estimation in Spatial Databases 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?