U of M CSCI 8715 - Range Aggregate Processing in Spatial Databases

Unformatted text preview:

Range Aggregate Processingin Spatial DatabasesYufei Tao and Dimitris PapadiasAbstract—A range aggregate query returns summarized information about the points falling in a hyper-rectangle (e.g., the totalnumber of these points instead of their concrete ids). This paper studies spatial indexes that solve such queries efficiently andproposes the aggregate Point-tree (aP-tree), which achieves logarithmic cost to the data set cardinality (independently of the querysize) for two-dimensional data. The aP-tree requires only small modifications to the popular multiversion structural framework and,thus, can be implemented and applied easily in practice. We also present models that accurately predict the space consumption andquery cost of the aP-tree and are therefore suitable for query optimization. Extensive experiments confirm that the proposed methodsare efficient and practical.Index Terms—Database, spatial database, range queries, aggregation.æ1INTRODUCTIONTRADITIONAL research in spatial databases often aims atthe range query, which retrieves the data objects lyinginside (or intersecting) a multidimensional hyper-rectangle.In many scenarios (e.g., statistical analysis, data ware-houses, etc.), however, users are interested only insummarized information about such objects, instead oftheir individual properties. Consider, for example, a spatialdatabase managing the hotels in a city; a query for statisticalpurposes could ask for the number of hotels in a certaindistrict (rather than their respective locations), or theaverage rate per night of these hotels. Furthermore, insome applications such as traffic monitoring, results ofrange queries (e.g., finding all vehicles in the center) aremeaningless (due to frequent object movements), while theaggregate information (the number of vehicles) is usuallyfairly stable (i.e., approximately the same number ofvehicles enter and exit the region within a short period oftime) and measurable (e.g., through sensors across theregion [12]).Specifically, given a set S of points in the d-dimensionalspace, a range aggregate (RA) query returns a single valuethat summarizes the set R  S of points in a d-dimensionalhyper-rectangle q according to some aggregation function.Aggregation functions are divided into three classes [18]:distributive, algebraic, and holistic. Distributive aggregates(e.g., count, max, min, sum ) can be computed by partitioningthe input into disjoint sets, aggregating each set individu-ally and then obtaining the final result by further aggregat-ing the partial results. Algebraic aggregates can beexpressed as a function of distributive aggregates: average,for example, is defined as sum/count. Holistic aggregates(e.g., median), on the other hand, cannot be computed bydividing the input into parts. In this paper, we considerrange count queries on multidimensional data points, wherethe result is the size of R (e.g., the number of hotels in anarea q), but the solutions apply to any distributive, oralgebraic (but not to holistic) aggregates with straightfor-ward adaptation.1.1 MotivationA RA query can be trivially answered as an ordinary rangequery, i.e., by first retrieving the qualifying objects and thenaggregating their properties. This, however, incurs signifi-cant overhead since, intuitively, the objective is to retrieveonly a single value (as opposed to every qualifying object).A common solution is the aggregate index, which augments atraditional spatial access method with summarized infor-mation in intermediate entries (as elaborated in the nextsection). The most popular aggregate index is the aggregateR-tree (aR-tree) [21], [29], [26], which outperforms tradi-tional R-trees considerably on RA retrieval.Aggregate processing of multidimensional objects hasalso been studied theoretically, leading to several interestingresults [40], [16], [41]. As explained in the next section,despite their attractive theoretical bounds, the proposedmethods have limited practical applicability due to severalreasons. First, the asymptotical performance may contain(potentially large) hidden constants, which renders theactual performance of the structure prohibitively expensivein practice. Second, theoretical bounds are insufficient forquery optimization, which requires acc urat e analyti calmodels quantifying the actual space consumption and querycost. Third , the resulting index es either rely on somerestrictive assumptions, or have rather complicated struc-tures and, thus, require considerable implementation efforts.1.2 ContributionsWe first present an asymptotical performance analysis forthe aR-tre e proving that its average RA query co st isIEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 12, DECEMBER 2004 1555. Y. Tao is with the Department of Computer Science, City University ofHong Kong, Tat Chee Avenue, Hong Kong.E-mail: [email protected].. D. Papadias is with the Department of Computer Science, Hong KongUniversity of Science and Technology, Clear Water Bay, Hong Kong.E-mail: [email protected] received 9 Apr. 2003; revised 4 Oct. 2003; accepted 10 Nov. 2003.For information on obtaining reprints of this article, please send e-mail to:[email protected], and reference IEEECS Log Number TKDE-0031-0403.1041-4347/04/$20.00 ß 2004 IEEE Published by the IEEE Computer SocietyOððK=BÞðd01Þ=d0Þ, where K is the number of points in thequery window, B the disk page size, and d0the fractaldimensionality [14] of the data set. It is clear that the costdegrades as the query size increases and eventually reachesOððN=BÞðd01Þ=d0Þ,whereN is the data set cardinality.Motivated by this, we develop a new access method, theaggregate Point tree (aP-tree), which achieves logarithmiccost OðlogBNÞ for any query on two-dimensional data. TheaP-tree requires only small modifications of the well-studiedmultiversion structure, thus incurring minimum implemen-tation overhead. We also propose algorithms for bulkloadingand dynamically maintaining the aP-tree and derive aprobabilistic cost model that quantifies its actual space andquery overhead. Finally, we show that the combination of theaR and aP-trees leads to an index that solves RA queries inthree (or higher) dimensional spaces efficiently.Our discussion is based on the typical memory-diskhierarchy, where each I/O access transfers a page of B (i.e.,the page size) units of information from the disk to the mainmemory that contains at least B2pages (a reasonableassumption in


View Full Document

U of M CSCI 8715 - Range Aggregate Processing in Spatial Databases

Documents in this Course
Load more
Download Range Aggregate Processing in Spatial Databases
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Range Aggregate Processing in Spatial Databases and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Range Aggregate Processing in Spatial Databases 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?