New version page

Correlation Aware Database Designer for Materialized Views and Indexes

This preview shows page 1-2-3-4 out of 11 pages.

View Full Document
View Full Document

End of preview. Want to read all 11 pages?

Upload your study docs or become a GradeBuddy member to access this document.

View Full Document
Unformatted text preview:

CORADD: Correlation Aware Database Designer forMaterialized Views and IndexesHideaki Kimura George Huo Alexander Rasin Samuel Madden Stanley B. ZdonikBrown University Google, Inc. Brown University MIT CSAIL Brown [email protected] [email protected] [email protected] [email protected] [email protected] describe an automatic database design tool that exploits corre-lations between attributes when recommending materialized views(MVs) and indexes. Although there is a substantial body ofrelated work exploring how to select an appropriate set of MVsand indexes for a given workload, none of this work has exploredthe effect of correlated attributes (e.g., attributes encoding relatedgeographic information) on designs. Our tool identifies a set ofMVs and secondary indexes such that correlations between theclustered attributes of the MVs and the secondary indexes areenhanced, which can dramatically improve query performance.It uses a form of Integer Linear Programming (ILP) called ILPFeedback to pick the best set of MVs and indexes for givendatabase size constraints. We compare our tool with a state-of-the-art commercial database designer on two workloads, APB-1and SSB (Star Schema Benchmark—similar to TPC-H). Our resultsshow that a correlation-aware database designer can improve queryperformance up to 6 times within the same space budget whencompared to a commercial database designer.1. INTRODUCTIONCorrelations are extremely common in the attributes of real-world relational datasets. One reason for this is that databases tendto use many attributes to encode related information; for example,area codes, zip codes, cities, states, longitudes, and latitudes allencode spatial data, using slightly different representations, andthese attributes are highly correlated (e.g., a given city name usuallyoccurs in only one or two states.) Similar cases occur in manyapplications; for example, in a retail database, there might beproducts (e.g., cars) with manufacturers and models. In the caseof cars, a given model is likely made by only one manufacturer(e.g., Ford Escape) for a particular set of years (2000–2009) in aparticular set of countries (US), yet these are represented by fourdifferent attributes in the database. Correlations also occur dueto natural relationships between data; for example, in a weatherdatabase, high humidity and high temperatures are correlated, andsunny days are also correlated with hot days. Many other examplesare described in recent related work [3, 7, 11].Previous work has shown that the presence of correlationsbetween different attributes in a relation can have a significantimpact on query performance [7, 11]. If clustered index keys arewell-correlated with secondary index keys, looking up values onthe secondary index may be an order of magnitude faster than theuncorrelated case. As a simple example, consider the correlationbetween city names and state names in a table People (name, city,Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specificpermission and/or a fee. Articles from this volume were presented at The36th International Conference on Very Large Data Bases, September 13-17,2010, Singapore.Proceedings of the VLDB Endowment, Vol. 3, No. 1Copyright 2010 VLDB Endowment 2150-8097/10/09... $ 10.00.state, zipcode, salary) with millions of tuples. Suppose wehave a secondary index on city names and our query determinesthe average salary in “Cambridge,” a city in both Massachusettsand in Maine. If the table is clustered by state, which is stronglycorrelated with city name, then the entries of the secondary indexwill only point to a small fraction of the pages in the heap file (thosethat correspond to Massachusetts or Maine.) In the absence ofcorrelations, however, the Cantabrigians will be spread throughoutthe heap file, and our query will require reading many more pages(of course, techniques like bitmap scans, indexes with includedcolumns and materialized views can also be used to improveperformance of such queries.) Thus, even if we run the same queryon the same secondary index in the two cases, query performancecan be an order of magnitude faster with a correlated clusteredindex. We note that such effects are most significant in OLAP (data-warehouse) applications where queries may scan large ranges, incontrast to OLTP databases that tend to perform lookups of single-records by primary keys.Moreover, such correlations can be compactly represented ifattributes are strongly correlated [3, 11]. For example, in [11],we show that by storing only co-occurring distinct values of theclustered and secondary index keys, the secondary index can bedramatically smaller than conventional dense B+Trees, which storeone entry per tuple rather than one tuple per distinct value.This means that, by selecting clustered indexes that are well-correlated with predicated attributes, we can reduce the size andimprove the performance of secondary indexes built over thoseattributes. In many cases, these correlations make secondary indexplans a better choice than sequential scans.Although previous work has shown the benefit of correlationson secondary index performance, it has not shown how to au-tomatically select the best indexes for a given workload in thepresence of correlations. Conversely, previous work on automateddatabase design has not looked at accounting for correlations.Hence, in this work, we introduce a new database design toolnamed CORADD (CORrelation Aware Database Designer) thatis able to take into account attribute correlations. CORADDfirst discovers correlations among attributes and then uses thisinformation to enumerate candidate materialized views (MVs) andclustered indexes over them, selecting a set of candidates that offergood query performance in the presence of correlations. To selectan optimal set of MVs within a given space budget, CORADDchooses amongst the candidates using an optimized integer lin-ear programming (ILP) technique called ILP Feedback. Finally,it builds compressed secondary indexes on the MVs that takeadvantage of the strong correlations, offering good performanceespecially over warehouse-style workloads that


Loading Unlocking...
Login

Join to view Correlation Aware Database Designer for Materialized Views and Indexes and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Correlation Aware Database Designer for Materialized Views and Indexes and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?