DOC PREVIEW
Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals

This preview shows page 1-2-3 out of 8 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 8 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 8 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 8 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 8 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals Jim Gray Microsoft Adam Bosworth Microsoft Andrew Layman Microsoft Hamid Pirahesh IBM [email protected] [email protected] AndrewL@MicrosotLcom [email protected] Abstract: Data analysis applications typ&ally aggregate data across many dimensions looking for unusual patterns. The SQL aggregate functions and the GROUP BY operator produce zero-dimensional or one-dimensional answers. Applications need the N-dimensional generalization of these operators. This paper defines that operator, called the data cube or simply cube. The cube operator general- izes the histogram, cross-tabulation, roll-up, drill-down, and sub-total constructs found in most report writers. The cube treats each of the N aggregation attributes as a di- mension of N-space. The aggregate of a particular set of attribute values is a point in this space. The set of points forms an N-dimensional cube. Super-aggregates are com- puted by aggregating the N-cube to lower dimensional spaces. Aggregation points are represented by an "infinite value", ALL, so the point (ALL, ALL,...,ALL, sum(*)) repre- sents the global sum of all items. Each ALL value actually represents the set of values contributing to that aggrega- tion. I. Introduction Data analysis applications look for unusual patterns in data. They summarize data values, extract statistical information, and then contrast one category with another. There are two steps to such data analysis: extracting the aggregated data from the database into a file or table, and visualizing the results in a graphical way. Visualization tools display trends, clusters, and differences. The most exciting work in data analysis focuses on pre- senting new graphical metaphors that allow people to dis- cover data trends and anomalies. Many tools represent the dataset as an N-dimensional space. Two and three- dimensional sub-slabs of this space are rendered as 2D or 3D objects. Color and time (motion) add two more dimen- sions to the display giving the potential for a 5D display. How do traditional relational databases fit into this picture? How can flat files (SQL tables) possibly model an N- dimensional problem? Relational systems model N- dimensional data as a relation with N-attribute domains. For example, 4-dimensional earth-temperature data is typi- cally represented by a Weather table shown below. The first four columns represent the four dimensions: x, y, z, t. Additional columns represent measurements at the 4D points such as temperature, pressure, humidity, and wind velocity. Often these measured values are aggregates over time (the hour) or space (a measurement area). Table l:Weather Time (UCT) Latitude Longitude %ititude Tem; Pres (m) (c) (mb) 27/ii/94:150( 37:58:33N 122:45:28W 102 21 1009 27/ii/94:150( 34:16:18N 27:05:55W" I0 23 11024 The SQL standard provides five functions to aggregate the values in a table: COUNT ( ), SUM ( ), MIN ( ), ~ ( ), and AVG(). For example, the average of all measured temperatures is expressed as: SELECT AVG (Temp) FROM Weather; In addition, SQL allows aggregation over distinct values. The following query counts the distinct number of report- ing times in the Weather table: SELECT COUNT (DISTINCT Time) FROM Weather; Many SQL systems add statistical functions (median, stan- dard deviation, variance, etc.), physical functions (center of mass, angular momentum, etc.), financial analysis (volatility, Alpha, Beta, etc.), and other domain-specific functions, Some systems allow users to add new aggregation func- tions. The Illustra system, for example, allows users to add aggregate functions by adding a program with the following three callbacks to the database system [lilustra]: Init (&handle) : Allocates the handle and initializes the aggregate computation. Iter (&handle, value) : Aggregates the next value into the current aggregate. value, = Final (&handle) : Computes and returns the resulting aggregate by using data saved in the handle. This invocation deallocates the handle. Consider implementing the Average() function. The handle stores the count and the sum initialized to zero. When passed a new non-null value, Iter ( ) increments the count and adds the sum to the value. The Final () call dealiocates the handle and returns sum divided by count. 152 1063-6382/96 $5.00 © 1996 IEEEAggregate functions return a single value. Using the GROUP lay construct, SQL can also create a table of many aggregate values indexed by a set of attributes. For exam- ple, The following query reports the average temperature for each reporting time and altitude: SELECT Time, Altitude, gVG(Temp) FROM Weather GROUP BY Time, Altitude; GROUP BY is an unusual relational operator: It partitions the relation into disjoint tuple sets and then aggregates over each set as illustrated in Figure 1. Aggregate Values I--] I--1 Figure !: The GROUP BY relational operator partitions a table into groups. Each group is then aggregated by a function. The aggregation function summarizes some col- umn of groups returning a value for each group. Red Brick systems added some interesting aggregate func- tions that enhance the GROUP BY mechanism [Red Brick]: Rank (expression): returns the expression's rank in the set of all values of this domain of the table. If there are N values in the column, and this is the highest value, the rank is N, if it is the lowest value the rank is 1. N_tile(expression, n): The range of the expression (over all the input values of the table) is computed and divided into n value ranges of approximately equal population. The function returns the number of the range holding the value of the expression. If your bank account was among the largest 10% then your rank (account.balance, 10) would return 10. Red Brick provides just N_tile (expression, 3). Ratio To Total (expression): Sums all the expres- sions and then divides the expression by the total sum. To give an example: SELECT Percentile, MIN (Temp) , MAX (Temp) FROM Weather GROUP BY N_tile (Temp, 10) as Percentile HAVING Percentile = 5; returns one row giving the minimum and maximum tem-


Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals

Download Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?