Duke CPS 116 - Data Warehousing - D1872140

Home> Schools> Duke University> (CPS) > CPS 116> Data Warehousing

DOC PREVIEW

Duke CPS 116 - Data Warehousing

School name Duke University

Course Cps 116- Introduction to Database Systems

Pages 10

This preview shows page 1-2-3 out of 10 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 10 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 10 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 10 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 10 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

1Data WarehousingCPS 216Advanced Database Systems2ReviewData warehousing: integrating data for OLAP• OLAP versus OLTP• Warehousing versus mediation• Warehouse maintenance– Warehouse data as materialized views– Recomputation versus incremental maintenance– Self-maintenance3Today• Star, snowflake, and cube• ROLAP and MOLAP algorithms24Star schemaOID date CID PID SID qty price100 11/23/2001 c3 p1 s1 1 12102 12/12/2001 c3 p2 s1 2 17105 12/24/2001 c5 p1 s3 5 13…………………SaleProductPID name costp1 beer 10p2 diaper 16…… …StoreSID citys1 Durhams2 Chapel Hills3 RTP……CID name address cityc3 Amy 100 Main St. Durhamc4 Ben 102 Main St. Durhamc5 Carl 800 Eighth St. Durham…… … …CustomerFact table•Big• Constantly growing• Stores measures(often aggregated inqueries)Dimension table•Small• Updated infrequentlyDimension tableDimension table5Dimension hierarchiesPID brandID typeIDp1 b1 t1p2 b2 t2p3 b1 t4…… …brandID nameb1 Huggiesb2 Budwiser……typeID name catIDt1 diaper cat7t2 beer cat9t3 juice cat9t4 pacifier cat7………catID namecat7 babiescat9 beverages……ProductBrandProduct type Product category+ star schema = snowflake schema6Star join indexes• Queries frequently join fact table with dimension tables» Materialize the join result to speed up queries• For each combination of dimension attribute values, store the list of tuple ID’s in the fact table– Brand name, store city, customer city → sales records;Product type, store city → sales records; etc.– Conceptually, multi-attribute indexes on the join result• One index to support each combination of selection conditions on attributes?– Too many indexes!37Bitmap join indexes» O’Neil & Quass, SIGMOD 1997• Bitmap and projection indexes for each dimension attribute– Value of the dimension attribute ↔ tuple ID’s in the fact table• To process an arbitrary combination of selection conditions, use bitmap indexes– Bitmaps can be combined efficiently• To retrieve attribute values for output, use projection indexes8Data cubeCustomerStoreProductALLp1p2s1s2s3c3 c4 c5(c3, p2, s1) = 2(c5, p1, s3) = 5Simplified schema: Sale(CID, PID, SID, qty)(c5, p1, s1) = 3(c3, p1, s1) = 19(ALL, p1, s3) = 5(ALL, p2, s1) = 2(ALL, p1, s1) = 4Completing the cube (slide 1)Total quantity of sales for each product in each storeCustomerStoreProductALLp1p2s1s2s3c3 c4 c5(c3, p2, s1) = 2(c5, p1, s3) = 5(c5, p1, s1) = 3(c3, p1, s1) = 1SELECT SUM(qty) FROM SaleGROUP BY PID, SID;Project all points onto Product-Store plane410(ALL, p2, ALL)= 2(ALL, p1, ALL)= 9(ALL, p1, s3) = 5(ALL, p2, s1) = 2(ALL, p1, s1) = 4Completing the cube (slide 2)Total quantity of sales for each productCustomerStoreProductALLp1p2s1s2s3c3 c4 c5(c3, p2, s1) = 2(c5, p1, s3) = 5(c5, p1, s1) = 3(c3, p1, s1) = 1SELECT SUM(qty) FROM Sale GROUP BY PID;Further project points onto Product axis11Completing the cube (slide 3)(ALL, p2, ALL)= 2(ALL, p1, ALL)= 9(ALL, p1, s3) = 5(ALL, p2, s1) = 2(ALL, p1, s1) = 4Total quantity of salesCustomerStoreProductALLp1p2s1s2s3c3 c4 c5(c3, p2, s1) = 2(c5, p1, s3) = 5(c5, p1, s1) = 3(c3, p1, s1) = 1SELECT SUM(qty) FROM Sale;Further project points onto the origin(ALL, ALL, ALL) = 1112CUBE operator» Gray et al., ICDE 1996• Sale(CID, PID, SID, qty)• Proposed SQL extension:SELECT SUM(qty) FROM SaleGROUP BY CUBE CID, PID, SID;• Output contains:– Normal groups produced by GROUP BY• (c1, p1, s1, sum), (c1, p2, s3, sum), etc.– Groups with one or more ALL’s• (ALL, p1, s1, sum), (c2, ALL, ALL, sum), (ALL, ALL, ALL, sum), etc.• Can you write a CUBE query using only GROUP BY’s?513ROLLUP operator• Sometimes CUBE is too much– (…, state, city, street, …, age, DOB, …)– CUBE state, city, street returns meaningless groups• (ALL, ALL, ’Main Street’): sales on any Main Street?– CUBE age, DOB returns useless groups• (ALL, DOB): DOB functionally determines age!• Proposed SQL extension:GROUP BY ROLLUP state, city, street;• Output contains groups with ALL’s only as suffix– (’NC’, ’Durham’, ’Main Street’), (’NC’, ’Durham’, ALL), (’NC’, ALL, ALL), (ALL, ALL, ALL)– But not (ALL, ALL, ’Main Street’) or (ALL, ’Durham’, ALL)14Computing GROUP BY• ROLAP (Relational OLAP)– Use standard relational engine– Sorting and clustering– Using indexes– Automatic summary tables• MOLAP (Multidimensional OLAP)– Use a sparse multidimensional array15Sorting and clustering• Sort (or cluster, e.g., using hashing) tuplesaccording to GROUP BY attributes– Tuples in the same group are processed together– Only one intermediate aggregate result needs to be kept—low memory requirement• What if GROUP BY attributes ≠ sort attributes?– Still fine if GROUP BY attributes form a prefix of the sort order– Otherwise, need to keep intermediate aggregate results around616More on sort order• Sort by the order in which GROUP BY attributes appear?– Not necessary; e.g., GROUP BY PID, SID can be processed just as efficiently by sorting on SID, PID• Sort by the order in which GROUP BY ROLLUP attributes appear?– Useful; e.g., GROUP BY ROLLUP state, city, street can be processed efficiently by sorting on state, city, street, but not by sorting on street, city, state17Using bitmap join indexes» O’Neil & Quass, SIGMOD 1997• Use the bitmap join indexes on GB1, GB2, …, GBk• For each value v1of GB1in order:For each value v2of GB2in order: …For each value vkof GBkin order:Intersect bitmaps to locate tuples;Retrieve their measures;Calculate aggregate for group (v1, v2, …, vk);• Helps if data is sorted by GB1, GB2, …, GBk– So measures in the same group are clustered18Automatic summary tables• Computing GROUP BY aggregates is expensive• OLAP queries perform GROUP BY all the time• Idea: precompute and store the aggregates!» Automatic summary tables– Maintained automatically as base data changes– Just another index/materialized view719Aggregation view latticeGROUP BY ∅GROUP BYCID, PID, SIDGROUP BYCIDGROUP BYPIDGROUP BYSIDGROUP BYCID, PIDGROUP BYCID, SIDGROUP BYPID, SIDA child can becomputed from any parent20Selecting views to materialize• Factors in deciding what to materialize– What is its storage cost?– What is its update cost?– Which queries can benefit from it?– How much can a query benefit from it?• Example– GROUP BY ∅ is small, but not useful to most queries– GROUP BY CID, PID, SID is useful to any

View Full Document


School:
Email:
New Password:
Confirm Password:

This preview shows page 1-2-3 out of 10 pages.

Duke CPS 116 - Data Warehousing

Sign up for free to view:

Please select your school