1Data Warehousing and Data MiningCPS 116Introduction to Database Systems2Announcements (November 25) Homework #3 graded Pick them up from Ying during her office hours Homework #4 due today Sample solution available next Tuesdayddb Course project demo period: December 8-13 Final exam next Saturday, Dec. 13, 7-10pm Again, open book, open notes Focus on the second half of the course Sample final next Tuesday Sample final solution available Thursday3Data integration Data resides in many distributed, heterogeneous OLTP (On-Line Transaction Processing) sources Sales, inventory, customer, … NC branch, NY branch, CA branch, …Nd t ppt OLAP(OLi A l ti l Need to support OLAP(On-Line Analytical Processing) over an integrated view of the data Possible approaches to integration Eager: integrate in advance and store the integrated data at a central repository called the data warehouse Lazy: integrate on demand; process queries over distributed sources—mediated or federated systems24OLTP versus OLAPOLTP Mostly updates Short, simple transactions Clerical usersOLAP Mostly reads Long, complex queries Analysts, decision makers Goal: ACID, transaction throughput Goal: fast queriesImplications on database design and optimization?5Eager versus lazy integrationEager (warehousing) In advance: before queries Copy data from sourcesLazy On demand: at query time Leave data at sources) Answer could be stale ) Answer is more up-to-date) Need to maintain consistency) Query processing is local to the warehouse Faster Can operate when sources are unavailable) No need to maintain consistency) Sources participate in query processing Slower Interferes with local processing6Maintaining a data warehouse The “ETL” process Extraction: extract relevant data and/or changes from sources Transformation: transform data to match the warehouse schema Loading: integrate data/changes into the warehouse Approaches Recomputation• Easy to implement; just take periodic dumps of the sources, say, every night Incremental maintenance• Compute and apply only incremental changes• Fast if changes are small• Not easy to do for complicated transformations• Need to detect incremental changes at the sources37“Star” schema of a data warehouseDimension tableDimension tableF blProductStoreSaleOID date CID PID SID qty pricePID name costp1 beer 10p2 diaper 16…… …SID citys1 Durhams2 Chapel Hills3 RTP…… Big Constantly growing Stores measures (often aggregated in queries)Dimension tableFact table Small Updated infrequentlySaleCustomer100 11/23/2007 c3 p1 s1 1 12102 12/12/2007 c3 p2 s1 2 17105 12/24/2007 c5 p1 s3 5 13…… ……………CID name address cityc3 Amy 100 Main St. Durhamc4 Ben 102 Main St. Durhamc5 Coy 800 Eighth St. Durham…… … …8Data cubeProduct(c3, p2, s1) = 2(c5, p1, s3) = 5Simplified schema: Sale (CID, PID, SID, qty)CustomerStoreALLp1p2s1s2s3c3 c4 c5(c5, p1, s1) = 3(c3, p1, s1) = 19ProductCompleting the cube—plane(ALL, p1, s3) = 5(ALL, p2, s1) = 2Total quantity of sales for each product in each store(c3, p2, s1) = 2(c5, p1, s3) = 5SELECT PID, SID, SUM(qty) FROM SaleGROUP BY PID, SID;CustomerStore(ALL, p1, s1) = 4ALLp1p2s1s2s3c3 c4 c5(c5, p1, s1) = 3(c3, p1, s1) = 1Project all points onto Product-Store plane410Completing the cube—axis(ALL, p1, s3) = 5(ALL, p2, s1) = 2Total quantity of sales for each product(c3, p2, s1) = 2(c5, p1, s3) = 5SELECT PID, SUM(qty) FROM Sale GROUP BY PID;Product(ALL, p2, ALL)= 2(ALL, p1, ALL)= 9(ALL, p1, s1) = 4ALLp1p2s1s2s3c3 c4 c5(c5, p1, s1) = 3(c3, p1, s1) = 1Further project points onto Product axisCustomerStore11ProductCompleting the cube—origin(ALL, p1, s3) = 5(ALL, p2, s1) = 2Total quantity of sales(c3, p2, s1) = 2(c5, p1, s3) = 5SELECT SUM(qty) FROM Sale;CustomerStore(ALL, p1, s1) = 4ALLp1p2s1s2s3c3 c4 c5(c5, p1, s1) = 3(c3, p1, s1) = 1Further project points onto the origin(ALL, p2, ALL)= 2(ALL, p1, ALL)= 9(ALL, ALL, ALL) = 1112CUBE operator Sale (CID, PID, SID, qty) Proposed SQL extension:SELECT SUM(qty) FROM SaleGROUP BY CUBE CID, PID, SID; Output contains: Normal groups produced by GROUP BY• (c1, p1, s1, sum), (c1, p2, s3, sum), etc. Groups with one or more ALL’s• (ALL, p1, s1, sum), (c2, ALL, ALL, sum), (ALL, ALL, ALL, sum), etc. Can you write a CUBE query using only GROUP BY’s?Gray et al., “Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Total.” ICDE 1996513Automatic summary tables Computing GROUP BY and CUBE aggregates is expensive OLAP queries perform these operations over and over again) Idea: precompute and store the aggregates as automatic summary tables (a DB2 term) Maintained automatically as base data changes Same as materialized views14Aggregation view latticeGROUP BY ∅GROUP BYCIDGROUP BYPIDGROUP BYSIDRoll upGROUP BYCID, PID, SIDGROUP BYCID, PIDGROUP BYCID, SIDGROUP BYPID, SIDA parent can becomputed from any childDrill down15Selecting views to materialize Factors in deciding what to materialize What is its storage cost? What is its update cost? Which queries can benefit from it?How much can a query benefit from it?How much can a query benefit from it? Example GROUP BY ∅ is small, but not useful to most queries GROUP BY CID, PID, SID is useful to any query, but too large to be beneficialHarinarayan et al., “Implementing Data Cubes Efficiently.” SIGMOD1996616Data mining Data → knowledge DBMS meets AI and statistics Clustering, prediction (classification and regression), association analysis, outlier analysis, evolution y, y,analysis, etc. Usually complex statistical “queries” that are difficult to answer → often specialized algorithms outside DBMS We will focus on frequent itemset mining17Mining frequent itemsets Given: a large database of transactions, each containing a set of items Example: market basketsTID itemsT001 diaper, milk, candyT002 milk, eggT003 milk, beerT004 diaper, milk, egg Find all frequent itemsets A set of items X is frequent if no less than smin% of all transactions contain X Examples: {diaper, beer}, {scanner, color printer}T005 diaper, beerT006 milk, beerT007 diaper, beerT008 diaper, milk, beer, candyT009 diaper, milk, beer……18First try A naïve algorithm Keep a running count for each possible itemset For each
View Full Document