Duke CPS 116 - Data Warehousing and Data Mining - D2087302

Home> Schools> Duke University> (CPS) > CPS 116> Data Warehousing and Data Mining

DOC PREVIEW

Duke CPS 116 - Data Warehousing and Data Mining

School name Duke University

Course Cps 116- Introduction to Database Systems

Pages 9

This preview shows page 1-2-3 out of 9 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 9 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 9 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 9 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 9 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

1Data Warehousing and Data MiningCPS 116Introduction to Database Systems2Announcements (November 25) Homework #3 graded Pick them up from Ying during her office hours Homework #4 due today Sample solution available next Tuesdayddb Course project demo period: December 8-13 Final exam next Saturday, Dec. 13, 7-10pm Again, open book, open notes Focus on the second half of the course Sample final next Tuesday Sample final solution available Thursday3Data integration Data resides in many distributed, heterogeneous OLTP (On-Line Transaction Processing) sources Sales, inventory, customer, … NC branch, NY branch, CA branch, …Nd t ppt OLAP(OLi A l ti l Need to support OLAP(On-Line Analytical Processing) over an integrated view of the data Possible approaches to integration Eager: integrate in advance and store the integrated data at a central repository called the data warehouse Lazy: integrate on demand; process queries over distributed sources—mediated or federated systems24OLTP versus OLAPOLTP Mostly updates Short, simple transactions Clerical usersOLAP Mostly reads Long, complex queries Analysts, decision makers Goal: ACID, transaction throughput Goal: fast queriesImplications on database design and optimization?5Eager versus lazy integrationEager (warehousing) In advance: before queries Copy data from sourcesLazy On demand: at query time Leave data at sources) Answer could be stale ) Answer is more up-to-date) Need to maintain consistency) Query processing is local to the warehouse Faster Can operate when sources are unavailable) No need to maintain consistency) Sources participate in query processing Slower Interferes with local processing6Maintaining a data warehouse The “ETL” process Extraction: extract relevant data and/or changes from sources Transformation: transform data to match the warehouse schema Loading: integrate data/changes into the warehouse Approaches Recomputation• Easy to implement; just take periodic dumps of the sources, say, every night Incremental maintenance• Compute and apply only incremental changes• Fast if changes are small• Not easy to do for complicated transformations• Need to detect incremental changes at the sources37“Star” schema of a data warehouseDimension tableDimension tableF blProductStoreSaleOID date CID PID SID qty pricePID name costp1 beer 10p2 diaper 16…… …SID citys1 Durhams2 Chapel Hills3 RTP…… Big Constantly growing Stores measures (often aggregated in queries)Dimension tableFact table Small Updated infrequentlySaleCustomer100 11/23/2007 c3 p1 s1 1 12102 12/12/2007 c3 p2 s1 2 17105 12/24/2007 c5 p1 s3 5 13…… ……………CID name address cityc3 Amy 100 Main St. Durhamc4 Ben 102 Main St. Durhamc5 Coy 800 Eighth St. Durham…… … …8Data cubeProduct(c3, p2, s1) = 2(c5, p1, s3) = 5Simplified schema: Sale (CID, PID, SID, qty)CustomerStoreALLp1p2s1s2s3c3 c4 c5(c5, p1, s1) = 3(c3, p1, s1) = 19ProductCompleting the cube—plane(ALL, p1, s3) = 5(ALL, p2, s1) = 2Total quantity of sales for each product in each store(c3, p2, s1) = 2(c5, p1, s3) = 5SELECT PID, SID, SUM(qty) FROM SaleGROUP BY PID, SID;CustomerStore(ALL, p1, s1) = 4ALLp1p2s1s2s3c3 c4 c5(c5, p1, s1) = 3(c3, p1, s1) = 1Project all points onto Product-Store plane410Completing the cube—axis(ALL, p1, s3) = 5(ALL, p2, s1) = 2Total quantity of sales for each product(c3, p2, s1) = 2(c5, p1, s3) = 5SELECT PID, SUM(qty) FROM Sale GROUP BY PID;Product(ALL, p2, ALL)= 2(ALL, p1, ALL)= 9(ALL, p1, s1) = 4ALLp1p2s1s2s3c3 c4 c5(c5, p1, s1) = 3(c3, p1, s1) = 1Further project points onto Product axisCustomerStore11ProductCompleting the cube—origin(ALL, p1, s3) = 5(ALL, p2, s1) = 2Total quantity of sales(c3, p2, s1) = 2(c5, p1, s3) = 5SELECT SUM(qty) FROM Sale;CustomerStore(ALL, p1, s1) = 4ALLp1p2s1s2s3c3 c4 c5(c5, p1, s1) = 3(c3, p1, s1) = 1Further project points onto the origin(ALL, p2, ALL)= 2(ALL, p1, ALL)= 9(ALL, ALL, ALL) = 1112CUBE operator Sale (CID, PID, SID, qty) Proposed SQL extension:SELECT SUM(qty) FROM SaleGROUP BY CUBE CID, PID, SID; Output contains: Normal groups produced by GROUP BY• (c1, p1, s1, sum), (c1, p2, s3, sum), etc. Groups with one or more ALL’s• (ALL, p1, s1, sum), (c2, ALL, ALL, sum), (ALL, ALL, ALL, sum), etc. Can you write a CUBE query using only GROUP BY’s?Gray et al., “Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Total.” ICDE 1996513Automatic summary tables Computing GROUP BY and CUBE aggregates is expensive OLAP queries perform these operations over and over again) Idea: precompute and store the aggregates as automatic summary tables (a DB2 term) Maintained automatically as base data changes Same as materialized views14Aggregation view latticeGROUP BY ∅GROUP BYCIDGROUP BYPIDGROUP BYSIDRoll upGROUP BYCID, PID, SIDGROUP BYCID, PIDGROUP BYCID, SIDGROUP BYPID, SIDA parent can becomputed from any childDrill down15Selecting views to materialize Factors in deciding what to materialize What is its storage cost? What is its update cost? Which queries can benefit from it?How much can a query benefit from it?How much can a query benefit from it? Example GROUP BY ∅ is small, but not useful to most queries GROUP BY CID, PID, SID is useful to any query, but too large to be beneficialHarinarayan et al., “Implementing Data Cubes Efficiently.” SIGMOD1996616Data mining Data → knowledge DBMS meets AI and statistics Clustering, prediction (classification and regression), association analysis, outlier analysis, evolution y, y,analysis, etc. Usually complex statistical “queries” that are difficult to answer → often specialized algorithms outside DBMS We will focus on frequent itemset mining17Mining frequent itemsets Given: a large database of transactions, each containing a set of items Example: market basketsTID itemsT001 diaper, milk, candyT002 milk, eggT003 milk, beerT004 diaper, milk, egg Find all frequent itemsets A set of items X is frequent if no less than smin% of all transactions contain X Examples: {diaper, beer}, {scanner, color printer}T005 diaper, beerT006 milk, beerT007 diaper, beerT008 diaper, milk, beer, candyT009 diaper, milk, beer……18First try A naïve algorithm Keep a running count for each possible itemset For each

View Full Document