Purdue CS 59000 - Data Mining - D1839256

Home> Schools> Purdue University> Computer Sciences (CS) > CS 59000> Data Mining

DOC PREVIEW

Purdue CS 59000 - Data Mining

School name Purdue University

Course Cs 59000- Topics in Computer Sciences

Pages 67

This preview shows page 1-2-3-4-31-32-33-34-35-64-65-66-67 out of 67 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 67 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 67 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 67 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 67 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 67 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 67 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 67 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 67 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 67 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 67 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 67 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 67 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 67 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 67 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

1CS590D: Data MiningProf. Chris CliftonJanuary 25, 2005Association RulesCS590D 2Mining Association Rules in Large Databases• Association rule mining• Algorithms for scalable mining of (single-dimensional Boolean) association rules in transactional databases• Mining various kinds of association/correlation rules • Constraint-based association mining• Sequential pattern mining• Applications/extensions of frequent pattern mining• Summary2CS590D 3What Is Association Mining?• Association rule mining:– Finding frequent patterns, associations, correlations, or causalstructures among sets of items or objects in transaction databases, relational databases, and other information repositories.– Frequent pattern: pattern (set of items, sequence, etc.) that occurs frequently in a database [AIS93]• Motivation: finding regularities in data– What products were often purchased together? — Beer and diapers?!– What are the subsequent purchases after buying a PC?– What kinds of DNA are sensitive to this new drug?– Can we automatically classify web documents?CS590D 4Why Is Association Mining Important?• Foundation for many essential data mining tasks– Association, correlation, causality– Sequential patterns, temporal or cyclic association, partial periodicity, spatial and multimedia association– Associative classification, cluster analysis, iceberg cube, fascicles (semantic data compression)• Broad applications– Basket data analysis, cross-marketing, catalog design, sale campaign analysis– Web log (click stream) analysis, DNA sequence analysis, etc.35Basic Concepts:Association RulesB, E, F40A, D30A, C20A, B, C10Items boughtTransaction-id • Itemset X={x1, …, xk}• Find all the rules XY with min confidence and support– support, s, probability that a transaction contains X∪Y– confidence, c, conditional probability that a transaction having X also contains Y.Let min_support = 50%, min_conf = 50%:A  C (50%, 66.7%)C  A (50%, 100%)Customerbuys diaperCustomerbuys bothCustomerbuys beerCS590D 6Mining Association Rules:ExampleFor rule A ⇒ C:support = support({A}∪{C}) = 50%confidence = support({A}∪{C})/support({A}) = 66.6%Min. support 50%Min. confidence 50%B, E, F40A, D30A, C20A, B, C10Items boughtTransaction-id50%{A, C}50%{C}50%{B}75%{A}SupportFrequent pattern4CS590D 7Mining Association Rules:What We Need to Know• Goal: Rules with high support/confidence• How to compute?– Support: Find sets of items that occur frequently– Confidence: Find frequency of subsets of supported itemsets• If we have all frequently occurring sets of items (frequent itemsets), we can compute support and confidence!CS590D 8Mining Association Rules in Large Databases• Association rule mining• Algorithms for scalable mining of (single-dimensional Boolean) association rules in transactional databases• Mining various kinds of association/correlation rules • Constraint-based association mining• Sequential pattern mining• Applications/extensions of frequent pattern mining• Summary5CS590D 9Apriori: A Candidate Generation-and-Test Approach• Any subset of a frequent itemset must be frequent– if {beer, diaper, nuts} is frequent, so is {beer, diaper}– Every transaction having {beer, diaper, nuts} also contains {beer, diaper} • Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested!• Method: – generate length (k+1) candidate itemsets from length k frequent itemsets, and– test the candidates against DB• Performance studies show its efficiency and scalability• Agrawal & Srikant 1994, Mannila, et al. 199410Frequency ≥ 50%, Confidence 100%:A  CB  EBC  ECE  BBE  CThe Apriori Algorithm—An ExampleDatabase TDB1stscanC1L1L2C2C22ndscanC3L33rdscanB, E40A, B, C, E30B, C, E20A, C, D10ItemsTid1{D}3{E}3{C}3{B}2{A}supItemset3{E}3{C}3{B}2{A}supItemset{C, E}{B, E}{B, C}{A, E}{A, C}{A, B}Itemset1{A, B}2{A, C}1{A, E}2{B, C}3{B, E}2{C, E}supItemset2{A, C}2{B, C}3{B, E}2{C, E}supItemset{B, C, E}Itemset2{B, C, E}supItemset6CS590D 11The Apriori Algorithm• Pseudo-code:Ck: Candidate itemset of size kLk: frequent itemset of size kL1= {frequent items};for (k = 1; Lk!=∅; k++) do beginCk+1= candidates generated from Lk;for each transaction t in database doincrement the count of all candidates in Ck+1that are contained in tLk+1= candidates in Ck+1with min_supportendreturn∪kLk;CS590D 12Important Details of Apriori• How to generate candidates?– Step 1: self-joining Lk– Step 2: pruning• How to count supports of candidates?• Example of Candidate-generation– L3={abc, abd, acd, ace, bcd}– Self-joining: L3*L3• abcd from abc and abd• acde from acd and ace– Pruning:• acde is removed because ade is not in L3– C4={abcd}7CS590D 13How to Generate Candidates?• Suppose the items in Lk-1are listed in an order• Step 1: self-joining Lk-1insert into Ckselect p.item1, p.item2, …, p.itemk-1, q.itemk-1from Lk-1p, Lk-1 qwhere p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1• Step 2: pruning∀ itemsets c in Ckdo∀ (k-1)-subsets s of c doif (s is not in Lk-1) then delete c from CkCS590D 14How to Count Supports of Candidates?• Why counting supports of candidates a problem?– The total number of candidates can be very huge– One transaction may contain many candidates• Method:– Candidate itemsets are stored in a hash-tree– Leaf node of hash-tree contains a list of itemsets and counts– Interior node contains a hash table– Subset function: finds all the candidates contained in a transaction8CS590D 15Example: Counting Supports of Candidates1,4,72,5,83,6,9Subset function2 3 45 6 71 4 51 3 61 2 44 5 71 2 54 5 81 5 93 4 53 5 63 5 76 8 93 6 73 6 8Transaction: 1 2 3 5 61 + 2 3 5 61 2 + 3 5 61 3 + 5 6CS590D 16Efficient Implementation of Apriori in SQL• Hard to get good performance out of pure SQL (SQL-92) based approaches alone• Make use of object-relational extensions like UDFs, BLOBs, Table functions etc.– Get orders of magnitude improvement• S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. In SIGMOD’989CS590D 17Challenges of Frequent Pattern Mining• Challenges– Multiple scans of transaction database– Huge number of candidates– Tedious workload of support counting for candidates• Improving Apriori: general ideas– Reduce passes of

View Full Document