Duke CPS 296.1 - Mining Frequent Itemsets - D1710500

Home> Schools> Duke University> (CPS) > CPS 296.1> Mining Frequent Itemsets

DOC PREVIEW

Duke CPS 296.1 - Mining Frequent Itemsets

School name Duke University

Course Cps 296.1- Introduction to Computer Vision

Pages 3

This preview shows page 1 out of 3 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 3 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 3 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

1Mining Frequent ItemsetsCPS 296.1Topics in Database Systems2Data mining• Data → knowledge• DBMS meets AI and statistics• Clustering, prediction (classification and regression), association analysis, outlier analysis, evolution analysis, etc.!Usually complex statistical “queries” that are difficult to answer " often specialized algorithms outside DBMS!We will focus on papers related to association rule/frequent itemset mining3Mining frequent itemsets• Given: a large database of transactions, each containing a set of items– Example: market baskets• Find all frequent itemsets– A set of items X is frequentif no less than smin% of all transactions contain X– Examples: {diaper, beer}, {scanner, color printer}TID itemsT001 diaper, milk, candyT002 milk, eggT003 milk, beerT004 diaper, milk, eggT005 diaper, beerT006 milk, beerT007 diaper, beerT008 diaper, milk, beer, candyT009 diaper, milk, beer……4A naïve algorithm• First try– Keep a running count for each possible itemset– For each transaction T, and for each itemset X, if Tcontains X then increment the count for X– Return itemsets with large enough counts• Problem: The number of itemsets is huge!–2n, where n is the number of items• Think: How do we prune the search space?5The Apriori property• All subsets of a frequent itemset must also be frequent– Because any transaction that contains X must also contains subsets of X!If we have already verified that X is infrequent, there is no need to count X’s supersets because they must be infrequent too6The Apriori algorithm! Agrawal and Srikant. “Fast Algorithms for Mining Association Rules.” VLDB 1994• Multiple passes over the transactions• Pass k finds all frequent k-itemsets (itemset of size k)• Use the set of frequent (k – 1)-itemsets found in the previous pass to narrow the search for k-itemsets27Pseudo-code for AprioriScan the transactions to find L1, the set of all frequent 1-itemsets, together with their counts;for (k = 2; Lk –1≠ ∅; k++) {Generate Ck, the set of candidate k-itemsets,from Lk –1, the set of frequent (k – 1)-itemsets foundin the previous step;Scan the transactions to count the occurrencesof itemsets in Ck ;Find Lk, a subset of Ckcontaining k-itemsets withcounts no less than (smin% · total # of transactions); }Return L1∪ L2∪ … ∪ Lk;8Candidate generationFrom Lk –1to Ck• Join: combine frequent (k – 1)-itemsets to form candidate k-itemsets• Prune: ensure every size-(k – 1) subset of a candidate is frequent9Example: pass 1itemset count{A} 6{B} 7{C} 6{D} 2{E} 2L1Itemset { F } is infrequentTID itemsT001 A, B, ET002 B, DT003 B, CT004 A, B, DT005 A, CT006 B, CT007 A, CT008 A, B, C, ET009 A, B, CT010 FTransactionssmin% = 20%10Example: pass 2itemset count{A} 6{B} 7{C} 6{D} 2{E} 2L1TID itemsT001 A, B, ET002 B, DT003 B, CT004 A, B, DT005 A, CT006 B, CT007 A, CT008 A, B, C, ET009 A, B, CT010 FL1Transactionsitemset{A, B}{A, C}{A, D}{A, E}{B, C}{B, D}{B, E}{C, D}{C, E}{D, E}C2Generatecandidatesitemset count{A, B} 4{A, C} 4{A, D} 1{A, E} 2{B, C} 4{B, D} 2{B, E} 2{C, D} 0{C, E} 1{D, E} 0C2Scan andcountitemset count{A, B} 4{A, C} 4{A, E} 2{B, C} 4{B, D} 2{B, E} 2L2Checkmin. supportsmin% = 20%11Example: pass 3TID itemsT001 A, B, ET002 B, DT003 B, CT004 A, B, DT005 A, CT006 B, CT007 A, CT008 A, B, C, ET009 A, B, CT010 FTransactionsitemset count{A, B} 4{A, C} 4{A, E} 2{B, C} 4{B, D} 2{B, E} 2L2itemset{A, B, C}{A, B, E}{A, C, E}{B, C, D}{B, C, E}{B, D, E}C3Generatecandidatesitemset count{A, B, C} 2{A, B, E} 2C3Scan andcountCheckmin. supportitemset count{A, B, C} 2{A, B, E} 2L3smin% = 20%12Example: pass 4TID itemsT001 A, B, ET002 B, DT003 B, CT004 A, B, DT005 A, CT006 B, CT007 A, CT008 A, B, C, ET009 A, B, CT010 FTransactionsL3C4 and L4are emptyitemset{A, B, C, E}C4Generatecandidatesitemset c ount{A, B, C} 2{A, B, E} 2smin% = 20%313Example: final answeritemset count{A} 6{B} 7{C} 6{D} 2{E} 2L1itemset count{A, B} 4{A, C} 4{A, E} 2{B, C} 4{B, D} 2{B, E} 2L2itemset count{A, B, C} 2{A, B, E} 2L314Other tricks and extensions• Transaction reduction– If a transaction does not contain any frequent k-itemset, remove it from further consideration» AprioriTid, AprioriHybrid, from the same paper• Dynamic itemset counting– Why only introduce candidate itemsets at the end of a scan? Start counting them whenever there is enough support from smaller itemsets– Fewer passes over data» Brin et al., SIGMOD 1997• Parallelization, sampling, incremental mining,

View Full Document


School:
Email:
New Password:
Confirm Password:

This preview shows page 1 out of 3 pages.

Duke CPS 296.1 - Mining Frequent Itemsets

Sign up for free to view:

Please select your school