DOC PREVIEW
Duke CPS 296.1 - Mining Frequent Itemsets

This preview shows page 1 out of 3 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 3 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 3 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

1Mining Frequent ItemsetsCPS 296.1Topics in Database Systems2Data mining• Data → knowledge• DBMS meets AI and statistics• Clustering, prediction (classification and regression), association analysis, outlier analysis, evolution analysis, etc.!Usually complex statistical “queries” that are difficult to answer " often specialized algorithms outside DBMS!We will focus on papers related to association rule/frequent itemset mining3Mining frequent itemsets• Given: a large database of transactions, each containing a set of items– Example: market baskets• Find all frequent itemsets– A set of items X is frequentif no less than smin% of all transactions contain X– Examples: {diaper, beer}, {scanner, color printer}TID itemsT001 diaper, milk, candyT002 milk, eggT003 milk, beerT004 diaper, milk, eggT005 diaper, beerT006 milk, beerT007 diaper, beerT008 diaper, milk, beer, candyT009 diaper, milk, beer……4A naïve algorithm• First try– Keep a running count for each possible itemset– For each transaction T, and for each itemset X, if Tcontains X then increment the count for X– Return itemsets with large enough counts• Problem: The number of itemsets is huge!–2n, where n is the number of items• Think: How do we prune the search space?5The Apriori property• All subsets of a frequent itemset must also be frequent– Because any transaction that contains X must also contains subsets of X!If we have already verified that X is infrequent, there is no need to count X’s supersets because they must be infrequent too6The Apriori algorithm! Agrawal and Srikant. “Fast Algorithms for Mining Association Rules.” VLDB 1994• Multiple passes over the transactions• Pass k finds all frequent k-itemsets (itemset of size k)• Use the set of frequent (k – 1)-itemsets found in the previous pass to narrow the search for k-itemsets27Pseudo-code for AprioriScan the transactions to find L1, the set of all frequent 1-itemsets, together with their counts;for (k = 2; Lk –1≠ ∅; k++) {Generate Ck, the set of candidate k-itemsets,from Lk –1, the set of frequent (k – 1)-itemsets foundin the previous step;Scan the transactions to count the occurrencesof itemsets in Ck ;Find Lk, a subset of Ckcontaining k-itemsets withcounts no less than (smin% · total # of transactions); }Return L1∪ L2∪ … ∪ Lk;8Candidate generationFrom Lk –1to Ck• Join: combine frequent (k – 1)-itemsets to form candidate k-itemsets• Prune: ensure every size-(k – 1) subset of a candidate is frequent9Example: pass 1itemset count{A} 6{B} 7{C} 6{D} 2{E} 2L1Itemset { F } is infrequentTID itemsT001 A, B, ET002 B, DT003 B, CT004 A, B, DT005 A, CT006 B, CT007 A, CT008 A, B, C, ET009 A, B, CT010 FTransactionssmin% = 20%10Example: pass 2itemset count{A} 6{B} 7{C} 6{D} 2{E} 2L1TID itemsT001 A, B, ET002 B, DT003 B, CT004 A, B, DT005 A, CT006 B, CT007 A, CT008 A, B, C, ET009 A, B, CT010 FL1Transactionsitemset{A, B}{A, C}{A, D}{A, E}{B, C}{B, D}{B, E}{C, D}{C, E}{D, E}C2Generatecandidatesitemset count{A, B} 4{A, C} 4{A, D} 1{A, E} 2{B, C} 4{B, D} 2{B, E} 2{C, D} 0{C, E} 1{D, E} 0C2Scan andcountitemset count{A, B} 4{A, C} 4{A, E} 2{B, C} 4{B, D} 2{B, E} 2L2Checkmin. supportsmin% = 20%11Example: pass 3TID itemsT001 A, B, ET002 B, DT003 B, CT004 A, B, DT005 A, CT006 B, CT007 A, CT008 A, B, C, ET009 A, B, CT010 FTransactionsitemset count{A, B} 4{A, C} 4{A, E} 2{B, C} 4{B, D} 2{B, E} 2L2itemset{A, B, C}{A, B, E}{A, C, E}{B, C, D}{B, C, E}{B, D, E}C3Generatecandidatesitemset count{A, B, C} 2{A, B, E} 2C3Scan andcountCheckmin. supportitemset count{A, B, C} 2{A, B, E} 2L3smin% = 20%12Example: pass 4TID itemsT001 A, B, ET002 B, DT003 B, CT004 A, B, DT005 A, CT006 B, CT007 A, CT008 A, B, C, ET009 A, B, CT010 FTransactionsL3C4 and L4are emptyitemset{A, B, C, E}C4Generatecandidatesitemset c ount{A, B, C} 2{A, B, E} 2smin% = 20%313Example: final answeritemset count{A} 6{B} 7{C} 6{D} 2{E} 2L1itemset count{A, B} 4{A, C} 4{A, E} 2{B, C} 4{B, D} 2{B, E} 2L2itemset count{A, B, C} 2{A, B, E} 2L314Other tricks and extensions• Transaction reduction– If a transaction does not contain any frequent k-itemset, remove it from further consideration» AprioriTid, AprioriHybrid, from the same paper• Dynamic itemset counting– Why only introduce candidate itemsets at the end of a scan? Start counting them whenever there is enough support from smaller itemsets– Fewer passes over data» Brin et al., SIGMOD 1997• Parallelization, sampling, incremental mining,


View Full Document

Duke CPS 296.1 - Mining Frequent Itemsets

Documents in this Course
Lecture

Lecture

18 pages

Lecture

Lecture

6 pages

Lecture

Lecture

13 pages

Lecture

Lecture

5 pages

Load more
Download Mining Frequent Itemsets
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Mining Frequent Itemsets and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Mining Frequent Itemsets 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?