BU CS 565 - Mining Association Rules in Large Databases - D1053879

Home> Schools> Boston University> Computer Science (CS) > CS 565> Mining Association Rules in Large Databases

DOC PREVIEW

BU CS 565 - Mining Association Rules in Large Databases

School name Boston University

Course Cs 565- Advanced Java Programming

Pages 49

This preview shows page 1-2-3-23-24-25-26-47-48-49 out of 49 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 49 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 49 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 49 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 49 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 49 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 49 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 49 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 49 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 49 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 49 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 49 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

Mining Association Rules in Large DatabasesAssociation rules • Given a set of transactions D, find rules that will predict the occurrence of an item (or a set of items) based on the occurrences of other items in the transactionMarket-Basket transactionsTID Items 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke Examples of association rules{Diaper} {Beer},{Milk, Bread} {Diaper,Coke},{Beer, Bread} {Milk},An even simpler concept: frequent itemsets• Given a set of transactions D, find combination of items that occur frequentlyMarket-Basket transactionsTID Items 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke Examples of frequent itemsets{Diaper, Beer},{Milk, Bread} {Beer, Bread, Milk},Lecture outline• Task 1: Methods for finding all frequent itemsets efficiently • Task 2: Methods for finding association rules efficientlyDefinition: Frequent Itemset• Itemset– A set of one or more items• E.g.: {Milk, Bread, Diaper}– k-itemset• An itemset that contains k items• Support count ( )– Frequency of occurrence of an itemset(number of transactions it appears)– E.g. ({Milk, Bread,Diaper}) = 2 • Support– Fraction of the transactions in which an itemset appears– E.g. s({Milk, Bread, Diaper}) = 2/5• Frequent Itemset– An itemset whose support is greater than or equal to a minsup thresholdTID Items 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, CokeWhy do we want to find frequent itemsets?• Find all combinations of items that occur together• They might be interesting (e.g., in placement of items in a store )• Frequent itemsets are only positive combinations (we do not report combinations that do not occur frequently together)• Frequent itemsets aims at providing a summary for the dataFinding frequent sets• Task: Given a transaction database D and a minsup threshold find all frequent itemsets and the frequency of each set in this collection• Stated differently: Count the number of times combinations of attributes occur in the data. If the count of a combination is above minsup report it.• Recall: The input is a transaction database D where every transaction consists of a subset of items from some universe IHow many itemsets are there? nullAB AC AD AE BC BD BE CD CE DEA B C D EABC ABD ABE ACD ACE ADE BCD BCE BDE CDEABCD ABCE ABDE ACDE BCDEABCDEGiven d items, there are 2dpossible itemsetsWhen is the task sensible and feasible?• If minsup = 0, then all subsets of I will be frequent and thus the size of the collection will be very large• This summary is very large (maybe larger than the original input) and thus not interesting• The task of finding all frequent sets is interesting typically only for relatively large values of minsupA simple algorithm for finding all frequent itemsets ??nullAB AC AD AE BC BD BE CD CE DEA B C D EABC ABD ABE ACD ACE ADE BCD BCE BDE CDEABCD ABCE ABDE ACDE BCDEABCDEBrute-force algorithm for finding all frequent itemsets?• Generate all possible itemsets (lattice of itemsets)– Start with 1-itemsets, 2-itemsets,...,d-itemsets• Compute the frequency of each itemset from the data– Count in how many transactions each itemset occurs• If the support of an itemset is above minsup report it as a frequent itemsetBrute-force approach for finding all frequent itemsets• Complexity?– Match every candidate against each transaction – For M candidates and N transactions, the complexity is~ O(NMw) => Expensive since M = 2d!!!Speeding-up the brute-force algorithm• Reduce the number of candidates (M)– Complete search: M=2d– Use pruning techniques to reduce M• Reduce the number of transactions (N)– Reduce size of N as the size of itemset increases– Use vertical-partitioning of the data to apply the mining algorithms• Reduce the number of comparisons (NM)– Use efficient data structures to store the candidates or transactions– No need to match every candidate against every transactionReduce the number of candidates• Apriori principle (Main observation):– If an itemset is frequent, then all of its subsets must also be frequent• Apriori principle holds due to the following property of the support measure:– The support of an itemset never exceeds the support of its subsets– This is known as the anti-monotone property of support)()()(:, YsXsYXYXExampleTID Items 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke s(Bread) > s(Bread, Beer)s(Milk) > s(Bread, Milk)s(Diaper, Beer) > s(Diaper, Beer, Coke)Found to be InfrequentnullAB AC AD AE BC BD BE CD CE DEA B C D EABC ABD ABE ACD ACE ADE BCD BCE BDE CDEABCD ABCE ABDE ACDE BCDEABCDEIllustrating the Apriori principlenullAB AC AD AE BC BD BE CD CE DEA B C D EABC ABD ABE ACD ACE ADE BCD BCE BDE CDEABCD ABCE ABDE ACDE BCDEABCDEPruned supersetsIllustrating the Apriori principleItem CountBread 4Coke 2Milk 4Beer 3Diaper 4Eggs 1Itemset Count{Bread,Milk} 3{Bread,Beer} 2{Bread,Diaper} 3{Milk,Beer} 2{Milk,Diaper} 3{Beer,Diaper} 3Itemset Count {Bread,Milk,Diaper} 3 Items (1-itemsets)Pairs (2-itemsets)(No need to generatecandidates involving Cokeor Eggs)Triplets (3-itemsets)minsup = 3/5If every subset is considered, 6C1+ 6C2+ 6C3= 41With support-based pruning,6 + 6 + 1 = 13Exploiting the Apriori principle1. Find frequent 1-items and put them to Lk(k=1)2. Use Lkto generate a collection of candidate itemsets Ck+1with size (k+1)3. Scan the database to find which itemsets in Ck+1are frequent and put them into Lk+14. If Lk+1is not empty k=k+1 Goto step 2R. Agrawal, R. Srikant: "Fast Algorithms for Mining Association Rules", Proc. of the 20th Int'l Conference on Very Large Databases, 1994.The Apriori algorithmCk: Candidate itemsets of size kLk: frequent itemsets of size kL1= {frequent 1-itemsets};for (k = 2; Lk!= ; k++) Ck+1= GenerateCandidates(Lk)for each transaction t in database do increment count of candidates in Ck+1that are contained in tendforLk+1= candidates in Ck+1with support ≥min_supendforreturnkLk;GenerateCandidates• Assume the items in Lkare listed in an order (e.g., alphabetical)• Step 1: self-joining Lk(IN SQL)insert into Ck+1select p.item1, p.item2, …, p.itemk, q.itemkfrom Lkp, Lkqwhere p.item1=q.item1, …, p.itemk-1=q.itemk-1,

View Full Document