Duke CPS 296.1 - Lecture 21 - D1607641

1. Incremental Mining of Frequent Itemsets
CPS 296.1
Topics in Database Systems

2. Mining a growing database
• Given: DB, a database of transactions, each containing a set of items
• Find: L(DB), the set of all frequent itemsets
  – A set of items X is frequent if no fewer than s_min% × |DB| transactions contain X
• If we add a set of transactions to the database (i.e., DB ← DB ⊎ ΔDB), what is L(DB ⊎ ΔDB)?
  – Re-computation is not optimal because it ignores the result of mining the old DB

3. Incorrect approaches
• L(DB ⊎ ΔDB) = L(DB) ∪ L(ΔDB)?
  – X can be frequent in DB, but infrequent in ΔDB and in DB ⊎ ΔDB
  – And vice versa: X can be frequent in ΔDB, but infrequent in DB and in DB ⊎ ΔDB
  → L(DB) is not monotone
• L(DB ⊎ ΔDB) = L(DB) ∩ L(ΔDB)?
  – X can be infrequent in DB, but frequent in ΔDB and in DB ⊎ ΔDB
  – And vice versa

4. Positive and negative border
• Positive border, bd⁺(DB): maximal frequent itemsets in DB
  – X is in the positive border if X is frequent and no proper superset of X is frequent
  – Example: bd⁺(DB) = { ab, c }
• Negative border, bd⁻(DB): minimal infrequent itemsets in DB
  – X is in the negative border if X is infrequent and no proper subset of X is infrequent
  – Example: bd⁻(DB) = { d, ac, bc }
[Figure: lattice of all itemsets over items a, b, c, d, from ∅ up to abcd, with the frequent itemsets marked]

5. Facts about negative border
• Observation 1: Every 1-itemset is in either L(DB) or bd⁻(DB)
• Observation 2: recall pass k of Apriori
  – Generate C_k (candidate itemsets of size k) from L_{k−1} (frequent itemsets of size k − 1)
  – Count C_k to determine L_k (⊆ C_k)
  → C_k − L_k is the negative border at level k
  → Apriori counts C_k − L_k
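The level-wise pass described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the course's code: the function names (`count_support`, `generate_candidates`, `apriori_with_border`) and the use of an absolute `min_count` threshold instead of a percentage are assumptions made for the example. It records the count of every candidate, so both L(DB) and the negative border come out of the same passes.

```python
from itertools import combinations


def count_support(transactions, candidates):
    """Count how many transactions contain each candidate itemset."""
    counts = {c: 0 for c in candidates}
    for t in transactions:
        for c in counts:
            if c <= t:  # c is a subset of the transaction
                counts[c] += 1
    return counts


def generate_candidates(prev_level, k):
    """C_k: size-k itemsets whose (k-1)-subsets are all in L_(k-1)."""
    pool = sorted(set().union(*prev_level))
    return {
        frozenset(c)
        for c in combinations(pool, k)
        if all(frozenset(s) in prev_level for s in combinations(c, k - 1))
    }


def apriori_with_border(transactions, items, min_count):
    """Return (frequent, border), each mapping an itemset to its count.

    Hypothetical helper: min_count is an absolute threshold (an assumption;
    the slides use a percentage of |DB|).
    """
    frequent, border = {}, {}
    candidates = {frozenset([i]) for i in items}  # C_1: every 1-itemset
    k = 1
    while candidates:
        counts = count_support(transactions, candidates)
        level = {c for c in candidates if counts[c] >= min_count}  # L_k
        for c in candidates:
            # C_k - L_k is the negative border at level k
            (frequent if c in level else border)[c] = counts[c]
        k += 1
        candidates = generate_candidates(level, k)
    return frequent, border
```

With the four transactions {a,b}, {a,b}, {c}, {c,d} and `min_count=2`, the frequent itemsets are a, b, c, and ab, and the negative border is { d, ac, bc }, which matches the border example in the text.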
• After mining DB, we know the itemsets in both L(DB) and bd⁻(DB), together with their counts
  – Remember this information to help the incremental mining algorithm

6. First try at an incremental algorithm
• Input: DB, ΔDB, and L(DB) and bd⁻(DB) together with their counts in DB
• Output: L(DB ⊎ ΔDB) together with their counts in DB ⊎ ΔDB (we will come back to this requirement later)
• Method
  – Same as Apriori, but
  – When counting C_k, if X ∈ C_k is in L(DB) or bd⁻(DB), do not go through DB because the count of X in DB is already known; simply go through ΔDB
  → We might save a scan over DB (but not ΔDB) if all itemsets in C_k have been counted in DB

7. A problem of the first try
• Each scan over DB may count only a few itemsets ⇒ insufficient computation to overlap with I/O
  – Also a problem in Apriori
  – But aggravated in the incremental algorithm because some of C_k may have been counted before
  → In general, a trade-off in level-wise algorithms
  – If we count an itemset X in the next level, we risk doing useless work because a subset of X (which we are counting at the same time) may turn out to be infrequent

8. Another problem (?) (slide 1)
• The first try did not use the fact that itemsets in bd⁻(DB) and beyond are infrequent in DB
• Example
  – Pass 1
    • Scan of DB is saved
    • Say all 1-itemsets turn out to be frequent in DB ⊎ ΔDB
  – Pass 2
    • Scan of DB is needed because ad, bd, cd ∈ C_2 but they have never been counted in DB
    • But at least we know they are infrequent in DB; perhaps their counts in ΔDB are not high enough to make them frequent in DB ⊎ ΔDB, so we could have avoided scanning DB
[Figure: lattice of all itemsets over items a, b, c, d, with L(DB) and bd⁻(DB) marked]

9. Another problem (?) (slide 2)
• If we also care about bd⁻(DB ⊎ ΔDB) with counts, then we still need to count ad, bd, cd in DB
  – Counts for bd⁻(DB ⊎ ΔDB) are needed to make the incremental algorithm ready for the next ΔDB
[Figure: lattice of all itemsets over items a, b, c, d, with L(DB) and bd⁻(DB) marked]

10. Observation 1
• If X is infrequent in DB, then X can be frequent in DB ⊎ ΔDB only if X is frequent in ΔDB
  – Infrequent in both DB and ΔDB ⇒ infrequent in DB ⊎ ΔDB
  → Strategy implied
  – First, mine ΔDB to find L(ΔDB) and bd⁻(ΔDB) with counts
  – When counting C_k, if X ∈ C_k is in L(ΔDB) or bd⁻(ΔDB), do not go through ΔDB because the count of X in ΔDB is already known
  – Add the following pruning condition: for any X ∈ C_k, if we already know X ∉ L(DB) and X ∉ L(ΔDB), remove X from C_k

11. Observation 2
• If none of the itemsets in bd⁻(DB) becomes frequent in DB ⊎ ΔDB, then no new itemset will be introduced (i.e., L(DB ⊎ ΔDB) ⊆ L(DB))
  – Say X is infrequent in DB
  – Then there exists Y ⊆ X s.t. Y ∈ bd⁻(DB)
  – Since none of the itemsets in bd⁻(DB) is frequent in DB ⊎ ΔDB, Y is infrequent in DB ⊎ ΔDB
  – That means X ⊇ Y is infrequent in DB ⊎ ΔDB
  → Strategy implied
  – In ΔDB, count the itemsets in bd⁻(DB) to find their counts in DB ⊎ ΔDB
  – If none of these itemsets is frequent in DB ⊎ ΔDB, there is no need to scan DB at all

12. Second try (slide 1)
→ Thomas et al. "An Efficient Algorithm for the Incremental Updation of Association Rules in Large Databases." SIGKDD, 1997
• Mine ΔDB to obtain L(ΔDB) and bd⁻(ΔDB) with counts
• While mining ΔDB, also count the itemsets in L(DB) and bd⁻(DB)
• For each itemset in L(DB) and bd⁻(DB), calculate its count in DB ⊎ ΔDB
(Continued on the next slide)

13. Second try (slide 2)
(Continued from the previous slide)
• If none of the itemsets in bd⁻(DB) is frequent in DB ⊎ ΔDB, stop and output the itemsets in L(DB) and bd⁻(DB) that are in L(DB ⊎ ΔDB) or bd⁻(DB ⊎ ΔDB), together with their counts
• Otherwise, scan DB once
  – Count all itemsets in C = L(ΔDB) ∪ bd⁻(ΔDB) − L(DB) − bd⁻(DB) − { X | ∃Y ∈ L(DB) ∪ bd⁻(DB) s.t. Y is known to be infrequent in DB ⊎ ΔDB and Y ⊆ X }
  – Output the itemsets in L(DB), bd⁻(DB), and C that are in L(DB ⊎ ΔDB) or bd⁻(DB ⊎ ΔDB), together with their counts

14. Experiments
• Not close to the ideal speed-up
  – Incremental algorithm does not replace Apriori
• Smaller ΔDB means bigger speed-up (usually)
• Speed-up is lower for a very high support threshold
  – Apriori makes very few passes anyway
• Speed-up is lower for a very low support threshold
  – Probability of the negative border expanding is higher

15. First vs. second try
• Second try (Thomas et al.) scans DB at most once
  – May need to count lots of itemsets in the same pass
  – Some of these itemsets may not need to be counted
    • Example?
  – Also, complete mining of ΔDB may be unnecessary
    • Example?
• First try scans DB multiple times (up to the number of scans required by Apriori minus one)
  – Will not scan DB if the second try does not
  – May count very few itemsets in one pass
  – Every itemset counted is necessary
  → Fundamental trade-off in play
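The steps of the second try, as listed on the slides above, can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation from Thomas et al.: the names `mine`, `classify`, and `incremental_update` are hypothetical, `min_sup` is assumed to be a fraction of the database size, and the old itemsets are counted in ΔDB with a separate pass for simplicity rather than during the mining of ΔDB. The returned border contains only those negative-border itemsets whose counts the listed steps make available.

```python
from itertools import combinations


def count_support(transactions, candidates):
    """Count how many transactions contain each candidate itemset."""
    counts = {c: 0 for c in candidates}
    for t in transactions:
        for c in counts:
            if c <= t:
                counts[c] += 1
    return counts


def mine(transactions, items, min_sup):
    """Level-wise mining; returns counts for L and bd- together."""
    threshold = min_sup * len(transactions)
    counted = {}
    candidates = {frozenset([i]) for i in items}
    k = 1
    while candidates:
        counts = count_support(transactions, candidates)
        counted.update(counts)
        level = {c for c in candidates if counts[c] >= threshold}
        k += 1
        pool = sorted(set().union(*level))
        candidates = {
            frozenset(c)
            for c in combinations(pool, k)
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return counted


def classify(counts, threshold):
    """Split counted itemsets into frequent ones and negative-border ones."""
    frequent = {x: n for x, n in counts.items() if n >= threshold}
    border = {
        x: n
        for x, n in counts.items()
        if n < threshold and all(x - {i} in frequent for i in x if len(x) > 1)
    }
    return frequent, border


def incremental_update(db, delta, items, min_sup, known_db):
    """known_db: counts in DB for every itemset in L(DB) and bd-(DB).

    Returns (frequent, border, scanned_db) for the combined database.
    """
    threshold = min_sup * (len(db) + len(delta))
    known_delta = mine(delta, items, min_sup)   # L and bd- of the increment
    in_delta = count_support(delta, known_db)   # old itemsets, counted in the increment
    total = {x: known_db[x] + in_delta[x] for x in known_db}

    # Itemsets in known_db that are infrequent in DB are exactly bd-(DB).
    border_db = [x for x, n in known_db.items() if n < min_sup * len(db)]
    if not any(total[x] >= threshold for x in border_db):
        return (*classify(total, threshold), False)  # no scan of DB needed

    # Otherwise build C, pruning supersets of itemsets known to be infrequent.
    dead = [y for y, n in total.items() if n < threshold]
    extra = {
        x for x in known_delta
        if x not in known_db and not any(y <= x for y in dead)
    }
    in_db = count_support(db, extra)                 # the single scan of DB
    for x in extra:
        total[x] = in_db[x] + known_delta[x]
    return (*classify(total, threshold), True)
```

As a usage example, take DB = {a,b}, {a,b}, {c}, {c,d} with `min_sup=0.5`. Adding the increment {a,b}, {c} leaves every itemset of bd⁻(DB) infrequent, so DB is not scanned. Adding {c,d}, {c,d} instead makes d frequent, which triggers the single scan of DB to count cd.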
