Duke CPS 296.1 - Lecture 21

1. Incremental Mining of Frequent Itemsets
CPS 296.1: Topics in Database Systems

2. Mining a growing database
• Given: DB, a database of transactions, each containing a set of items
• Find: L(DB), the set of all frequent itemsets
  – A set of items X is frequent if no fewer than s_min% × |DB| transactions contain X
• If we add a set of transactions to the database (i.e., DB ← DB ⊎ ΔDB), what is L(DB ⊎ ΔDB)?
  – Re-computation is not optimal because it ignores the result of mining the old DB

3. Incorrect approaches
• L(DB ⊎ ΔDB) = L(DB) ∪ L(ΔDB)?
  – X can be frequent in DB, but infrequent in ΔDB and in DB ⊎ ΔDB
  – And vice versa: X can be frequent in ΔDB, but infrequent in DB and in DB ⊎ ΔDB
→ L(DB) is not monotone
• L(DB ⊎ ΔDB) = L(DB) ∩ L(ΔDB)?
  – X can be infrequent in DB, but frequent in ΔDB and in DB ⊎ ΔDB
  – And vice versa

4. Positive and negative border
• Positive border, bd⁺(DB): maximal frequent itemsets in DB
  – X is in the positive border if X is frequent and no proper superset of X is frequent
  – Example: bd⁺(DB) = { ab, c }
• Negative border, bd⁻(DB): minimal infrequent itemsets in DB
  – X is in the negative border if X is infrequent and no proper subset of X is infrequent
  – Example: bd⁻(DB) = { d, ac, bc }
[Figure: itemset lattice over items a, b, c, d, from ∅ up to abcd, with the frequent itemsets marked]

5. Facts about negative border
• Observation 1: Every 1-itemset is in either L(DB) or bd⁻(DB)
• Observation 2: recall pass k of Apriori
  – Generate C_k (candidate itemsets of size k) from L_{k−1} (frequent itemsets of size k−1)
  – Count C_k to determine L_k (⊆ C_k)
→ C_k − L_k is the negative border at level k
→ Apriori counts C_k − L_k
→ After mining DB, we know the itemsets in both L(DB) and bd⁻(DB), together with their counts
  – Remember this information to help the incremental mining algorithm

6. First try at an incremental algorithm
• Input: DB, ΔDB, L(DB) and bd⁻(DB) together with their counts in DB
• Output: L(DB ⊎ ΔDB) together with their counts in DB ⊎ ΔDB (will come back later to this requirement)
• Method
  – Same as Apriori, but
  – When counting C_k, if X ∈ C_k is in L(DB) or bd⁻(DB), do not go through DB because the count of X in DB is already known; simply go through ΔDB
→ We might save a scan over DB (but not ΔDB) if all itemsets in C_k have been counted in DB

7. A problem of the first try
• Each scan over DB may count only a few itemsets ⇒ insufficient computation to overlap I/O
  – Also a problem in Apriori
  – But aggravated in the incremental algorithm because some of C_k may have been counted before
→ In general, a trade-off in level-wise algorithms
  – If we count an itemset X in the next level, we risk doing useless work because a subset of X (which we are counting at the same time) may turn out to be infrequent

8. Another problem (?) (slide 1)
• Did not use the fact that bd⁻(DB) and beyond are infrequent in DB
• Example
  – Pass 1
    • Scan of DB is saved
    • Say all 1-itemsets turn out to be frequent in DB ⊎ ΔDB
  – Pass 2
    • Scan of DB is needed because ad, bd, cd ∈ C_2 but they have never been counted in DB
    • But at least we know they are infrequent in DB; perhaps their counts in ΔDB are not high enough to make them frequent in DB ⊎ ΔDB, so we could have avoided scanning DB
[Figure: itemset lattice over items a, b, c, d, with L(DB) and bd⁻(DB) marked]

9. Another problem (?) (slide 2)
• If we also care about bd⁻(DB ⊎ ΔDB) with counts, then we still need to count ad, bd, cd in DB
  – Counts for bd⁻(DB ⊎ ΔDB) are needed to make the incremental algorithm ready for the next ΔDB
[Figure: itemset lattice over items a, b, c, d, with L(DB) and bd⁻(DB) marked]

10. Observation 1
• If X is infrequent in DB, then X can be frequent in DB ⊎ ΔDB only if X is frequent in ΔDB
  – Infrequent in both DB and ΔDB ⇒ infrequent in DB ⊎ ΔDB
→ Strategy implied
  – First, mine ΔDB to find L(ΔDB) and bd⁻(ΔDB) with counts
  – When counting C_k, if X ∈ C_k is in L(ΔDB) or bd⁻(ΔDB), do not go through ΔDB because the count of X in ΔDB is already known
  – Add the following pruning condition: for any X ∈ C_k, if we already know X ∉ L(DB) and X ∉ L(ΔDB), remove X from C_k

11. Observation 2
• If none of the itemsets in bd⁻(DB) becomes frequent in DB ⊎ ΔDB, then no new itemset will be introduced (i.e., L(DB ⊎ ΔDB) ⊆ L(DB))
  – Say X is infrequent in DB
  – Then there exists Y ⊆ X s.t. Y ∈ bd⁻(DB)
  – Since none of the itemsets in bd⁻(DB) is frequent in DB ⊎ ΔDB, Y is infrequent in DB ⊎ ΔDB
  – That means X ⊇ Y is infrequent in DB ⊎ ΔDB
→ Strategy implied
  – In ΔDB, count itemsets in bd⁻(DB) to find their counts in DB ⊎ ΔDB
  – If none of these itemsets is frequent in DB ⊎ ΔDB, there is no need to scan DB at all

12. Second try (slide 1)
→ Thomas et al. “An Efficient Algorithm for the Incremental Updation of Association Rules in Large Databases.” SIGKDD, 1997
• Mine ΔDB to obtain L(ΔDB) and bd⁻(ΔDB) with counts
• While mining ΔDB, also count itemsets in L(DB) and bd⁻(DB)
• For each itemset in L(DB) and bd⁻(DB), calculate its count in DB ⊎ ΔDB
(Continued on the next slide)

13. Second try (slide 2)
(Continued from the previous slide)
• If none of the itemsets in bd⁻(DB) is frequent in DB ⊎ ΔDB, stop and output the itemsets in L(DB) and bd⁻(DB) that are in L(DB ⊎ ΔDB) or bd⁻(DB ⊎ ΔDB), together with their counts
• Otherwise, scan DB once
  – Count all itemsets in C = (L(ΔDB) ∪ bd⁻(ΔDB)) − L(DB) − bd⁻(DB) − { X | ∃Y ∈ L(DB) ∪ bd⁻(DB) s.t. Y is known to be infrequent in DB ⊎ ΔDB and Y ⊆ X }
  – Output the itemsets in L(DB), bd⁻(DB), and C that are in L(DB ⊎ ΔDB) or bd⁻(DB ⊎ ΔDB), together with their counts

14. Experiments
• Nowhere near the ideal speed-up
  – Incremental algorithm does not replace Apriori
• Smaller ΔDB means bigger speed-up (usually)
• Speed-up is lower for a very high support threshold
  – Apriori makes very few passes anyway
• Speed-up is lower for a very low support threshold
  – Probability of the negative border expanding is higher

15. First vs. second try
• Second try (Thomas et al.) scans DB at most once
  – May need to count lots of itemsets in the same pass
  – Some of these itemsets may not need to be counted
    • Example?
  – Also, complete mining of ΔDB may be unnecessary
    • Example?
• First try scans DB multiple times (up to the number of scans required by Apriori minus one)
  – Will not scan DB if the second try does not
  – May count very few itemsets in one pass
  – Every itemset counted is necessary
→ Fundamental trade-off in play
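
The "second try" on slides 12 and 13 can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the implementation from Thomas et al.: the function names (count, is_frequent, mine_with_border, incremental_update) are hypothetical, transactions are in-memory frozensets, the threshold is an integer percentage, and the sketch returns only L(DB ⊎ ΔDB) with counts, leaving out the bookkeeping for the new negative border. The toy database is chosen so that it reproduces the border example from slide 4.

```python
from itertools import combinations

def count(itemsets, transactions):
    """One scan: number of transactions containing each itemset."""
    return {x: sum(1 for t in transactions if x <= t) for x in itemsets}

def is_frequent(cnt, n, s_min_pct):
    """Frequent: contained in no fewer than s_min% of n transactions."""
    return 100 * cnt >= s_min_pct * n

def mine_with_border(transactions, items, s_min_pct):
    """Apriori-style level-wise mining. Returns (L, bd_minus) as dicts
    mapping each itemset (frozenset) to its count in `transactions`."""
    n = len(transactions)
    frequent, border = {}, {}
    level, k = [frozenset([i]) for i in items], 1
    while level:
        counts = count(level, transactions)
        l_k = {x for x in level if is_frequent(counts[x], n, s_min_pct)}
        for x in level:  # C_k - L_k is the negative border at level k
            (frequent if x in l_k else border)[x] = counts[x]
        k += 1
        joined = {a | b for a in l_k for b in l_k if len(a | b) == k}
        level = [c for c in joined
                 if all(frozenset(s) in l_k for s in combinations(c, k - 1))]
    return frequent, border

def incremental_update(db, delta, old_l, old_bd, items, s_min_pct):
    """Returns (frequent itemsets of db + delta with counts, DB scanned?)."""
    n = len(db) + len(delta)
    l_d, bd_d = mine_with_border(delta, items, s_min_pct)
    old = {**old_l, **old_bd}                 # counts already known in DB
    in_delta = count(old, delta)
    total = {x: old[x] + in_delta[x] for x in old}
    scanned = False
    if any(is_frequent(total[x], n, s_min_pct) for x in old_bd):
        # The negative border moved: one scan of DB for the candidate set C.
        infrequent = [y for y in old if not is_frequent(total[y], n, s_min_pct)]
        mined_d = {**l_d, **bd_d}
        cands = [x for x in mined_d
                 if x not in old and not any(y <= x for y in infrequent)]
        in_db = count(cands, db)
        total.update({x: in_db[x] + mined_d[x] for x in cands})
        scanned = True
    new_l = {x: c for x, c in total.items() if is_frequent(c, n, s_min_pct)}
    return new_l, scanned

# Toy data (an assumption for illustration): L(DB) = {a, b, c, ab} and
# bd-(DB) = {d, ac, bc} at a 50% threshold, as in the slide 4 example.
items = "abcd"
db = [frozenset("ab"), frozenset("ab"), frozenset("c"), frozenset("c")]
old_l, old_bd = mine_with_border(db, items, 50)
delta = [frozenset("cd")] * 4
new_l, scanned = incremental_update(db, delta, old_l, old_bd, items, 50)
print(scanned)                                    # True
print(sorted("".join(sorted(x)) for x in new_l))  # ['c', 'cd', 'd']
```

Here d moves from bd⁻(DB) into L(DB ⊎ ΔDB), so DB is scanned once, and only cd has to be counted in that scan. With an increment such as a single transaction {a, b}, no border itemset becomes frequent and the function returns without touching DB.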

