CMU CS 15826 - Lecture

CMU SCS
15-826: Multimedia Databases and Data Mining
Data Mining - trees + assoc. rules
C. Faloutsos
Copyright: C. Faloutsos (2005)

Outline
Goal: 'Find similar / interesting things'
• Intro to DB
• Indexing - similarity search
• Data Mining

Data Mining - Detailed outline
• Statistics
• AI - decision trees
• DB
  - data warehouses; data cubes; OLAP
  - classifiers
  - association rules
  - misc. topics:
    • reconstruction of info
    • network databases; time sequence forecasting

Classifiers - outline
• Case study: 'Interval Classifier' ('IC')
• recent developments and variations

Tree Classifiers
Database issues: how about huge (training) datasets?
Case study: Interval Classifier [Agrawal+92]
Goal: build a classifier (e.g., for target mailing)
Differences from AI/ML:
• retrieval efficiency (could use DBMS indices!)
• generation efficiency (large training dataset)

Tree Classifiers - 'IC'
Proposed method: use a classification tree, but
• split a range (= numeric attribute) into k sub-ranges, as opposed to just 2
• do 'dynamic pruning' (i.e., don't expand a node that is fairly homogeneous)

Tree Classifiers - 'IC'
Sketch of the algorithm:
make-tree():
  partition the set into groups, by label
  obtain histograms for each group and each attribute
  apply a goodness function to pick the winning attribute A'
  partition the domain of A' into "strong" and "weak" intervals
  for each "strong" interval: assign it to the majority label
  for each "weak" interval: make-tree()

Tree Classifiers - 'IC'
• "strong" interval: homogeneous (or close enough)
• k: depends on the # of distinct values
• 'interval' = 'range' for a continuous attribute; 'interval' = 'value' for a categorical one
• histograms: equi-width
Classification accuracy: comparable to standard algorithms (ID3, C4)
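The make-tree() sketch above can be rendered as a small runnable illustration. This is not the implementation from [Agrawal+92]: the goodness function, the purity threshold standing in for the paper's dynamic-pruning criterion, and all names (`make_tree`, `goodness`, `interval`, `k`, `purity`) are assumptions; equi-width histograms appear only implicitly as equi-width bucketing of numeric attributes.

```python
from collections import Counter

def interval(v, lo, width, k):
    """Map value v to one of k equi-width buckets starting at lo."""
    return max(0, min(int((v - lo) / width), k - 1))

def goodness(rows, labels, a, k):
    """Toy goodness function (an assumption, not the paper's): the
    fraction of rows matching the majority label of their bucket."""
    lo = min(r[a] for r in rows)
    width = ((max(r[a] for r in rows) - lo) / k) or 1.0
    buckets = [Counter() for _ in range(k)]
    for r, y in zip(rows, labels):
        buckets[interval(r[a], lo, width, k)][y] += 1
    return sum(max(b.values()) for b in buckets if b) / len(rows)

def make_tree(rows, labels, k=4, purity=0.9):
    """make-tree() from the slide: pick the winning attribute, cut its
    domain into k intervals, label 'strong' (near-homogeneous) intervals,
    recurse on 'weak' ones."""
    majority, count = Counter(labels).most_common(1)[0]
    if count / len(labels) >= purity or len(labels) < k:
        return majority                      # 'dynamic pruning': stop here
    best = max(range(len(rows[0])),          # winning attribute
               key=lambda a: goodness(rows, labels, a, k))
    lo = min(r[best] for r in rows)
    width = ((max(r[best] for r in rows) - lo) / k) or 1.0
    buckets = [[] for _ in range(k)]
    for j, r in enumerate(rows):
        buckets[interval(r[best], lo, width, k)].append(j)
    if any(len(b) == len(rows) for b in buckets):
        return majority                      # split made no progress: stop
    children = [make_tree([rows[j] for j in b], [labels[j] for j in b],
                          k, purity) if b else majority
                for b in buckets]
    return {'attr': best, 'lo': lo, 'width': width, 'children': children}

def classify(node, row):
    """Walk the tree: at each node, follow the interval that covers row."""
    while isinstance(node, dict):
        i = interval(row[node['attr']], node['lo'], node['width'],
                     len(node['children']))
        node = node['children'][i]
    return node

# demo: one numeric attribute that separates the two labels at value 10
tree = make_tree([[x] for x in range(20)], ['a'] * 10 + ['b'] * 10)
print(classify(tree, [3]), classify(tree, [17]))   # a b
```

The point of the k-way split is visible here: one node can carve the whole range into several intervals at once, instead of a cascade of binary splits.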
Tree Classifiers - 'IC'
Conclusions - compared to standard algorithms (ID3, C4):
• faster, because of
  - k-way splitting and
  - dynamic pruning
• comparable classification accuracy

Classifiers - outline
• Case study: 'Interval Classifier' ('IC')
• newer developments and variations

Classifiers - newer methods
• SLIQ [Mehta+96]
• SPRINT [Shafer+, vldb96]
• PUBLIC [Rastogi+Shim, vldb98]
• RainForest [Gehrke+, 2000]
Goal: how to build decision trees when the training set does not fit in memory

Classifiers - newer methods
Goal: how to build decision trees when the training set does not fit in memory
SLIQ: use vertical partitioning (attr-value, record-id) for each attribute; keep the (label, record-id) list in main memory
SPRINT: like SLIQ, but attach the 'label' to each attribute list: (attr-value, label, record-id)

Classifiers - conclusions
Recent methods try to improve scalability/speed with:
• 'dynamic' pruning
• elaborate file structures / data placement
• parallelism

Data Mining - Detailed outline
• Statistics
• AI - decision trees
• DB
  - data warehouses; data cubes; OLAP
  - classifiers
  - association rules
  - misc. topics:
    • reconstruction of info
    • network databases; time sequence forecasting

Association rules - outline
• Main idea [Agrawal+SIGMOD93]
• performance improvements
• Variations / Applications
• Follow-up concepts

Association rules - idea
[Agrawal+SIGMOD93]
• Consider the 'market basket' case:
  (milk, bread)
  (milk)
  (milk, chocolate)
  (milk, bread)
• Find 'interesting things', e.g., rules of the form:
  milk, bread -> chocolate | 90%

Association rules - idea
In general, for a given rule of the form Ij, Ik, ...
Im -> Ix | c:
'c' = 'confidence': how often people buy Ix, given that they have bought Ij, ..., Im
's' = 'support': how often people buy Ij, ..., Im, Ix

Association rules - idea
Problem definition:
• given
  - a set of 'market baskets' (= a binary matrix of N rows/baskets and M columns/products)
  - min-support 's' and
  - min-confidence 'c'
• find
  - all the rules with higher support and confidence

Association rules - idea
Closely related concept: "large itemset"
Ij, Ik, ..., Im, Ix is a 'large itemset' if it appears more than 'min-support' times
Observation: once we have a 'large itemset', we can find the qualifying rules easily (how?)
Thus, let's focus on how to find 'large itemsets'

Association rules - idea
Naive solution: scan the database once; keep 2**|I| counters
Drawback?
Improvement?

Association rules - idea
Naive solution: scan the database once; keep 2**|I| counters
Drawback? 2**1000 is prohibitive...
Improvement? scan the db |I| times, looking for 1-, 2-, etc. itemsets
E.g., for |I| = 3 items only (A, B, C), we have:

Association rules - idea
First pass - single-item counts (min-sup: 10):
  A: 100
  B: 200
  C: 2

Association rules - idea
Second pass - candidate pairs: A,B   A,C   B,C
(C failed min-sup on the first pass, so the pairs containing C can be skipped)

Association rules - idea
Anti-monotonicity property: if an itemset fails to be 'large', so will every superset of it (hence all supersets can be pruned).
Sketch of the (famous!) 'a-priori' algorithm:
Let L(i-1) be the set of large itemsets with i-1 elements
Let C(i) be the set of candidate itemsets of size i
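The support and confidence definitions above can be read off directly in code. A minimal sketch (function names are mine; the baskets are the four from the example slide - note that on this toy data the illustrative rule milk, bread -> chocolate has confidence 0, since the slide's 90% is only an example figure):

```python
def support(itemset, baskets):
    """Fraction of baskets that contain every item of the itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(lhs, rhs, baskets):
    """Of the baskets containing lhs, the fraction that also contain rhs."""
    return sum((lhs | rhs) <= b for b in baskets) / sum(lhs <= b for b in baskets)

# the four example baskets from the slide
baskets = [{'milk', 'bread'}, {'milk'},
           {'milk', 'chocolate'}, {'milk', 'bread'}]

print(support({'milk', 'bread'}, baskets))                      # 0.5
print(confidence({'milk'}, {'bread'}, baskets))                 # 0.5
print(confidence({'milk', 'bread'}, {'chocolate'}, baskets))    # 0.0
```

This also shows why mining focuses on large itemsets first: both measures are just counts of itemset occurrences, so once the frequent itemsets and their counts are known, every rule's confidence is a ratio of two already-computed counts.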
Association rules - idea
Compute L(1), by scanning the database.
Repeat, for i = 2, 3, ...:
  'join' L(i-1) with itself, to generate C(i)
    (two itemsets can be joined if they agree on their first i-2 elements)
  prune the itemsets of C(i) (how?)
  scan the db, finding the counts of the C(i) itemsets - set this to be L(i)
  unless L(i) is empty, repeat the loop
(see example 6.1 in [Han+Kamber])
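The a-priori loop above can be sketched in Python. This is a minimal in-memory version with one simplification: the join step generates all size-i unions of large (i-1)-itemsets instead of the sorted first-(i-2)-elements join on the slide; function and variable names are illustrative.

```python
from itertools import combinations

def apriori(baskets, min_support):
    """Level-wise search for large itemsets: join L(i-1) with itself to
    get C(i), prune candidates with a non-large (i-1)-subset
    (anti-monotonicity), then count the survivors in one db scan."""
    items = {i for b in baskets for i in b}
    # L(1): large single items, found by scanning the database once
    L = {frozenset([i]) for i in items
         if sum(i in b for b in baskets) >= min_support}
    large = set(L)
    while L:
        # join step: merge two large itemsets into a one-larger candidate
        C = {a | b for a in L for b in L if len(a | b) == len(a) + 1}
        # prune step: every (i-1)-subset of a candidate must itself be large
        C = {c for c in C
             if all(frozenset(s) in L for s in combinations(c, len(c) - 1))}
        # count candidates with one scan; the survivors form L(i)
        L = {c for c in C if sum(c <= b for b in baskets) >= min_support}
        large |= L
    return large

# demo: A and B co-occur often, C is rare
baskets = [{'A', 'B'}, {'A', 'B', 'C'}, {'A', 'C'}, {'B'}, {'A', 'B'}]
print(sorted(sorted(s) for s in apriori(baskets, min_support=3)))
# [['A'], ['A', 'B'], ['B']]
```

The pruning answer to the slide's "(how?)" is the anti-monotonicity check in the middle: a candidate survives only if all of its (i-1)-subsets are in L(i-1).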

