CSE 634 Data Mining TechniquesReferencesOverviewBasic Concepts of Association Rule MiningAssociation Model: Problem StatementRule Measures: Support & ConfidenceSupport & Confidence : An ExampleTypes of Association Rule MiningSlide 9The Apriori Algorithm: BasicsThe Apriori Algorithm in a NutshellThe Apriori Algorithm : Pseudo codeThe Apriori Algorithm: ExampleStep 1: Generating 1-itemset Frequent PatternStep 2: Generating 2-itemset Frequent PatternStep 2: Generating 2-itemset Frequent Pattern [Cont.]Step 3: Generating 3-itemset Frequent PatternStep 3: Generating 3-itemset Frequent Pattern [Cont.]Step 4: Generating 4-itemset Frequent PatternStep 5: Generating Association Rules from Frequent ItemsetsStep 5: Generating Association Rules from Frequent Itemsets [Cont.]Slide 22Slide 23Methods to Improve Apriori’s EfficiencySlide 25Mining Frequent Patterns Without Candidate GenerationFP-Growth Method : An ExampleFP-Growth Method: Construction of FP-TreeSlide 29Mining the FP-Tree by Creating Conditional (sub) pattern basesFP-Tree Example ContinuedSlide 32Why Frequent Pattern Growth Fast ?Slide 34Association & CorrelationCorrelation ConceptsCorrelation Concepts [Cont.]Slide 38Correlation RulesSummaryQuestions ?CSE 634Data Mining TechniquesMining Association Rules in Large DatabasesPrateek Duble(105301354)Course Instructor: Prof. Anita WasilewskaState University of New York, Stony BrookState University of New York, Stony Brook 2ReferencesData Mining: Concepts & Techniques by Jiawei Han and Micheline KamberPresentation Slides of Prof. Anita WasilewskaPresentation Slides of the Course Book.“An Effective Hah Based Algorithm for Mining Association Rules” (Apriori Algorithm) by J.S. Park, M.S. Chen & P.S.Yu , SIGMOD Conference, 1995. “Mining Frequent Patterns without candidate generation” (FP-Tree Method) by J. Han, J. Pei , Y. Yin & R. Mao , SIGMOD Conference, 2000.State University of New York, Stony Brook 3OverviewBasic Concepts of Association Rule MiningThe Apriori Algorithm (Mining single dimensional boolean association rules)Methods to Improve Apriori’s EfficiencyFrequent-Pattern Growth (FP-Growth) MethodFrom Association Analysis to Correlation AnalysisSummaryState University of New York, Stony Brook 4Basic Concepts of Association Rule MiningAssociation Rule MiningFinding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.ApplicationsBasket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.ExamplesRule form: “Body Head [support, confidence]”.buys(x, “diapers”) buys(x, “beers”) [0.5%, 60%]major(x, “CS”) ^ takes(x, “DB”) grade(x, “A”) [1%, 75%]State University of New York, Stony Brook 5Association Model: Problem StatementI ={i1, i2, ...., in} a set of items J = P(I ) set of all subsets of the set of items, elements of J are called itemsetsTransaction T: T is subset of IData Base: set of transactionsAn association rule is an implication of the form : X-> Y, where X, Y are disjoint subsets of I (elements of J )Problem: Find rules that have support and confidence greater that user-specified minimum support and minimun confidenceState University of New York, Stony Brook 6Rule Measures: Support & Confidence Simple Formulas:Confidence (AB) = #tuples containing both A & B / #tuples containing A = P(B|A) = P(A U B ) / P (A) Support (AB) = #tuples containing both A & B/ total number of tuples = P(A U B) What do they actually mean ?Find all the rules X & Y Z with minimum confidence and supportsupport, s, probability that a transaction contains {X, Y, Z}confidence, c, conditional probability that a transaction having {X, Y} also contains ZState University of New York, Stony Brook 7Support & Confidence : An ExampleLet minimum support 50%, and minimum confidence 50%, then we have,A C (50%, 66.6%)C A (50%, 100%)State University of New York, Stony Brook 8Types of Association Rule MiningBoolean vs. quantitative associations (Based on the types of values handled)buys(x, “SQLServer”) ^ income(x, “DMBook”) buys(x, “DBMiner”) [0.2%, 60%]age(x, “30..39”) ^ income(x, “42..48K”) buys(x, “PC”) [1%, 75%]Single dimension vs. multiple dimensional associations (see ex. Above)Single level vs. multiple-level analysisWhat brands of beers are associated with what brands of diapers?Various extensionsCorrelation, causality analysisAssociation does not necessarily imply correlation or causalityConstraints enforcedE.g., small sales (sum < 100) trigger big buys (sum > 1,000)?State University of New York, Stony Brook 9OverviewBasic Concepts of Association Rule MiningThe Apriori Algorithm (Mining single dimensional boolean association rules)Methods to Improve Apriori’s EfficiencyFrequent-Pattern Growth (FP-Growth) MethodFrom Association Analysis to Correlation AnalysisSummaryState University of New York, Stony Brook 10The Apriori Algorithm: BasicsThe Apriori Algorithm is an influential algorithm for mining frequent itemsets for boolean association rules.Key Concepts :•Frequent Itemsets: The sets of item which has minimum support (denoted by Li for ith-Itemset).•Apriori Property: Any subset of frequent itemset must be frequent.•Join Operation: To find Lk , a set of candidate k-itemsets is generated by joining Lk-1 with itself.State University of New York, Stony Brook 11The Apriori Algorithm in a NutshellFind the frequent itemsets: the sets of items that have minimum supportA subset of a frequent itemset must also be a frequent itemseti.e., if {AB} is a frequent itemset, both {A} and {B} should be a frequent itemsetIteratively find frequent itemsets with cardinality from 1 to k (k-itemset)Use the frequent itemsets to generate association rules.State University of New York, Stony Brook 12The Apriori Algorithm : Pseudo codeJoin Step: Ck is generated by joining Lk-1with itselfPrune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemsetPseudo-code:Ck: Candidate itemset of size kLk : frequent itemset of size kL1 = {frequent items};for (k = 1; Lk !=; k++) do begin Ck+1 = candidates generated from Lk; for each transaction t in database do
View Full Document