Recap: Mining association rules from large datasets

Recap
• Task 1: Methods for finding all frequent itemsets efficiently
• Task 2: Methods for finding association rules efficiently

Recap
• Frequent itemsets (measure: support)
• Apriori principle
• Apriori algorithm for finding frequent itemsets
  – Prunes really well in practice
  – Makes multiple passes over the dataset

Making a single pass over the data: the AprioriTid algorithm
• The database is not used for counting support after the 1st pass!
• Instead, the information in the data structure C'k is used for counting support in every step
• C'k is generated from C'k-1
• For small values of k, the storage requirements for these data structures could be larger than the database!
• For large values of k, the storage requirements can be very small
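The level-wise structure recapped above (one pass per level, with candidates pruned by the Apriori principle) can be sketched compactly. This is an illustrative Python sketch, not the lecture's exact pseudocode; the function name and structure are my own.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise frequent-itemset mining: one pass over the data per level,
    pruning candidates that have an infrequent subset (Apriori principle)."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions containing every item of `itemset`.
        return sum(1 for t in transactions if itemset <= t) / n

    # Pass 1: frequent 1-itemsets.
    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]) for i in items if support(frozenset([i])) >= minsup}
    all_frequent = set(frequent)

    k = 2
    while frequent:
        # Candidate generation: join frequent (k-1)-itemsets ...
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # ... then prune candidates with an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        # One pass over the data to count support at this level.
        frequent = {c for c in candidates if support(c) >= minsup}
        all_frequent |= frequent
        k += 1
    return all_frequent
```

On the four-transaction example database used below ({A,B,C}, {A,C}, {A,D}, {B,E,F}) with minsup = 0.5, this returns {A}, {B}, {C}, and {A,C}.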
Lecture outline
• Task 1: Methods for finding all frequent itemsets efficiently
• Task 2: Methods for finding association rules efficiently

Definition: Association Rule
• Let D be a database of transactions, e.g.:

  Transaction ID   Items
  2000             A, B, C
  1000             A, C
  4000             A, D
  5000             B, E, F

• Let I be the set of items that appear in the database, e.g., I = {A, B, C, D, E, F}
• A rule is defined by X → Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅
  – e.g.: {B, C} → {A} is a rule

Definition: Association Rule
• Association rule: an implication expression of the form X → Y, where X and Y are non-overlapping itemsets
  – Example: {Milk, Diaper} → {Beer}
• Rule evaluation metrics
  – Support (s): the fraction of transactions that contain both X and Y
  – Confidence (c): measures how often items in Y appear in transactions that contain X
• Example:
  s({Milk, Diaper} → Beer) = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
  c({Milk, Diaper} → Beer) = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67

Example
  TID   date       items_bought
  100   10/10/99   {F, A, D, B}
  200   15/10/99   {D, A, C, E, B}
  300   19/10/99   {C, A, B, E}
  400   20/10/99   {B, A, D}

• What is the support and confidence of the rule {B, D} → {A}?
  – Support: percentage of tuples that contain {A, B, D} = 3/4 = 75%
  – Confidence: (number of tuples that contain {A, B, D}) / (number of tuples that contain {B, D}) = 3/3 = 100%

Association-rule mining task
• Given a set of transactions D, the goal of association-rule mining is to find all rules having
  – support ≥ minsup threshold
  – confidence ≥ minconf threshold

Brute-force algorithm for association-rule mining
• List all possible association rules
• Compute the support and confidence for each rule
• Prune rules that fail the minsup or minconf thresholds
• Computationally prohibitive!

How many association rules are there?
• Given d unique items in I:
  – Total number of itemsets = 2^d
  – Total number of possible association rules:
    R = Σ_{k=1}^{d−1} [ C(d, k) · Σ_{j=1}^{d−k} C(d−k, j) ] = 3^d − 2^(d+1) + 1
• If d = 6, R = 602 rules

Mining Association Rules
• Two-step approach:
  1. Frequent itemset generation: generate all itemsets whose support ≥ minsup
  2. Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partition of a frequent itemset
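The definitions of support and confidence can be checked directly against the four-transaction example above, along with the rule-count formula for d = 6. A minimal Python sketch (function names are my own):

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Of the transactions containing `lhs`, the fraction that also contain `rhs`."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

# The four transactions from the example table (TIDs 100-400).
T = [frozenset('FADB'), frozenset('DACEB'), frozenset('CABE'), frozenset('BAD')]

# Rule {B,D} -> {A}: support = 3/4 = 0.75, confidence = 3/3 = 1.0
s = support(frozenset('ABD'), T)
c = confidence(frozenset('BD'), frozenset('A'), T)

# Number of possible association rules over d items: R = 3^d - 2^(d+1) + 1
d = 6
R = 3**d - 2**(d + 1) + 1   # 602 for d = 6, matching the slide
```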
Rule Generation – Naive algorithm
• Given a frequent itemset X, find all non-empty subsets y ⊂ X such that the rule y → (X − y) satisfies the minimum confidence requirement
  – If {A, B, C, D} is a frequent itemset, the candidate rules are:
    ABC → D, ABD → C, ACD → B, BCD → A, A → BCD, B → ACD, C → ABD, D → ABC,
    AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB
• If |X| = k, then there are 2^k − 2 candidate association rules (ignoring the trivial rules X → ∅ and ∅ → X)

Efficient rule generation
• How can we efficiently generate rules from frequent itemsets?
  – In general, confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D)
  – But the confidence of rules generated from the same itemset does have an anti-monotone property
  – Example: for X = {A, B, C, D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
  – Why? Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule

Rule Generation for Apriori Algorithm
[Figure: the lattice of rules for one itemset; once a low-confidence rule is found, all rules below it are pruned]

Apriori algorithm for rule generation
• A candidate rule is generated by merging two rules that share the same prefix in the rule consequent
  – join(CD → AB, BD → AC) would produce the candidate rule D → ABC
• Prune rule D → ABC if there exists a subset (e.g., AD → BC) that does not have high confidence

Reducing the collection of itemsets: alternative representations and combinatorial problems

Too many frequent itemsets
• If {a1, …, a100} is a frequent itemset, then it has
  C(100, 1) + C(100, 2) + … + C(100, 100) = 2^100 − 1 ≈ 1.27 × 10^30
  frequent sub-patterns!
• There should be some more condensed way to describe the data

Frequent itemsets may be too many to be helpful
• If there are many and large frequent itemsets, enumerating all of them is costly.
• We may instead be interested in finding the boundary frequent patterns.
• Question: Is there a good definition of such a boundary?
[Figure: the itemset lattice from the empty set up to the set of all items, with a border separating the frequent itemsets from the non-frequent itemsets]

Borders of frequent itemsets
• Itemset X is more specific than itemset Y if X is a superset of Y (notation: Y < X); equivalently, Y is more general than X (notation: X > Y)
• The border: Let S be a collection of frequent itemsets and P the lattice of itemsets. The
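The "boundary frequent patterns" mentioned above can be made concrete: among a downward-closed collection S of frequent itemsets, the most specific ones (those with no frequent proper superset) describe all of S, since every frequent itemset is a subset of one of them. A small illustrative Python sketch (the helper name is my own; later slides call this most-specific set the positive border):

```python
def maximal_frequent(S):
    """Given a collection S of frequent itemsets (frozensets), keep those
    with no frequent proper superset in S: the most specific frequent
    itemsets, which bound the collection from above."""
    return {x for x in S if not any(x < y for y in S)}

# Toy downward-closed collection of frequent itemsets over items {A, B, C}.
S = {frozenset(s) for s in ['A', 'B', 'C', 'AB', 'AC']}

# {A,B} and {A,C} remain; every other frequent itemset is a subset of one of them.
border = maximal_frequent(S)
```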