1Data MiningSession 6 – Main ThemeMining Frequent Patterns,Association, and CorrelationsDr. Jean-Claude FranchittiNew York UniversityComputer Science DepartmentCourant Institute of Mathematical SciencesAdapted from course textbook resourcesData Mining Concepts and Techniques (2ndEdition)Jiawei Han and Micheline Kamber222Mining Frequent Patterns,Association, and CorrelationsMining Frequent Patterns,Association, and CorrelationsAgenda11Session OverviewSession Overview33Summary and ConclusionSummary and Conclusion3What is the class about? Course description and syllabus:» http://www.nyu.edu/classes/jcf/g22.3033-002/» http://www.cs.nyu.edu/courses/spring10/G22.3033-002/index.html Textbooks:» Data Mining: Concepts and Techniques (2ndEdition)Jiawei Han, Micheline KamberMorgan KaufmannISBN-10: 1-55860-901-6, ISBN-13: 978-1-55860-901-3, (2006) » Microsoft SQL Server 2008 Analysis Services Step by StepScott CameronMicrosoft PressISBN-10: 0-73562-620-0, ISBN-13: 978-0-73562-620-31 1st Edition (04/15/09)4Session Agenda Basic concepts and a roadmap Scalable frequent itemset mining methods Mining various kinds of association rules From association to correlation analysis Constraint-based association mining Mining colossal patterns Summary5Icons / Metaphors5Common RealizationInformationKnowledge/Competency PatternGovernanceAlignmentSolution Approach622Mining Frequent Patterns,Association, and CorrelationsMining Frequent Patterns,Association, and CorrelationsAgenda11Session OverviewSession Overview33Summary and ConclusionSummary and Conclusion7Mining Frequent Patterns, Association and Correlations – Sub-Topics Basic concepts and a road map Scalable frequent itemset mining methods Mining various kinds of association rules From association to correlation analysis Constraint-based association mining Mining colossal patterns Summary8What Is Frequent Pattern Analysis? Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining Motivation: Finding inherent regularities in data What products were often purchased together?— Beer and diapers?! What are the subsequent purchases after buying a PC? What kinds of DNA are sensitive to this new drug? Can we automatically classify web documents? Applications Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.9Why Is Freq. Pattern Mining Important? Freq. pattern: An intrinsic and important property of datasets Foundation for many essential data mining tasks Association, correlation, and causality analysis Sequential, structural (e.g., sub-graph) patterns Pattern analysis in spatiotemporal, multimedia, time-series, and stream data Classification: discriminative, frequent pattern analysis Cluster analysis: frequent pattern-based clustering Data warehousing: iceberg cube and cube-gradient Semantic data compression: fascicles Broad applications10Basic Concepts: Frequent Patterns itemset: A set of one or more items k-itemset X = {x1, …, xk} (absolute) support, or, support count of X: Frequency or occurrence of an itemset X (relative) support, s, is the fraction of transactions that contains X (i.e., the probabilitythat a transaction contains X) An itemset X is frequent if X’s support is no less than a minsupthresholdCustomerbuys diaperCustomerbuys bothCustomerbuys beerNuts, Eggs, Milk40Nuts, Coffee, Diaper, Eggs, Milk50Beer, Diaper, Eggs30Beer, Coffee, Diaper20Beer, Nuts, Diaper10Items boughtTid11Basic Concepts: Association Rules Find all the rules X Æ Y with minimum support and confidence support, s, probability that a transaction contains X ∪ Y confidence, c, conditional probabilitythat a transaction having X also contains Y Let minsup = 50%, minconf = 50% Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3Customerbuys diaperCustomerbuys bothCustomerbuys beerNuts, Eggs, Milk40Nuts, Coffee, Diaper, Eggs, Milk50Beer, Diaper, Eggs30Beer, Coffee, Diaper20Beer, Nuts, Diaper10Items boughtTid Association rules: (many more!) Beer Æ Diaper (60%, 100%) Diaper Æ Beer (60%, 75%)12Closed Patterns and Max-Patterns A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains (1001) + (1002) + … + (110000) = 2100 – 1 = 1.27*1030 sub-patterns! Solution: Mine closed patterns and max-patterns instead An itemset X is closed if X is frequent and there exists no super-pattern Y כ X, with the same support as X (proposed by Pasquier, et al. @ ICDT’99) An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y כ X (proposed by Bayardo @ SIGMOD’98) Closed pattern is a lossless compression of freq. patterns Reducing the # of patterns and rules13Closed Patterns and Max-Patterns Exercise. DB = {<a1, …, a100>, < a1, …, a50>} Min_sup = 1. What is the set of closed itemset? <a1, …, a100>: 1 < a1, …, a50>: 2 What is the set of max-pattern? <a1, …, a100>: 1 What is the set of all patterns? !!14Computational Complexity of Frequent Itemset Mining How many itemsets are potentially to be generated in the worst case? The number of frequent itemsets to be generated is senstive to the minsup threshold When minsup is low, there exist potentially an exponential number of frequent itemsets The worst case: MNwhere M: # distinct items, and N: max length of transactions The worst case complexty vs. the expected probability Ex. Suppose Walmart has 104kinds of products The chance to pick up one product 10-4 The chance to pick up a particular set of 10 products: ~10-40 What is the chance this particular set of 10 products to be frequent 103times in 109transactions?15Mining Frequent Patterns, Association and Correlations – Sub-Topics Basic concepts and a road map Scalable frequent itemset mining methods Mining various kinds of association rules From association to correlation analysis Constraint-based association mining Mining colossal patterns Summary16The Downward Closure Property and Scalable Mining Methods The downward closure property of frequent patterns Any subset of a
View Full Document