Unformatted text preview:

1Data MiningSession 6 – Main ThemeMining Frequent Patterns,Association, and CorrelationsDr. Jean-Claude FranchittiNew York UniversityComputer Science DepartmentCourant Institute of Mathematical SciencesAdapted from course textbook resourcesData Mining Concepts and Techniques (2ndEdition)Jiawei Han and Micheline Kamber222Mining Frequent Patterns,Association, and CorrelationsMining Frequent Patterns,Association, and CorrelationsAgenda11Session OverviewSession Overview33Summary and ConclusionSummary and Conclusion3What is the class about? Course description and syllabus:» http://www.nyu.edu/classes/jcf/g22.3033-002/» http://www.cs.nyu.edu/courses/spring10/G22.3033-002/index.html Textbooks:» Data Mining: Concepts and Techniques (2ndEdition)Jiawei Han, Micheline KamberMorgan KaufmannISBN-10: 1-55860-901-6, ISBN-13: 978-1-55860-901-3, (2006) » Microsoft SQL Server 2008 Analysis Services Step by StepScott CameronMicrosoft PressISBN-10: 0-73562-620-0, ISBN-13: 978-0-73562-620-31 1st Edition (04/15/09)4Session Agenda Basic concepts and a roadmap Scalable frequent itemset mining methods Mining various kinds of association rules From association to correlation analysis  Constraint-based association mining Mining colossal patterns Summary5Icons / Metaphors5Common RealizationInformationKnowledge/Competency PatternGovernanceAlignmentSolution Approach622Mining Frequent Patterns,Association, and CorrelationsMining Frequent Patterns,Association, and CorrelationsAgenda11Session OverviewSession Overview33Summary and ConclusionSummary and Conclusion7Mining Frequent Patterns, Association and Correlations – Sub-Topics Basic concepts and a road map Scalable frequent itemset mining methods Mining various kinds of association rules From association to correlation analysis Constraint-based association mining Mining colossal patterns Summary8What Is Frequent Pattern Analysis? Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set  First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining Motivation: Finding inherent regularities in data What products were often purchased together?— Beer and diapers?! What are the subsequent purchases after buying a PC? What kinds of DNA are sensitive to this new drug? Can we automatically classify web documents? Applications Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.9Why Is Freq. Pattern Mining Important? Freq. pattern: An intrinsic and important property of datasets  Foundation for many essential data mining tasks Association, correlation, and causality analysis Sequential, structural (e.g., sub-graph) patterns Pattern analysis in spatiotemporal, multimedia, time-series, and stream data  Classification: discriminative, frequent pattern analysis Cluster analysis: frequent pattern-based clustering Data warehousing: iceberg cube and cube-gradient  Semantic data compression: fascicles Broad applications10Basic Concepts: Frequent Patterns itemset: A set of one or more items k-itemset X = {x1, …, xk} (absolute) support, or, support count of X: Frequency or occurrence of an itemset X (relative) support, s, is the fraction of transactions that contains X (i.e., the probabilitythat a transaction contains X) An itemset X is frequent if X’s support is no less than a minsupthresholdCustomerbuys diaperCustomerbuys bothCustomerbuys beerNuts, Eggs, Milk40Nuts, Coffee, Diaper, Eggs, Milk50Beer, Diaper, Eggs30Beer, Coffee, Diaper20Beer, Nuts, Diaper10Items boughtTid11Basic Concepts: Association Rules Find all the rules X Æ Y with minimum support and confidence support, s, probability that a transaction contains X ∪ Y confidence, c, conditional probabilitythat a transaction having X also contains Y Let minsup = 50%, minconf = 50% Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3Customerbuys diaperCustomerbuys bothCustomerbuys beerNuts, Eggs, Milk40Nuts, Coffee, Diaper, Eggs, Milk50Beer, Diaper, Eggs30Beer, Coffee, Diaper20Beer, Nuts, Diaper10Items boughtTid Association rules: (many more!) Beer Æ Diaper (60%, 100%) Diaper Æ Beer (60%, 75%)12Closed Patterns and Max-Patterns A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains (1001) + (1002) + … + (110000) = 2100 – 1 = 1.27*1030 sub-patterns! Solution: Mine closed patterns and max-patterns instead An itemset X is closed if X is frequent and there exists no super-pattern Y כ X, with the same support as X (proposed by Pasquier, et al. @ ICDT’99)  An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y כ X (proposed by Bayardo @ SIGMOD’98) Closed pattern is a lossless compression of freq. patterns Reducing the # of patterns and rules13Closed Patterns and Max-Patterns Exercise. DB = {<a1, …, a100>, < a1, …, a50>}  Min_sup = 1. What is the set of closed itemset? <a1, …, a100>: 1 < a1, …, a50>: 2 What is the set of max-pattern? <a1, …, a100>: 1 What is the set of all patterns? !!14Computational Complexity of Frequent Itemset Mining How many itemsets are potentially to be generated in the worst case? The number of frequent itemsets to be generated is senstive to the minsup threshold When minsup is low, there exist potentially an exponential number of frequent itemsets The worst case: MNwhere M: # distinct items, and N: max length of transactions The worst case complexty vs. the expected probability Ex. Suppose Walmart has 104kinds of products  The chance to pick up one product 10-4 The chance to pick up a particular set of 10 products: ~10-40 What is the chance this particular set of 10 products to be frequent 103times in 109transactions?15Mining Frequent Patterns, Association and Correlations – Sub-Topics Basic concepts and a road map Scalable frequent itemset mining methods Mining various kinds of association rules From association to correlation analysis  Constraint-based association mining Mining colossal patterns Summary16The Downward Closure Property and Scalable Mining Methods The downward closure property of frequent patterns Any subset of a


View Full Document

NYU CSCI-GA 3033 - Data Mining

Documents in this Course
Design

Design

2 pages

Real Time

Real Time

17 pages

Load more
Download Data Mining
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Data Mining and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Data Mining 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?