CMSC828G Principles of Data Mining, Lecture #22

• Today's Reading:
  – HMS 6.7, ch. 13
• Today's Lecture:
  – Finding Patterns and Rules
  – Association Rules
• Upcoming Due Dates:
  – H3 due today
  – Project Writeup due 4/30

Big Picture
• Predictive Models: emphasis on a single attribute
• Descriptive Models: full description of the data
• Pattern Discovery: characterizes local aspects of the data

Finding Patterns and Rules
• Examples:
  – supermarket transaction database: 10% of the customers buy wine and cheese
  – telecommunications alarms database: if alarms A and B occur within 30 seconds of each other, then alarm C occurs within 60 seconds with probability 0.5
  – web log dataset: if a person visits the CNN Web site, there is a 60% chance the person will visit the ABC News Web site in the same month

Mining Rules
• Problems?
  – the number of rules grows exponentially with the number of attributes
  – among the large number of rules generated, how do we determine which are interesting?

What Is Association Mining?
• Association rule mining:
  – finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories
• Applications:
  – basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
• Examples:
  – rule form: "Body → Head [support, confidence]"
  – buys(x, "diapers") → buys(x, "beers") [0.5%, 60%]
  – major(x, "CS") ^ takes(x, "DB") → grade(x, "A") [1%, 75%]

Association Rule Flavors
• Boolean vs. quantitative associations (based on the types of values handled)
  – buys(x, "SQLServer") ^ buys(x, "DMBook") → buys(x, "DBMiner") [0.2%, 60%]
  – age(x, "30..39") ^ income(x, "42..48K") → buys(x, "PC") [1%, 75%]
• Single-dimensional vs. multi-dimensional associations
• Single-level vs. multi-level analysis
  – What brands of beers are associated with what brands of diapers?
• Various extensions
  – correlation and causality analysis
    • association does not necessarily imply correlation or causality
  – maxpatterns and closed itemsets
  – constraints enforced
    • e.g., do small sales (sum < 100) trigger big buys (sum > 1,000)?

Market Basket Analysis
• Given some set of items:
  1. a database of transactions,
  2. where each transaction is a list of items (purchased by a customer in a visit), called a basket
• Find: all rules that correlate the presence of one set of items with that of another set of items in a basket
  – e.g., 98% of people who purchase tires and auto accessories also get automotive services done
• Other problems with this structure:
  – baskets = documents; items = words
  – baskets = web pages; items = links

Market Basket Analysis, cont.
• Typically the transaction database is too large to fit in main memory
  – common sizes: 10^5–10^8 transactions over 10^2–10^6 items
  – typically sparse: about a 0.1% chance of a random purchase
  – commonly stored:
    • in a relational database, Baskets(BID, item)
    • in a flat file, (BID, item1, item2, …, itemn)
• Evaluating running time:
  – count the number of passes through the data
  – the principal cost is the time to read data from disk

Rule Measures: Support and Confidence
• Find all rules X & Y ⇒ Z with minimum confidence and support
  – support, s: proportion of transactions containing {X, Y, Z}
  – confidence, c: proportion of transactions containing {X, Y} that also contain Z
• (The slide illustrates this with a Venn diagram: customers who buy diapers, customers who buy beer, and the overlap of customers who buy both.)

Example: Support and Confidence

  Transaction ID   Items Bought
  2000             A, B, C
  1000             A, C
  4000             A, D
  5000             B, E, F

With minimum support 50% and minimum confidence 50%:
• A ⇒ C: s = 50%, c = 66.6%
• C ⇒ A: s = 50%, c = 100%
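To make the two measures concrete, here is a minimal Python sketch that computes them for the table above; the data and thresholds come from the slide, but the function names are illustrative, not from the lecture:

    def support(transactions, itemset):
        """Fraction of transactions that contain every item in `itemset`."""
        itemset = set(itemset)
        return sum(1 for t in transactions if itemset <= t) / len(transactions)

    def confidence(transactions, body, head):
        """support(body union head) / support(body)."""
        return support(transactions, set(body) | set(head)) / support(transactions, body)

    # The four transactions from the example above.
    transactions = [
        {"A", "B", "C"},   # TID 2000
        {"A", "C"},        # TID 1000
        {"A", "D"},        # TID 4000
        {"B", "E", "F"},   # TID 5000
    ]

    print(support(transactions, {"A", "C"}))       # 0.5      -> s = 50% for both rules
    print(confidence(transactions, {"A"}, {"C"}))  # 0.666... -> c = 66.6% for A => C
    print(confidence(transactions, {"C"}, {"A"}))  # 1.0      -> c = 100% for C => A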
Mining Association Rules — An Example
• The Apriori principle: any subset of a frequent itemset must be frequent

  Transaction ID   Items Bought
  2000             A, B, C
  1000             A, C
  4000             A, D
  5000             B, E, F

• With min. support 50% and min. confidence 50%:

  Frequent Itemset   Support
  {A}                75%
  {B}                50%
  {C}                50%
  {A, C}             50%

• For the rule A ⇒ C:
  – support = support({A, C}) = 50%
  – confidence = support({A, C}) / support({A}) = 66.6%

Finding Frequent Itemsets
• Find the frequent itemsets: the sets of items that have minimum support
  – Monotonicity principle: a subset of a frequent itemset must also be a frequent itemset
    • i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
  – Iteratively find frequent itemsets of cardinality 1 through k (k-itemsets)
• Use the frequent itemsets to generate association rules (a small sketch of this step appears at the end of these notes)

Finding Frequent Itemsets, cont.
• Two approaches:
  – Proceed levelwise: first find frequent items (sets of size 1), then frequent pairs, then frequent triples
    • the most time-consuming step is often finding the pairs
    • one pass over the data is made for each level
  – Find all maximal frequent itemsets (i.e., sets S such that no proper superset of S is frequent)
    • one or several passes over the data

The Apriori Algorithm
• Join step: C_k is generated by joining L_{k-1} with itself
• Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset

  C_k: candidate itemsets of size k
  L_k: frequent itemsets of size k

  L_1 = {frequent items};
  for (k = 1; L_k != ∅; k++) do begin
      C_{k+1} = candidates generated from L_k;
      for each transaction t in the database do
          increment the count of all candidates in C_{k+1} that are contained in t;
      L_{k+1} = candidates in C_{k+1} with min_support;
  end
  return ∪_k L_k;

The Apriori Algorithm — Example
Database D, with min. support = 2 transactions (50%):

  TID   Items
  100   1 3 4
  200   2 3 5
  300   1 2 3 5
  400   2 5

Scan D to count the candidate 1-itemsets C1, and keep those with minimum support as L1:

  C1: {1} 2, {2} 3, {3} 3, {4} 1, {5} 3
  L1: {1} 2, {2} 3, {3} 3, {5} 3

Join L1 with itself to form C2, scan D to count, and keep L2:

  C2: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
  counts: {1 2} 1, {1 3} 2, {1 5} 1, {2 3} 2, {2 5} 3, {3 5} 2
  L2: {1 3} 2, {2 3} 2, {2 5} 3, {3 5} 2

Join L2 with itself to form C3, scan D to count, and keep L3:

  C3: {2 3 5}
  L3: {2 3 5} 2

How to Generate Candidates?
• Suppose the items in L_{k-1} are ordered
• Step 1: self-joining L_{k-1}

  insert into C_k
  select p.item_1, p.item_2, …, p.item_{k-1}, q.item_{k-1}
  from L_{k-1} p, L_{k-1} q
  where p.item_1 = q.item_1, …, p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}

• Step 2: pruning

  forall itemsets c in C_k do
      forall (k-1)-subsets s of c do
          if (s is not in L_{k-1}) then delete c from C_k

How to Count Supports of Candidates?
• Why is counting the supports of candidates a problem?
  – the total number of candidates can be huge
  – one transaction may contain many candidates
• Method:
  – candidate itemsets are stored in a hash tree
  – a leaf node of the hash tree contains a list of itemsets and counts
  – an interior node contains a hash table
  – a subset function finds all the candidates contained in a transaction
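The following is a compact Python sketch of the whole level-wise loop, combining the join and prune steps above with per-level support counting. For clarity it counts supports with a direct subset test against each transaction; the hash tree described above is an optimization of exactly that counting step. All names here are illustrative, not from the lecture:

    from itertools import combinations

    def apriori_gen(L_prev, k):
        """Generate candidate k-itemsets from the frequent (k-1)-itemsets.

        Itemsets are represented as sorted tuples.
        """
        prev = sorted(L_prev)
        prev_set = set(prev)
        candidates = []
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                p, q = prev[i], prev[j]
                if p[:k - 2] == q[:k - 2]:      # join: first k-2 items agree
                    c = p + (q[-1],)            # (p's last item < q's last item)
                    # prune: every (k-1)-subset of c must itself be frequent
                    if all(s in prev_set for s in combinations(c, k - 1)):
                        candidates.append(c)
        return candidates

    def apriori(transactions, min_support):
        """Map each frequent itemset (a sorted tuple) to its support count."""
        threshold = min_support * len(transactions)   # absolute count threshold
        counts = {}                                   # level 1: count single items
        for t in transactions:
            for item in t:
                counts[(item,)] = counts.get((item,), 0) + 1
        L = {c: n for c, n in counts.items() if n >= threshold}
        frequent = dict(L)
        k = 2
        while L:
            C_k = apriori_gen(L, k)
            counts = {c: 0 for c in C_k}
            for t in transactions:                    # one pass over the data per level
                t = set(t)
                for c in C_k:
                    if set(c) <= t:
                        counts[c] += 1
            L = {c: n for c, n in counts.items() if n >= threshold}
            frequent.update(L)
            k += 1
        return frequent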

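Running the sketch on the four-transaction database from the worked example reproduces the L1, L2, and L3 shown above:

    D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
    print(apriori(D, min_support=0.5))
    # Frequent itemsets and counts (dict ordering may vary):
    # {(1,): 2, (2,): 3, (3,): 3, (5,): 3,
    #  (1, 3): 2, (2, 3): 2, (2, 5): 3, (3, 5): 2, (2, 3, 5): 2}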

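Finally, a small sketch of the rule-generation step mentioned under "Finding Frequent Itemsets": every split of a frequent itemset into a body and a head is a candidate rule, and its confidence is the itemset's support divided by the body's support, exactly as in the A ⇒ C example. Again the function name is mine, and the code builds on the `apriori` sketch and database `D` above:

    from itertools import combinations

    def generate_rules(frequent, min_confidence):
        """Yield (body, head, confidence) for rules that meet min_confidence."""
        for itemset, count in frequent.items():
            if len(itemset) < 2:
                continue
            for r in range(1, len(itemset)):
                for body in combinations(itemset, r):
                    head = tuple(i for i in itemset if i not in body)
                    conf = count / frequent[body]   # body is frequent by monotonicity
                    if conf >= min_confidence:
                        yield body, head, conf

    for body, head, conf in generate_rules(apriori(D, 0.5), min_confidence=0.5):
        print(body, "=>", head, f"{conf:.0%}")
    # e.g. (1,) => (3,) 100%   and   (3,) => (1,) 67%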