Stanford CS 361A - Data Mining - Association Rules

CS 361A (Advanced Data Structures and Algorithms)
Lecture 20 (Dec 7, 2005)
Data Mining: Association Rules
Rajeev Motwani (partially based on notes by Jeff Ullman)

Slide outline: Association Rules Overview; Association Rules; Market-Basket Model; Example; Application 1 (Retail Stores); Application 2 (Information Retrieval); Application 3 (Web Search); Scale of Problem; Finding Association Rules; Computation Model; Finding Frequent Pairs; Monotonicity Property; A-Priori Algorithm; Memory Usage – A-Priori; PCY Idea; Memory Usage – PCY; PCY Algorithm; Multistage PCY Algorithm; Memory Usage – Multistage PCY; Finding Larger Itemsets; Approximation Techniques; Sampling Algorithm; SON Algorithm; Toivonen's Algorithm; Low-Support, High-Correlation; Matrix Representation; Column Similarity; Identifying Similar Columns?; Key Observation; Min Hashing; Min-Hash Signatures; Implementation Trick; Comparing Signatures; Locality-Sensitive Hashing; Band-Hash Analysis; LSH Summary; Densifying – Amplification of 1's; Using Hamming LSH; Summary; References

Association Rules Overview
1. Market Baskets & Association Rules
2. Frequent itemsets
3. A-Priori algorithm
4. Hash-based improvements
5. One- or two-pass approximations
6. High-correlation mining

Association Rules
• Two traditions
  – DM is the science of approximating joint distributions
    • Representation of the process generating the data
    • Predict P[E] for interesting events E
  – DM is technology for fast counting
    • Can compute certain summaries quickly
    • Let's try to use them
• Association rules
  – Capture interesting pieces of the joint distribution
  – Exploit fast-counting technology

Market-Basket Model
• Large sets
  – Items A = {A1, A2, …, Am}
    • e.g., products sold in a supermarket
  – Baskets B = {B1, B2, …, Bn}
    • small subsets of items in A
    • e.g., items bought by a customer in one transaction
• Support – sup(X) = number of baskets containing itemset X
• Frequent Itemset Problem
  – Given – support threshold s
  – Frequent itemsets – those X with sup(X) ≥ s
  – Find – all frequent itemsets

Example
• Items A = {milk, coke, pepsi, beer, juice}
• Baskets
  B1 = {m, c, b}    B2 = {m, p, j}
  B3 = {m, b}       B4 = {c, j}
  B5 = {m, p, b}    B6 = {m, c, b, j}
  B7 = {c, b, j}    B8 = {b, c}
• Support threshold s = 3
• Frequent itemsets: {m}, {c}, {b}, {j}, {m, b}, {c, b}, {j, c}

Application 1 (Retail Stores)
• Real market baskets
  – chain stores keep TBs of customer purchase info
  – Value?
    • how typical customers navigate stores
    • positioning tempting items
    • suggests "tie-in tricks" – e.g., hamburger sale while raising ketchup price
    • …
• High support needed, or no $$'s

Application 2 (Information Retrieval)
• Scenario 1
  – baskets = documents
  – items = words in documents
  – frequent word-groups = linked concepts
• Scenario 2
  – items = sentences
  – baskets = documents containing sentences
  – frequent sentence-groups = possible plagiarism

Application 3 (Web Search)
• Scenario 1
  – baskets = web pages
  – items = outgoing links
  – pages with similar references → about the same topic
• Scenario 2
  – baskets = web pages
  – items = incoming links
  – pages with similar in-links → mirrors, or the same topic

Scale of Problem
• WalMart
  – sells m = 100,000 items
  – tracks n = 1,000,000,000 baskets
• Web
  – several billion pages
  – one new "word" per page
• Assumptions
  – m small enough for a small amount of memory per item
  – m too large for memory per pair or per k-set of items
  – n too large for memory per basket
  – Very sparse data – rare for an item to be in a given basket
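Below is a small brute-force sketch (not from the lecture) of the definitions above: it computes sup(X) directly over the baskets from the Example slide and keeps every itemset of size at most 2 with support at least s = 3. The code is Python and the function names are illustrative.

```python
from itertools import combinations

# Baskets from the Example slide, with items abbreviated as on the slide.
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"},
    {"m", "b"},      {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"},
    {"c", "b", "j"}, {"b", "c"},
]

def support(itemset, baskets):
    """sup(X): number of baskets containing every item of X."""
    return sum(1 for b in baskets if itemset <= b)

def frequent_itemsets(baskets, s, max_size=2):
    """Brute force: try every itemset of size <= max_size, keep those with sup(X) >= s."""
    items = sorted(set().union(*baskets))
    return [set(c) for k in range(1, max_size + 1)
                   for c in combinations(items, k)
                   if support(set(c), baskets) >= s]

print(frequent_itemsets(baskets, s=3))
# Frequent itemsets at s = 3: {m}, {c}, {b}, {j}, {m,b}, {c,b}, {c,j} -- as on the slide.
```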
Association Rules
• If-then rules about basket contents
  – {A1, A2, …, Ak} → Aj
  – if a basket has X = {A1, …, Ak}, then it is likely to have Aj
• Confidence – probability of Aj given A1, …, Ak:
  conf(X → Aj) = sup(X ∪ {Aj}) / sup(X)
• Support (of the rule):
  sup(X → Aj) = sup(X ∪ {Aj})

Example
  B1 = {m, c, b}    B2 = {m, p, j}
  B3 = {m, b}       B4 = {c, j}
  B5 = {m, p, b}    B6 = {m, c, b, j}
  B7 = {c, b, j}    B8 = {b, c}
• Association rule
  – {m, b} → c
  – Support = 2
  – Confidence = 2/4 = 50%

Finding Association Rules
• Goal – find all association rules such that
  – support ≥ s
  – confidence ≥ c
• Reduction to the Frequent Itemsets Problem
  – Find all frequent itemsets X
  – Given X = {A1, …, Ak}, generate all rules X−Aj → Aj
  – Confidence = sup(X) / sup(X−Aj)
  – Support = sup(X)
• Observe – X−Aj is also frequent, so its support is already known

Computation Model
• Data storage
  – Flat files, rather than a database system
  – Stored on disk, basket-by-basket
• Cost measure – number of passes
  – Count disk I/O only
  – Given the data size, avoid random seeks and do linear scans
• Main-memory bottleneck
  – Algorithms maintain count-tables in memory
  – Limitation on the number of counters
  – Disk-swapping count-tables is a disaster

Finding Frequent Pairs
• Frequent 2-sets
  – hard case already
  – focus for now; later extend to k-sets
• Naïve algorithm
  – Counters – all m(m−1)/2 item pairs
  – Single pass – scanning all baskets
  – Basket of size b – increments b(b−1)/2 counters
• Failure?
  – if memory < m(m−1)/2 counters
  – even for m = 100,000

Monotonicity Property
• Underlies all known algorithms
• Monotonicity property
  – Given itemsets X ⊆ Y
  – Then sup(X) ≥ sup(Y), so sup(Y) ≥ s implies sup(X) ≥ s
• Contrapositive (for 2-sets)
  – sup({Ai}) < s implies sup({Ai, Aj}) < s

A-Priori Algorithm
• A-Priori – 2-pass approach in limited memory
• Pass 1
  – m counters (candidate items in A)
  – Linear scan of baskets b
  – Increment counters for each item in b
• Mark as frequent the f items with count at least s
• Pass 2
  – f(f−1)/2 counters (candidate pairs of frequent items)
  – Linear scan of baskets b
  – Increment counters for each pair of frequent items in b
• Failure – if memory < m + f(f−1)/2 counters

Memory Usage – A-Priori
[Figure: memory layout – Pass 1 holds counters for all candidate items; Pass 2 holds the table of frequent items plus counters for candidate pairs.]

PCY Idea
• Improvement upon A-Priori
• Observe – during Pass 1, memory is mostly idle
• Idea
  – Use the idle memory for a hash table H
  – Pass 1 – hash pairs from b into H
  – Increment counter at the hash location
  – At end – bitmap of high-frequency hash locations
  – Pass 2 – bitmap gives an extra condition candidate pairs must satisfy
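To make the two-pass structure concrete, here is a minimal in-memory sketch of A-Priori for frequent pairs, following the Pass 1 / Pass 2 description above. In the lecture's cost model the baskets would be streamed from disk on each pass with only the counters held in memory; here they are simply a Python list, and names such as apriori_pairs are illustrative rather than from the course.

```python
from collections import Counter
from itertools import combinations

def apriori_pairs(baskets, s):
    # Pass 1: one counter per item, filled in a single linear scan of the baskets.
    item_counts = Counter()
    for basket in baskets:
        item_counts.update(basket)
    frequent_items = {i for i, c in item_counts.items() if c >= s}  # the f frequent items

    # Pass 2: counters only for pairs of frequent items (about f(f-1)/2 of them).
    # By monotonicity, no other pair can reach support s.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(i for i in basket if i in frequent_items)
        pair_counts.update(combinations(kept, 2))
    frequent_pairs = {p: c for p, c in pair_counts.items() if c >= s}
    return frequent_items, frequent_pairs

# The Example baskets with threshold s = 3:
baskets = [{"m","c","b"}, {"m","p","j"}, {"m","b"}, {"c","j"},
           {"m","p","b"}, {"m","c","b","j"}, {"c","b","j"}, {"b","c"}]
items, pairs = apriori_pairs(baskets, s=3)
print(items)   # the frequent items {m, c, b, j} (set order may vary)
print(pairs)   # {('b', 'c'): 4, ('b', 'm'): 4, ('c', 'j'): 3}
```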

