Rutgers University CS 336 - Principles of Information and Database Management

Principles of Information and Database Management
198:336, Week 12 – Apr 25
Matthew Stone

Outline
Project Update
Data Mining: Answers without Queries
– Patterns and statistics
– Finding frequent item sets
– Classification and regression trees

Project Update
One week left
– My office hours tomorrow 4-6
– Yangzhe's office hours 7-9
– Wednesday: consultation with Vlad in lieu of recitation
Make sure you have something working by Wednesday!

Project Update
Hand in Monday by 6.
– Email to mds.
– URL of working system
– Zip/Tar file of code
– Suggested tour

Project Update
Useful tool: sessions

  HttpSession s = request.getSession();
  Object o = s.getAttribute("attribute");
  s.setAttribute("another", o);

(Sessions will get lost when the server restarts.)

Data Mining
SQL is about answering specific questions.
What if you don't know the question to ask?
– What's interesting about this data?
– What's going on here?
– What happens a lot?
Data mining!

Limits of Data Mining
Randomness
– Some things just happen for no reason
– In large data sets, you may see this a lot

Limits of Data Mining
Sparse data
– Beware of breaking up data
– The amount of data available decreases exponentially in the number of constraints

Human in the loop
Selecting data to explore
Cleaning data
– Minimizing noise, outliers, and discrepancies in format; organizing data into new and better tables
Evaluating results
– Understanding what's happening
– Explaining it to the boss

Finding frequent item sets
Problem of associations
– What items go together in a table
– Example: market basket
  • What items tended to be bought together?

Finding frequent item sets
Sample table:

  transact  item
  111       pen
  111       ink
  111       milk
  111       juice
  112       pen
  112       ink
  112       milk
  113       pen
  113       milk
  114       pen
  114       ink
  114       juice
  114       water

Factoids
In 75% of transactions a pen and ink are purchased together.
In 25% of transactions milk and juice are purchased together.
…

Definition
Itemset: a set of items
Support: the fraction of transactions that contain all the items in the itemset
Frequent itemsets: all itemsets whose support exceeds some threshold

Example
Frequent itemsets at 70%
– {pen}, {ink}, {milk}, {pen,ink}, {pen,milk}

Efficient Algorithm
Key property
– Every subset of a frequent itemset is also a frequent itemset

Algorithm step 1
Identify the frequent itemsets with one item:

  select item from table
  group by item
  having count(*) > threshold

Algorithm step 2
Iteratively try to build larger frequent itemsets out of the ones you've found already.
For each new frequent itemset Ik with k items:
– generate all candidate itemsets I(k+1) with k+1 items
– scan all the transactions once
– check if the new itemsets are frequent
– set k = k+1

Algorithm step 3
Stop when no new frequent itemsets are identified.

Mining for association rules
Association rules
– LHS ⇒ RHS
– LHS is a set of items
– RHS is a set of items

Example
{pen} ⇒ {ink}
If a pen is purchased in a transaction, it is likely that ink is also purchased in that transaction.

Measures
Support
– Percentage of transactions that contain LHS ∪ RHS
Confidence
– Percentage of LHS transactions that also contain RHS
– Support of (LHS ∪ RHS) / Support of LHS

Finding them
First, find frequent itemsets.
Then create possible rules from the frequent itemsets
– Keep those with high confidence.

Example
{pen, milk}: support is 75%
{pen} ⇒ {milk}: confidence is 75%
{milk} ⇒ {pen}: confidence is 100%
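To make the levelwise algorithm above concrete, here is a minimal SQL sketch. It assumes the sample table is stored as Purchases(transact, item) (a hypothetical name; the slides just say "table") and uses an absolute count threshold of 3, i.e. 75% of the 4 transactions.

  -- Step 1: frequent itemsets with one item (support >= 3 of 4 transactions).
  SELECT item
  FROM Purchases
  GROUP BY item
  HAVING COUNT(*) >= 3;

  -- Step 2: candidate 2-itemsets and their supports.  Joining two copies of
  -- Purchases on the transaction id pairs up items bought together in the
  -- same transaction; the condition p1.item < p2.item keeps each pair once.
  SELECT p1.item AS item1, p2.item AS item2, COUNT(*) AS support_count
  FROM Purchases p1
  JOIN Purchases p2
    ON p1.transact = p2.transact
   AND p1.item < p2.item
  GROUP BY p1.item, p2.item
  HAVING COUNT(*) >= 3;

On the sample table these queries return {pen}, {ink}, {milk} and then {ink, pen}, {milk, pen}, matching the frequent itemsets listed in the example above.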
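Confidence can also be computed directly in SQL. The query below is a sketch of the {pen} ⇒ {milk} example: it divides the number of transactions containing both items by the number containing pen, again assuming the hypothetical Purchases table.

  -- Confidence of {pen} => {milk} = support({pen, milk}) / support({pen}).
  SELECT
    (SELECT COUNT(DISTINCT transact)
       FROM Purchases
      WHERE item = 'pen'
        AND transact IN (SELECT transact FROM Purchases WHERE item = 'milk'))
    * 1.0
    /
    (SELECT COUNT(DISTINCT transact)
       FROM Purchases
      WHERE item = 'pen') AS confidence;

On the sample table this gives 3/4 = 75%, as in the example above.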
Statistical Perspective
Is L ⇒ R surprising?
– Statistical independence
– Support tells us P(L & R); compare it with P(L) × P(R)
– Not so interesting if P(L & R) = P(L) × P(R)

Correlation and prediction
We want L ⇒ R to be associated with causality.
Basic idea of causality:
– Even if we intervene to change how the value of L is determined,
– we still get the same correlation with R.

Correlation and Prediction
For example, with {pen} ⇒ {ink}:
– If we change why people buy pens, we still want them to buy ink too.
– For example, we can lower the price of pens.

Problem
Things can go together for other reasons.

CART
Classification and regression trees

Tree structured rules
A node either makes a prediction
– E.g., classify into a particular class
or looks at a variable/field
– Tests its value
– Discrete fields: test if it equals a specific value
– Numerical fields: test if it is > a threshold
– Recurse

Example
Insurance info relation:

  age  cartype  highrisk
  23   sedan    false
  30   sports   false
  36   sedan    false
  25   truck    true
  30   sedan    false
  23   truck    true
  30   truck    false
  25   sports   true
  18   sedan    false

Example – visualization
The same data laid out as a grid of high-risk values (T/F) by age and car type:

  Age/cartype  sedan  sports  truck
  18           F
  23           F              T
  25                  T       T
  30           F      F       F
  36           F

Classification tree
– Root: test Age
  – Age > 25: predict No (not high risk)
  – Age <= 25: test Car Type
    – Car Type = Sedan: predict No
    – Car Type ≠ Sedan: predict Yes

Visualization in Space
(A sequence of slides shows the same Age/cartype grid with the first split and then the second split drawn over it.)

Top-down greedy algorithm
BuildTree(data D)
  Find the best split of D into D1 and D2
  BuildTree(D1)
  BuildTree(D2)

Visualization in Space
(Further slides show the same grid with the possible splits drawn over it.)

Supporting this with SQL
Attribute-value Class Sets (AVCs):

  SELECT attribute, class, COUNT(*)
  FROM table
  GROUP BY attribute, class

Supporting this with SQL
For example:

  SELECT age, highrisk, COUNT(*)
  FROM InsuranceInfo
  GROUP BY age, highrisk

Supporting this with SQL
This gives a "slice" through the database space:

  Age  True  False
  18   0     1
  23   1     1
  25   2     0
  30   0     3
  36   0     1

Supporting this with SQL
This is enough to find candidate split values and to determine how pure each resulting set is (see the split-selection sketch below).

Algorithm with SQL support
BuildTree(data D)
  Scan the data and construct the AVC group
  Use the AVC group to split into D1 and D2
  BuildTree(D1)
  BuildTree(D2)

CART and Statistics
Sparse
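As a small illustration of how the AVC counts support split selection, the query below is a sketch against the InsuranceInfo relation from the example: it computes the class counts on each side of the candidate split age <= 25, so the purity of the two sides can be compared with that of other candidate splits.

  -- Class counts on each side of the candidate split "age <= 25".
  SELECT CASE WHEN age <= 25 THEN 'age <= 25' ELSE 'age > 25' END AS side,
         highrisk,
         COUNT(*) AS cnt
  FROM InsuranceInfo
  GROUP BY CASE WHEN age <= 25 THEN 'age <= 25' ELSE 'age > 25' END,
           highrisk;

On the example relation this gives 3 true / 2 false on the left side and 4 false / 0 true on the right, which is why Age <= 25 is the first split in the classification tree shown earlier.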

