Unsupervised Data Mining
Association Rule Learning

PR, ANN, & ML

Association Rule Analysis
- Popular in mining databases
- Automated discovery of sets of variables that occur together frequently, or of one or more variables leading to other(s)

Association Rule Analysis (cont)

Market Basket Analysis
- Retail outlets
  - Placement of merchandise (affinity positioning)
  - Cross advertising
- Banks
- Insurance: link analysis for fraud
- Medical: symptom analysis

Co-occurrence Matrix
Customer 1: beer, pretzels, potato chips, aspirin
Customer 2: diapers, baby lotion, grapefruit juice, baby food, milk
Customer 3: soda, potato chips, milk
Customer 4: soup, beer, milk, ice cream
Customer 5: soda, coffee, milk, bread
Customer 6: beer, potato chips
- Interesting cases can have 10^4 variables and 10^8 samples
- Co-occurrence gives only pair-wise association

Practical Solutions
- Run up against the curse of dimensionality
  - With 10^4 variables, each with many possible values, a very large number of samples is needed to populate the space; "bump" hunting at fine scale is not possible
- Look for regions in the probability space with high density
- Even for binary variables, there are 2^k (e.g., 2^{1,000}) possible 1,0-tuples, so efficient search algorithms are a must

Simplification
- Assume binary variables
  - If they are not, force them to be binary
  - Instead of 6 different education levels, just 2 (college and above, or below)
- Change of variables
  - Initially (X1, …, Xp), each with (S1, …, Sp) possible values
  - K = S1 + … + Sp
  - Create binary variables Zk:
    1 if the corresponding variable Xi assumes value Sij
    0 otherwise

Apriori Algorithm
- Threshold t
- 1st pass: single-variable sets: each must have occurrence larger than t
- 2nd pass: pair-wise variable sets: together must have occurrence larger than t
- …
- m-th pass: only those tuples from the (m-1)-th pass with probability higher than t are considered
- To avoid combinatorial explosion, t cannot be too low

Tuples to Rules
- Tuples {Zk} to A => B
  - A: antecedent
  - B: consequent
- T(A=>B): support, the probability of simultaneously observing A and B, P(A&B)
- C(A=>B) = T(A=>B)/T(A): confidence, the probability P(B|A)
- L(A=>B) = C(A=>B)/T(B): lift, P(A&B)/(P(A)P(B))

Examples
- K = {peanut butter, jelly, bread}
- {peanut butter, jelly} => bread
- Support of 0.03: {peanut butter, jelly, bread} appears in 3% of sample baskets
- Confidence of 82%: if peanut butter and jelly are purchased, then 82% of the time bread is purchased as well
- Lift of 1.9: if bread appears in 43% of sample baskets, then 0.82/0.43 = 1.9
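
The Apriori passes and the support, confidence, and lift definitions can be tried out on the six customer baskets from the co-occurrence example. The following is only a minimal sketch: the function names (`support`, `apriori`, `rule_metrics`) and the threshold 0.3 are illustrative assumptions, not taken from the slides, and the candidate step (joining surviving sets and requiring every smaller subset to have survived) is one common way to realize the m-th pass rather than necessarily the variant intended here.

```python
from itertools import combinations

# Baskets from the co-occurrence example, one set of items per customer.
baskets = [
    {"beer", "pretzels", "potato chips", "aspirin"},
    {"diapers", "baby lotion", "grapefruit juice", "baby food", "milk"},
    {"soda", "potato chips", "milk"},
    {"soup", "beer", "milk", "ice cream"},
    {"soda", "coffee", "milk", "bread"},
    {"beer", "potato chips"},
]


def support(itemset, baskets):
    """T(itemset): fraction of baskets that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= b for b in baskets) / len(baskets)


def apriori(baskets, t):
    """Return {itemset: support} for every itemset with support larger than t."""
    items = sorted(set().union(*baskets))
    # 1st pass: single-variable sets must exceed the threshold.
    current = {frozenset([i]) for i in items if support([i], baskets) > t}
    frequent = {}
    size = 1
    while current:
        for s in current:
            frequent[s] = support(s, baskets)
        size += 1
        # Candidates for this pass: unions of survivors that have `size` items.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == size}
        # Keep a candidate only if all its (size-1)-subsets survived the
        # previous pass and the candidate itself exceeds the threshold.
        current = {
            c for c in candidates
            if all(frozenset(sub) in frequent
                   for sub in combinations(c, size - 1))
            and support(c, baskets) > t
        }
    return frequent


def rule_metrics(antecedent, consequent, baskets):
    """Return (support, confidence, lift) for the rule antecedent => consequent."""
    a, b = frozenset(antecedent), frozenset(consequent)
    t_ab = support(a | b, baskets)           # T(A=>B) = P(A & B)
    confidence = t_ab / support(a, baskets)  # C(A=>B) = T(A=>B) / T(A)
    lift = confidence / support(b, baskets)  # L(A=>B) = C(A=>B) / T(B)
    return t_ab, confidence, lift


if __name__ == "__main__":
    for itemset, s in sorted(apriori(baskets, t=0.3).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(sorted(itemset), round(s, 3))
    print(rule_metrics({"beer"}, {"potato chips"}, baskets))
```

With only six baskets the threshold has to be fairly high for the pruning to be visible; on realistic data with many variables, lowering t quickly increases the number of candidate sets, which is the combinatorial explosion the slides warn about.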