slide 1: Privacy-Preserving Data Mining
Vitaly Shmatikov
CS 380S

slide 2: Reading Assignment
• Evfimievski, Gehrke, Srikant. "Limiting Privacy Breaches in Privacy-Preserving Data Mining" (PODS 2003).
• Blum, Dwork, McSherry, and Nissim. "Practical Privacy: The SuLQ Framework" (PODS 2005).

slide 3: Input Perturbation
• Reveal the entire database, but randomize the entries: publish (x1+δ1, ..., xn+δn) instead of (x1, ..., xn)
• Add random noise δi to each database entry xi
• For example, if the noise distribution has mean 0, the user can still compute the average of the xi

slide 4: Output Perturbation
• Randomize the response to each query
• User sends a query (S, f), where S is a set of rows and f is a function on rows
• True response: Σi∈S f(xi)
• Database adds random noise to the true response and returns Σi∈S f(xi) + noise

slide 5: Concepts of Privacy
• Weak: no single database entry has been revealed
• Stronger: no single piece of information is revealed (what's the difference from the "weak" version?)
• Strongest: the adversary's beliefs about the data have not changed

slide 6: Kullback-Leibler Distance
• Measures the "difference" between two probability distributions:
  KL(P ‖ Q) = Σx P(x) log( P(x) / Q(x) )

slide 7: Privacy of Input Perturbation
• X is a random variable, R is the randomization operator, Y = R(X) is the perturbed database
• Naïve: measure mutual information between the original and randomized databases
  – Average KL distance between (1) the distribution of X and (2) the distribution of X conditioned on Y = y:
    Ey [ KL( P(X | Y=y) ‖ P(X) ) ]
  – Intuition: if this distance is small, then Y leaks little information about the actual values of X
• Why is this definition problematic?

slide 8: Input Perturbation Example
• Age is an integer between 0 and 90
• Randomize database entries by adding random integers between -20 and 20
• The randomization operator has to be public (why?)
• Example database (Name: Age): Gladys: 85, Doris: 90, Beryl: 82
  Randomized: Gladys: 72, Doris: 110, Beryl: 85
• Doris's age is 90!! (ages are at most 90 and the noise is at most +20, so a randomized value of 110 reveals that the true age is exactly 90)

slide 9: Privacy Definitions
• Mutual information can be small on average, but an individual randomized value can still leak a lot of information about the original value
• Better: consider some property Q(x)
  – Adversary has a priori probability Pi that Q(xi) is true
• Privacy breach if revealing yi = R(xi) significantly changes the adversary's probability that Q(xi) is true
  – Intuition: the adversary learned something about entry xi (namely, the likelihood of property Q holding for this entry)

slide 10: Example
• Data: x ∈ {0, 1, ..., 1000}, with p(x=0) = 0.01 and p(x=a) = 0.00099 for each a ≠ 0 (so the probabilities sum to 0.01 + 1000 × 0.00099 = 1)
• Reveal y = R(x)
• Three possible randomization operators R:
  – R1(x) = x with prob. 20%; uniform over the domain with prob. 80%
  – R2(x) = x + ξ mod 1001, where ξ is uniform in [-100, 100]
  – R3(x) = R2(x) with prob. 50%; uniform with prob. 50%
• Which randomization operator is better?

slide 11: Some Properties
• Q1(x): x = 0;  Q2(x): x ∉ {200, ..., 800}
• What are the a priori probabilities for a given x that these properties hold?
  – Q1(x): 1%,  Q2(x): 40.5%
• Now suppose the adversary learned that y = R(x) = 0. What are the probabilities of Q1(x) and Q2(x)?
  – If R = R1, then Q1(x): 71.6%, Q2(x): 83%
  – If R = R2, then Q1(x): 4.8%, Q2(x): 100%
  – If R = R3, then Q1(x): 2.9%, Q2(x): 70.8%

slide 12: Privacy Breaches
• R1(x) leaks information about property Q1(x)
  – Before seeing R1(x), the adversary thinks the probability of x = 0 is only 1%, but after noticing that R1(x) = 0, the probability that x = 0 is 72%
• R2(x) leaks information about property Q2(x)
  – Before seeing R2(x), the adversary thinks the probability of x ∉ {200, ..., 800} is 41%, but after noticing that R2(x) = 0, the probability that x ∉ {200, ..., 800} is 100%
• The randomization operator should be such that the posterior distribution is close to the prior distribution for any property
• (The posteriors quoted above can be checked with the short script below.)
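A minimal sketch in Python (not part of the original slides) that recomputes the posteriors quoted on slides 11-12 by applying Bayes' rule over the 1001-point domain; the prior and the three operators follow slide 10, and all names here are my own.

```python
# Recompute the posteriors from slides 11-12 by direct Bayes' rule.
N = 1001                                  # x ranges over {0, ..., 1000}
prior = [0.01 if x == 0 else 0.00099 for x in range(N)]

def p_R1(y, x):
    # R1: keep x with prob. 20%, uniform over the domain with prob. 80%
    return 0.2 * (y == x) + 0.8 / N

def p_R2(y, x):
    # R2: add noise uniform in [-100, 100], mod 1001
    return 1 / 201 if min((y - x) % N, (x - y) % N) <= 100 else 0.0

def p_R3(y, x):
    # R3: apply R2 with prob. 50%, uniform with prob. 50%
    return 0.5 * p_R2(y, x) + 0.5 / N

def posterior(q, op, y=0):
    # P(Q(x) | R(x) = y) via Bayes' rule over the whole domain
    joint = [op(y, x) * prior[x] for x in range(N)]
    return sum(j for x, j in enumerate(joint) if q(x)) / sum(joint)

Q1 = lambda x: x == 0                     # property Q1(x): x = 0
Q2 = lambda x: not (200 <= x <= 800)      # Q2(x): x outside {200,...,800}

for name, op in [("R1", p_R1), ("R2", p_R2), ("R3", p_R3)]:
    print(name, round(posterior(Q1, op), 3), round(posterior(Q2, op), 3))
# Prints R1 0.717 0.83 / R2 0.048 1.0 / R3 0.029 0.708,
# matching slide 11 up to rounding.
```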
slide 13: Privacy Breach: Definitions  [Evfimievski et al.]
• Q(x) is some property; ρ1, ρ2 are probabilities, with ρ1 "very unlikely" and ρ2 "very likely"
• Straight privacy breach: P(Q(x)) ≤ ρ1, but P(Q(x) | R(x)=y) ≥ ρ2
  – Q(x) is unlikely a priori, but likely after seeing the randomized value of x
• Inverse privacy breach: P(Q(x)) ≥ ρ2, but P(Q(x) | R(x)=y) ≤ ρ1
  – Q(x) is likely a priori, but unlikely after seeing the randomized value of x

slide 14: Transition Probabilities
• How to ensure that the randomization operator hides every property?
  – There are 2^|X| properties
  – Often the randomization operator has to be selected even before the distribution Px is known (why?)
• Idea: look at the operator's transition probabilities
  – How likely is xi to be mapped to a given y?
  – Intuition: if all possible values of xi are equally likely to be randomized to a given y, then revealing y = R(xi) will not reveal much about the actual value of xi

slide 15: Amplification  [Evfimievski et al.]
• The randomization operator is γ-amplifying for y if
  ∀ x1, x2 ∈ Vx:  p(x1 → y) / p(x2 → y) ≤ γ
• For given ρ1, ρ2, no straight or inverse privacy breaches occur if
  ρ2 (1 - ρ1) / ( ρ1 (1 - ρ2) ) > γ

slide 16: Amplification: Example
• For example, for randomization operator R3:
  p(x → y) = ½ (1/201 + 1/1001)   if y ∈ [x-100, x+100] (mod 1001)
           = ½ (1/1001) = 1/2002  otherwise
• Fractional difference = 1 + 1001/201 < 6 (= γ)
• Therefore, no straight or inverse privacy breaches will occur with ρ1 = 14%, ρ2 = 50% (both numbers are verified in the sketch below)
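A quick numeric check, again a Python sketch of my own rather than anything from the deck: it computes the amplification factor γ for R3 from its transition probabilities and then evaluates the breach condition ρ2(1-ρ1)/(ρ1(1-ρ2)) > γ from slide 15 for ρ1 = 0.14, ρ2 = 0.5.

```python
# Verify the amplification bound for R3 (slide 16).
N = 1001

def p_R3(x, y):
    # Transition probability p(x -> y) for R3: with prob. 1/2 add noise
    # uniform in [-100, 100] mod 1001, with prob. 1/2 output uniform.
    in_window = min((y - x) % N, (x - y) % N) <= 100
    return 0.5 * (1 / 201 if in_window else 0.0) + 0.5 / N

probs = [p_R3(x, 0) for x in range(N)]     # all transitions into y = 0
gamma = max(probs) / min(probs)            # worst-case ratio over x1, x2
print(gamma)                               # 1 + 1001/201 = 5.98... < 6

rho1, rho2 = 0.14, 0.50
print(rho2 * (1 - rho1) / (rho1 * (1 - rho2)))   # 6.14... > gamma,
# so no straight or inverse privacy breaches for these rho1, rho2
```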
slide 17: Output Perturbation Redux
• Randomize the response to each query
• User sends a query (S, f), where S is a set of rows and f is a function on rows
• True response: Σi∈S f(xi)
• Database adds random noise to the true response and returns Σi∈S f(xi) + noise

slide 18: Formally…
• The database is an n-tuple D = (d1, d2, ..., dn)
  – Elements are not random; the adversary may have a priori beliefs about their distribution or specific values
• For any predicate f: D → {0,1}, pi,f(n) is the probability that f(di) = 1, given the answers to n queries as well as all other entries dj for j ≠ i
  – pi,f(0) = a priori belief, pi,f(t) = belief after t answers
  – Why is the adversary given all entries except di?
• conf(p) = log( p / (1-p) )
  – From raw probability to "belief"

slide 19: Privacy Definition Revisited
• Idea: after each query, the adversary's gain in knowledge about any individual database entry should be small
  – Gain in knowledge about di as the result of the (n+1)st query: conf(pi,f(n+1)) - conf(pi,f(n))
• (A toy version of this query model is sketched below.)
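To make slides 17-19 concrete, a minimal sketch, again in Python and not part of the original deck: a noisy sum query in the style of slide 17, plus the conf() map from slide 18. The Gaussian noise and its scale are illustrative assumptions; these slides do not specify the noise distribution.

```python
import math
import random

def noisy_query(db, S, f, noise_scale=5.0):
    # Output perturbation (slide 17): the true response is the sum of f
    # over the selected rows; the database returns it with additive
    # noise. Gaussian noise is an assumed choice, not fixed by the deck.
    true_response = sum(f(db[i]) for i in S)
    return true_response + random.gauss(0, noise_scale)

def conf(p):
    # conf(p) = log(p / (1 - p)): maps a raw probability to a "belief"
    # (log-odds), as on slide 18.
    return math.log(p / (1 - p))

# Example: noisily count entries over 65 among the first 100 rows.
db = [random.randint(0, 90) for _ in range(1000)]    # ages 0..90
print(noisy_query(db, range(100), lambda age: age > 65))

# The gain in knowledge from one more answer (slide 19) is measured as
# conf(p_new) - conf(p_old), e.g.:
print(conf(0.6) - conf(0.5))
```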