Data Anonymization for Database Privacy
CS 590B: Database Security
Presented by: Tiancheng Li

Outline

- Background: Data Anonymization and Privacy
- Anonymizing Databases
  - k-Anonymity
  - l-Diversity
  - t-Closeness
- Anonymizing Unstructured Data
  - Anonymization for Web Search Privacy
    - query log
    - personalized search
  - Anonymizing Social Networks

Background

Motivation
- Data Collection: a large amount of person-specific data (termed microdata) has been collected in recent years by government agencies, institutions, and organizations.
- Data Mining: data and knowledge extracted by data mining techniques represent a key asset to society.
  - Analyzing trends/patterns.
  - Formulating public policies.
- Regulatory Laws: some collected data must be made public.
- These lead to the trend of microdata publishing.

Privacy
- The data usually contains sensitive information about respondents.
- Respondents' privacy may be at risk.

Privacy-Preserving Data Publishing

Two opposing goals
- To allow researchers to extract knowledge about the data
- To protect the privacy of every individual

Microdata table
- Identifier (ID), Quasi-Identifier (QID), Sensitive Attribute (SA)

  ID       | QID                | SA
  Name     | Zipcode  Age  Sex  | Disease
  Alice    | 47677    29   F    | Ovarian Cancer
  Betty    | 47602    22   F    | Ovarian Cancer
  Charles  | 47678    27   M    | Prostate Cancer
  David    | 47905    43   M    | Flu
  Emily    | 47909    52   F    | Heart Disease
  Fred     | 47906    47   M    | Heart Disease

Re-identification by Linking

Voter Registration Data

  Name   | Zipcode  Age  Sex
  Alice  | 47677    29   F
  Bob    | 47983    65   M
  Carol  | 47677    22   F
  Dan    | 47532    23   M
  Ellen  | 46789    43   F

The Microdata

  ID       | QID                | SA
  Name     | Zipcode  Age  Sex  | Disease
  Alice    | 47677    29   F    | Ovarian Cancer
  Betty    | 47602    22   F    | Ovarian Cancer
  Charles  | 47678    27   M    | Prostate Cancer
  David    | 47905    43   M    | Flu
  Emily    | 47909    52   F    | Heart Disease
  Fred     | 47906    47   M    | Heart Disease

Anonymization
- The first step: remove the identifiers.
- Not enough: records can still be re-identified by linking with external databases.
- Linking the two tables on (47677, 29, F) reveals: Alice has ovarian cancer!

Real Threats
- Fact: 87% of US citizens can be uniquely linked using only three attributes <Zipcode, DOB, Sex>.
- Sweeney managed to re-identify the medical record of the governor of Massachusetts.
  - In Massachusetts, the Group Insurance Commission (GIC) is responsible for purchasing health insurance for state employees.
  - GIC has to publish the data:
      GIC(zip, dob, sex, diagnosis, procedure, ...)
  - Sweeney paid $20 and bought the voter registration list for Cambridge, Massachusetts:
      VOTER(name, party, ..., zip, dob, sex)

Real Threats

- Sweeney managed to re-identify the medical record of the governor of Massachusetts.
  - William Weld (former governor) lives in Cambridge, hence is in VOTER.
  - Linking GIC and VOTER on dob, sex, zip:
    - 6 people had his dob
    - 3 were men (the same sex)
    - 1 in his zip

k-Anonymity & Generalization

- k-Anonymity [L. Sweeney]
  - Each record is indistinguishable from at least k-1 other records.
  - These k records form an equivalence class.
  - k-Anonymity ensures that linking cannot be performed with confidence > 1/k.
- Generalization
  - Replace with less-specific but semantically-consistent values.

  Attribute | Original values       | Generalized value
  Zipcode   | 47677, 47678, 47602   | 476**
  Age       | 29, 27, 22            | 2*
  Sex       | Male, Female          | *

k-Anonymity & Generalization

The Microdata

  QID                | SA
  Zipcode  Age  Sex  | Disease
  47677    29   F    | Ovarian Cancer
  47602    22   F    | Ovarian Cancer
  47678    27   M    | Prostate Cancer
  47905    43   M    | Flu
  47909    52   F    | Heart Disease
  47906    47   M    | Heart Disease

The Generalized Table

  QID                    | SA
  Zipcode  Age      Sex  | Disease
  476**    2*       *    | Ovarian Cancer
  476**    2*       *    | Ovarian Cancer
  476**    2*       *    | Prostate Cancer
  4790*    [43,52]  *    | Flu
  4790*    [43,52]  *    | Heart Disease
  4790*    [43,52]  *    | Heart Disease

- 3-Anonymous table
- Suppose that the adversary knows Alice's QI values (47677, 29, F).
  The adversary does not know which one of the first 3 records corresponds to Alice's record.

Attacks on k-Anonymity

- k-Anonymity does not provide privacy if:
  - Sensitive values in an equivalence class lack diversity
  - The attacker has background knowledge

A 3-anonymous patient table

  Zipcode  Age  | Disease
  476**    2*   | Heart Disease
  476**    2*   | Heart Disease
  476**    2*   | Heart Disease
  4790*    ≥40  | Flu
  4790*    ≥40  | Heart Disease
  4790*    ≥40  | Cancer
  476**    3*   | Heart Disease
  476**    3*   | Cancer
  476**    3*   | Cancer

- Homogeneity Attack
  - Bob: Zipcode 47678, Age 27
- Background Knowledge Attack
  - Carl: Zipcode 47673, Age 36

l-Diversity

- Principle
  - Each equivalence class has at least l well-represented sensitive values
- Distinct l-diversity
  - Each equivalence class has at least l distinct sensitive values
  - Probabilistic inference: of 10 records, 8 records have HIV and 2 records have other values

l-Diversity

- Probabilistic l-diversity
  - The frequency of the most frequent value in an equivalence class is bounded by 1/l.
- Entropy l-diversity
  - The entropy of the distribution of sensitive values in each equivalence class is at least log(l).
- Recursive (c,l)-diversity
  - The most frequent value does not appear too frequently:
    $r_1 < c\,(r_l + r_{l+1} + \dots + r_m)$, where $r_i$ is the frequency of the i-th most frequent value.

Limitations of l-Diversity

l-Diversity may be difficult and unnecessary to achieve.
- A single sensitive attribute
  - Two values: HIV positive (1%) and HIV negative (99%)
  - Very different degrees of sensitivity
- l-diversity is unnecessary to achieve
  - 2-diversity is unnecessary for an equivalence class that contains only negative records
- l-diversity is difficult to achieve
  - Suppose there are 10000 records in total
  - To have distinct 2-diversity, there can be at most 10000*1% = 100 equivalence classes, because each class needs at least one of the 100 positive records

Limitations of l-Diversity

l-Diversity is insufficient to prevent attribute disclosure.

Skewness Attack
- Two sensitive values
  - HIV positive (1%) and HIV negative (99%)

l-Diversity does not consider the overall distribution of sensitive values
- Serious privacy risk
  - Consider an equivalence class that contains an equal number of positive records and negative records: every individual in it appears positive with probability 50%, compared with 1% overall
- l-diversity does not differentiate:
  - Equivalence class 1: 49 positive + 1 negative
  - Equivalence class 2: 1 positive + 49 negative

Limitations of l-Diversity

  Bob
  Zip  Age

  Zipcode  Age  Salary  Disease
  476**    2*   20K
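
The skewness example above (49 positive + 1 negative versus 1 positive + 49 negative) can be checked numerically. The following is a minimal sketch, not the method of the l-diversity authors: the function names (`diversity_measures`, `is_distinct_l_diverse`, `is_probabilistic_l_diverse`, `is_entropy_l_diverse`) are hypothetical, the entropy uses the natural logarithm, and every equivalence class is assumed to be a non-empty list of sensitive values.

```python
import math
from collections import Counter


def diversity_measures(sensitive_values):
    """Summarize one non-empty equivalence class (hypothetical helper).

    Returns (number of distinct sensitive values,
             relative frequency of the most frequent value,
             entropy of the value distribution in nats).
    """
    counts = Counter(sensitive_values)
    total = len(sensitive_values)
    # Sort so that the result does not depend on the order of the records.
    freqs = sorted(count / total for count in counts.values())
    entropy = -sum(p * math.log(p) for p in freqs)
    return len(counts), max(freqs), entropy


def is_distinct_l_diverse(sensitive_values, l):
    # At least l distinct sensitive values in the class.
    distinct, _, _ = diversity_measures(sensitive_values)
    return distinct >= l


def is_probabilistic_l_diverse(sensitive_values, l):
    # The most frequent value has relative frequency at most 1/l.
    _, max_freq, _ = diversity_measures(sensitive_values)
    return max_freq <= 1 / l


def is_entropy_l_diverse(sensitive_values, l):
    # The entropy of the sensitive values is at least log(l).
    _, _, entropy = diversity_measures(sensitive_values)
    return entropy >= math.log(l)


# The two equivalence classes from the skewness example.
class_1 = ["positive"] * 49 + ["negative"] * 1
class_2 = ["positive"] * 1 + ["negative"] * 49

for name, values in (("class 1", class_1), ("class 2", class_2)):
    distinct, max_freq, entropy = diversity_measures(values)
    share_positive = values.count("positive") / len(values)
    print(name, distinct, round(max_freq, 2), round(entropy, 3), share_positive)
```

Under these assumptions both classes report the same three measures, so none of the three l-diversity variants can tell them apart, even though the share of positive records is 98% in class 1 and 2% in class 2 against 1% overall.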