Purdue CS 59000 - Data Anonymization for Database Privacy

This preview shows page 1-2-14-15-29-30 out of 30 pages.


Data Anonymization for Database Privacy
CS 590B: Database Security
Presented by: Tiancheng Li

Outline
  - Background: Data Anonymization and Privacy
  - Anonymizing Databases
      - k-Anonymity
      - l-Diversity
      - t-Closeness
  - Anonymizing Unstructured Data
      - Anonymization for Web Search Privacy
          - query log
          - personalized search
      - Anonymizing Social Networks

Background
  Motivation
  - Data Collection: a large amount of person-specific data (termed microdata) has been collected in recent years by government agencies, institutions, and organizations.
  - Data Mining: data and knowledge extracted by data mining techniques represent a key asset to society.
      - Analyzing trends/patterns.
      - Formulating public policies.
  - Regulatory Laws: some collected data must be made public.
  These lead to the trend of microdata publishing.
  Privacy
  - The data usually contains sensitive information about respondents.
  - Respondents' privacy may be at risk.

Privacy-Preserving Data Publishing
  Two opposing goals
  - To allow researchers to extract knowledge about the data
  - To protect the privacy of every individual
  Microdata table
  - Identifier (ID), Quasi-Identifier (QID), Sensitive Attribute (SA)

  ID       | QID                | SA
  Name     | Zipcode  Age  Sex  | Disease
  Alice    | 47677    29   F    | Ovarian Cancer
  Betty    | 47602    22   F    | Ovarian Cancer
  Charles  | 47678    27   M    | Prostate Cancer
  David    | 47905    43   M    | Flu
  Emily    | 47909    52   F    | Heart Disease
  Fred     | 47906    47   M    | Heart Disease

Re-identification by Linking
  Voter Registration Data

  Name   | Zipcode  Age  Sex
  Alice  | 47677    29   F
  Bob    | 47983    65   M
  Carol  | 47677    22   F
  Dan    | 47532    23   M
  Ellen  | 46789    43   F

  The Microdata

  ID       | QID                | SA
  Name     | Zipcode  Age  Sex  | Disease
  Alice    | 47677    29   F    | Ovarian Cancer
  Betty    | 47602    22   F    | Ovarian Cancer
  Charles  | 47678    27   M    | Prostate Cancer
  David    | 47905    43   M    | Flu
  Emily    | 47909    52   F    | Heart Disease
  Fred     | 47906    47   M    | Heart Disease

  Anonymization
  - The first step: remove the identifiers.
  - This is not enough: records can still be re-identified by linking the QID values with external databases.
  - Alice's QID values (47677, 29, F) match exactly one microdata record, so the adversary learns that Alice has ovarian cancer!

Real Threats
  - Fact: 87% of US citizens can be uniquely linked using only three attributes <Zipcode, DOB, Sex>.
  - Sweeney managed to re-identify the medical record of the governor of Massachusetts.
      - In Massachusetts, the Group Insurance Commission (GIC) is responsible for purchasing health insurance for state employees.
      - GIC has to publish the data: GIC(zip, dob, sex, diagnosis, procedure, ...)
      - Sweeney paid $20 and bought the voter registration list for Cambridge, Massachusetts: VOTER(name, party, ..., zip, dob, sex)
      - The two tables can be linked on dob, sex, zip.
      - William Weld (former governor) lives in Cambridge, hence is in VOTER.
          - 6 people had his dob
          - 3 were men (the same sex)
          - 1 in his zip

k-Anonymity & Generalization
  k-Anonymity [L. Sweeney]
  - Each record is indistinguishable from at least k-1 other records.
  - These k records form an equivalence class.
  - k-Anonymity ensures that linking cannot be performed with confidence > 1/k.
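
The k-anonymity condition can be checked mechanically: group the records by their quasi-identifier values and confirm that every group (equivalence class) holds at least k records. The following is a minimal sketch under stated assumptions, not the course's implementation: the function names `equivalence_class_sizes` and `is_k_anonymous` and the dict-based record layout are hypothetical. The first table is the raw microdata from the slides; the second uses the generalized QID values (476**, 2*, *) and (4790*, [43,52], *) from the deck's generalization example.

```python
from collections import Counter

def equivalence_class_sizes(records, qid):
    """Count how many records share each combination of QID values."""
    return Counter(tuple(r[a] for a in qid) for r in records)

def is_k_anonymous(records, qid, k):
    """True if every record shares its QID values with at least k-1 others."""
    return all(n >= k for n in equivalence_class_sizes(records, qid).values())

QID = ("zipcode", "age", "sex")

def make(rows):
    # Helper (hypothetical): turn tuples into dict records.
    return [dict(zip(QID + ("disease",), row)) for row in rows]

raw = make([
    ("47677", "29", "F", "Ovarian Cancer"),
    ("47602", "22", "F", "Ovarian Cancer"),
    ("47678", "27", "M", "Prostate Cancer"),
    ("47905", "43", "M", "Flu"),
    ("47909", "52", "F", "Heart Disease"),
    ("47906", "47", "M", "Heart Disease"),
])

generalized = make([
    ("476**", "2*", "*", "Ovarian Cancer"),
    ("476**", "2*", "*", "Ovarian Cancer"),
    ("476**", "2*", "*", "Prostate Cancer"),
    ("4790*", "[43,52]", "*", "Flu"),
    ("4790*", "[43,52]", "*", "Heart Disease"),
    ("4790*", "[43,52]", "*", "Heart Disease"),
])

print(is_k_anonymous(raw, QID, 2))          # False: every raw record is unique
print(is_k_anonymous(generalized, QID, 3))  # True: two classes of 3 records
```

With the smallest equivalence class at 3 records, an adversary who knows a target's QID values can link to a specific record with confidence at most 1/3.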

  Generalization
  - Replace with less-specific but semantically-consistent values

  Zipcode: 47677, 47678, 47602  ->  476**
  Age:     29, 27, 22           ->  2*
  Sex:     Male, Female         ->  *

k-Anonymity & Generalization
  The Microdata

  QID                | SA
  Zipcode  Age  Sex  | Disease
  47677    29   F    | Ovarian Cancer
  47602    22   F    | Ovarian Cancer
  47678    27   M    | Prostate Cancer
  47905    43   M    | Flu
  47909    52   F    | Heart Disease
  47906    47   M    | Heart Disease

  The Generalized Table

  QID                    | SA
  Zipcode  Age      Sex  | Disease
  476**    2*       *    | Ovarian Cancer
  476**    2*       *    | Ovarian Cancer
  476**    2*       *    | Prostate Cancer
  4790*    [43,52]  *    | Flu
  4790*    [43,52]  *    | Heart Disease
  4790*    [43,52]  *    | Heart Disease

  - 3-Anonymous table
  - Suppose that the adversary knows Alice's QI values (47677, 29, F). The adversary does not know which one of the first 3 records corresponds to Alice's record.

Attacks on k-Anonymity
  k-Anonymity does not provide privacy if:
  - Sensitive values in an equivalence class lack diversity
  - The attacker has background knowledge

  A 3-anonymous patient table

  Zipcode  Age  | Disease
  476**    2*   | Heart Disease
  476**    2*   | Heart Disease
  476**    2*   | Heart Disease
  4790*    ≥40  | Flu
  4790*    ≥40  | Heart Disease
  4790*    ≥40  | Cancer
  476**    3*   | Heart Disease
  476**    3*   | Cancer
  476**    3*   | Cancer

  Homogeneity Attack
  - Bob: Zipcode 47678, Age 27

  Background Knowledge Attack
  - Carl: Zipcode 47673, Age 36

l-Diversity
  Principle
  - Each equivalence class has at least l well-represented sensitive values
  Distinct l-diversity
  - Each equivalence class has at least l distinct sensitive values
  - Probabilistic inference: an equivalence class of 10 records in which 8 records have HIV and 2 records have other values

l-Diversity
  Probabilistic l-diversity
  - The frequency of the most frequent value in an equivalence class is bounded by 1/l.
  Entropy l-diversity
  - The entropy of the distribution of sensitive values in each equivalence class is at least log(l)
  Recursive (c,l)-diversity
  - The most frequent value does not appear too frequently
  - $r_1 < c\,(r_l + r_{l+1} + \dots + r_m)$, where $r_i$ is the frequency of the i-th most frequent value.

Limitations of l-Diversity
  l-Diversity may be difficult and unnecessary to achieve.
  - A single sensitive attribute
      - Two values: HIV positive (1%) and HIV negative (99%)
      - Very different degrees of sensitivity
  - l-diversity is unnecessary to achieve
      - 2-diversity is unnecessary for an equivalence class that contains only negative records
  - l-diversity is difficult to achieve
      - Suppose there are 10000 records in total
      - To have distinct 2-diversity, there can be at most 10000*1%=100 equivalence classes

Limitations of l-Diversity
  l-Diversity is insufficient to prevent attribute disclosure.
  Skewness Attack
  - Two sensitive values: HIV positive (1%) and HIV negative (99%)
  - Serious privacy risk
      - Consider an equivalence class that contains an equal number of positive records and negative records
  - l-diversity does not differentiate:
      - Equivalence class 1: 49 positive + 1 negative
      - Equivalence class 2: 1 positive + 49 negative
  l-Diversity does not consider the overall distribution of sensitive values

Limitations of l-Diversity
  Bob: Zip, Age

  Zipcode  Age  Salary  Disease
  476**    2*   20K
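
The l-diversity variants defined in the slides (distinct, probabilistic, entropy, recursive (c,l)) can each be written as a short test on the sensitive values of one equivalence class. This is a minimal sketch under stated assumptions: the function names are hypothetical, the natural logarithm is used on both sides of the entropy test, and a small tolerance absorbs floating-point rounding. The sample classes come from the 3-anonymous patient table and the skewness example in the slides.

```python
import math
from collections import Counter

def distinct_l(values, l):
    """At least l distinct sensitive values in the class."""
    return len(set(values)) >= l

def probabilistic_l(values, l):
    """Frequency of the most frequent value is at most 1/l."""
    most_common = max(Counter(values).values())
    return most_common * l <= len(values)

def entropy_l(values, l):
    """Entropy of the sensitive-value distribution is at least log(l)."""
    n = len(values)
    entropy = -sum((c / n) * math.log(c / n) for c in Counter(values).values())
    return entropy >= math.log(l) - 1e-12  # tolerance for rounding

def recursive_cl(values, c, l):
    """r_1 < c * (r_l + r_{l+1} + ... + r_m), counts sorted descending."""
    r = sorted(Counter(values).values(), reverse=True)
    return r[0] < c * sum(r[l - 1:])

homogeneous = ["Heart Disease"] * 3             # class (476**, 2*)
mixed = ["Flu", "Heart Disease", "Cancer"]      # class (4790*, >=40)
skew_1 = ["positive"] * 49 + ["negative"]       # 49 positive + 1 negative
skew_2 = ["positive"] + ["negative"] * 49       # 1 positive + 49 negative

print(distinct_l(homogeneous, 2))                      # False
print(distinct_l(mixed, 3), entropy_l(mixed, 3))       # True True
print(distinct_l(skew_1, 2), distinct_l(skew_2, 2))    # True True
print(probabilistic_l(skew_1, 2))                      # False
```

Distinct 2-diversity accepts both skewed classes, which illustrates the slides' point that it does not differentiate a class that is 98% positive from one that is 2% positive.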

