
Yale CPSC 457 - I’m in the Database, But Nobody Knows


Outline
- I’m in the Database, But Nobody Knows
- Many Threats to Privacy of Electronic Data
- This Talk: Privacy-Preserving Data Analysis
- “Pure” Privacy Problem
- Typical Suggestions
- AOL Search History Release (2006)
- William Weld’s Medical Record [Sweeney02]
- GWAS Membership [Homer et al. ‘08]
- Anonymized Social Networks [BackstromDK07]
- Definitional Failures
- Parable: How Tall is Pamela Jones (Groklaw)?
- Differential Privacy [Dwork-McSherry-Nissim-Smith 2006]
- Differential Privacy
- Snow 1854
- https://h1n1.cloudapp.net/Privacy.aspx
- Mission Creep
- Pan-Private Streaming Algorithms [DNPRY10]
- DiffeP: Limitations and Challenges
- Utility Implies Exposure to Harm
- Pause
- Which Ad(s) Am I Charged For?
- More Subtle Attack
- Wall Street Journal 4/4/2010
- Work in Progress
- Thank You!

I’m in the Database, But Nobody Knows
Cynthia Dwork, Microsoft Research

Many Threats to Privacy of Electronic Data
- Theft
- Phishing
- Viruses
- Cryptanalysis
- Changing Privacy Policies
- …

This Talk: Privacy-Preserving Data Analysis
- “First Tier” Motivating Examples
  - Analysis of Census Data, Medical Outcomes Data, GWAS data, Epidemiology, Analysis of Vehicle Braking Records
- “Second Tier” Examples
  - Training an advertising classifier, Recommendation System, Netflix Challenge

“Pure” Privacy Problem
- Difficult even if
  - Curator is Angel
  - Data are in Vault

Typical Suggestions
- “Large Set” Queries
  - How many MSFT employees have Sickle Cell Trait (CST)?
  - How many MSFT employees who are not female Distinguished Scientists with very curly hair have the CST?
- Add Random Noise to True Answer
  - Average of responses to repeated queries converges to true answer
  - Can’t simply detect repetition (undecidable)
- Detect When Answering is Unsafe
  - Refusal can be disclosive

A Litany

AOL Search History Release (2006)
- Name: Thelma Arnold
- Age: 62
- Widow
- Residence: Lilburn, GA

William Weld’s Medical Record [Sweeney02]
- Fields shared by both data sets: ZIP, birthdate, sex
- Voter registration data: name, address, date reg., party affiliation, last voted
- HMO data: ethnicity, visit date, diagnosis, procedure, medication, total charge

GWAS Membership [Homer et al. ‘08]
- SNP: Single Nucleotide (A,C,G,T) polymorphism
- Reference Population: Major Allele (C): 94%, Minor Allele (T): 6%
- Genome-Wide Association Study: allele frequencies for many thousands of SNPs

Anonymized Social Networks [BackstromDK07]
- Magic Step: isolate lightly linked-in subgraphs from rest of graph
- Special structure of subgraph permits finding A, B

Definitional Failures
- Failure to Cope with Auxiliary Information
  - Existing and future databases, newspaper reports, Flickr, literature, etc.
- Definitions are Syntactic and Ad Hoc
- Dalenius’s Ad Omnia Guarantee (1977): Anything that can be learned about a respondent from the statistical database can be learned without access to the database
  - Unachievable

Parable: How Tall is Pamela Jones (Groklaw)?
- An Admittedly Unreasonable Impossibility Proof
- Database teaches average heights of population subgroups
- “PJ is 2 inches shorter than avg Swedish woman”
- PJ’s height learnable with the DB, not learnable without.
- PJ loses privacy whether or not she is in the database
- Suggests new notion of privacy: risk incurred by joining DB

The outcome of any analysis is essentially equally likely, independent of whether any individual joins or refrains from joining the dataset.
(The likelihood is over the choices made by the algorithm.)
- Neutralizes all linkage attacks.
- Composes unconditionally and automatically: $\sum_i \varepsilon_i$
http://research.microsoft.com/en-us/projects/DatabasePrivacy/

Differential Privacy [Dwork-McSherry-Nissim-Smith 2006]
[Figure: Pr[response], with bad responses marked X; ratio bounded]
M gives $\varepsilon$-differential privacy if for all adjacent $D_1$ and $D_2$, and all $C \subseteq \mathrm{range}(M)$:
$\Pr[M(D_1) \in C] \le e^{\varepsilon} \Pr[M(D_2) \in C]$

Differential Privacy
- Resilience to All Auxiliary Information
  - Past, present, future data sources and algorithms
- Low-error high-privacy DP techniques exist for many problems
  - datamining tasks (association rules, decision trees, clustering, …), contingency tables, histograms, synthetic data sets for query logs, machine learning (boosting, statistical queries learning model, SVMs, logistic regression), various statistical estimators, network trace analysis, recommendation systems, …
- Programming Platforms
  - http://research.microsoft.com/en-us/projects/PINQ/
  - http://userweb.cs.utexas.edu/~shmat/shmat_nsdi10.pdf
  - Download today!

Can we store and share your information with health officials and researchers?
“…This information can be very helpful in monitoring regional health conditions, plan flu response, and conduct health research. By allowing the responses to the survey questions to be used for public health, education and research purposes, you can help your community.”

Snow 1854
[Figure: map labels: suspected pump, cholera cases]

https://h1n1.cloudapp.net/Privacy.aspx
“Microsoft may also disclose information if required to do so by law or in the good faith belief that such action is necessary to (a) conform to the edicts of the law or comply with legal process served on Microsoft or the Site; (b) protect and defend the rights or property of Microsoft and our family of Web sites; or (c) act in urgent circumstances to protect the personal safety of users of Microsoft products or members of the public.”

Mission Creep?
- “Think of the children!”
- Never store the data!

Pan-Private Streaming Algorithms [DNPRY10]
- Private “inside and out”
- Completely hide the pattern of appearances of any individual
  - Presence, absence, frequency, etc.
- Protect against mission creep, subpoena, intrusion

DiffeP: Limitations and Challenges
- Can’t study outliers
- Privacy erosion over multiple analyses is cumulative
- Privacy erosion over multiple databases is cumulative
- Compare real world to one in which my info is everywhere deleted, looking at a lifetime of exposure against worst-case adversary/information/collection of databases
- Formally capture “reasonable” worlds?
- What are the right questions to ask about social networks?
  - Removing one person can affect data of many other people

Utility Implies Exposure to Harm
- Database teaches that smoking causes cancer. Smoker S’s insurance premiums rise. But learning that smoking causes cancer is the whole point.
- Smoker S enrolls in a smoking cessation program.
- May be fine for “First-Tier” Uses, but what about “Second Tier”?
- Who decides, and
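
The “Typical Suggestions” slide warns that adding random noise does not help when the same query can be asked repeatedly, because the average of the responses converges to the true answer. A minimal Python sketch of that averaging attack follows. It assumes fresh, independent, zero-mean Gaussian noise on every answer; the true count of 17, the noise scale, and all function names are hypothetical and not taken from the talk.

```python
import random


def noisy_count(true_count: int, scale: float, rng: random.Random) -> float:
    """Answer a counting query with fresh zero-mean Gaussian noise (assumed noise model)."""
    return true_count + rng.gauss(0.0, scale)


def averaged_estimate(true_count: int, scale: float, repetitions: int,
                      rng: random.Random) -> float:
    """Ask the same query many times and average the noisy answers."""
    answers = [noisy_count(true_count, scale, rng) for _ in range(repetitions)]
    return sum(answers) / len(answers)


rng = random.Random(0)      # fixed seed so the sketch is repeatable
TRUE_COUNT = 17             # hypothetical secret count
NOISE_SCALE = 10.0          # hypothetical standard deviation of the noise

single_answer = noisy_count(TRUE_COUNT, NOISE_SCALE, rng)
averaged = averaged_estimate(TRUE_COUNT, NOISE_SCALE, 10_000, rng)

print(f"one noisy answer:      {single_answer:.2f}")
print(f"average of 10,000:     {averaged:.2f}")
print(f"error after averaging: {abs(averaged - TRUE_COUNT):.2f}")
```

A single answer can be off by many units, while the standard error of the average shrinks like the noise scale divided by the square root of the number of repetitions. This is why the slide pairs the noise suggestion with the remark that repetition cannot simply be detected.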
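
One standard way to satisfy the ε-differential privacy definition above for a counting query is the Laplace mechanism. The slides shown here do not spell out a mechanism, so the following is a sketch under stated assumptions: adjacent databases differ in a single row, so a count changes by at most 1, and Laplace noise of scale 1/ε is added to the true count. The rows, the field name `has_trait`, and all function names are hypothetical.

```python
import math
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(rows, predicate, epsilon: float, rng: random.Random) -> float:
    """Release a count with Laplace noise of scale 1/epsilon (assumes sensitivity 1)."""
    true_count = sum(1 for row in rows if predicate(row))
    return true_count + laplace_noise(1.0 / epsilon, rng)


def laplace_density(x: float, center: float, scale: float) -> float:
    """Density of the Laplace distribution centered at `center`."""
    return math.exp(-abs(x - center) / scale) / (2.0 * scale)


def worst_density_ratio(count_a: int, count_b: int, epsilon: float, outputs) -> float:
    """Largest ratio of output densities for two true counts, over a grid of outputs."""
    scale = 1.0 / epsilon
    return max(
        laplace_density(x, count_a, scale) / laplace_density(x, count_b, scale)
        for x in outputs
    )


epsilon = 0.5
d1 = [{"has_trait": True}, {"has_trait": False}, {"has_trait": True}]  # hypothetical rows
d2 = d1 + [{"has_trait": True}]                                        # adjacent: one extra row

rng = random.Random(1)
released = private_count(d1, lambda row: row["has_trait"], epsilon, rng)

# True counts are 2 (d1) and 3 (d2); check the density ratio on a grid of outputs.
grid = [x / 10.0 for x in range(-100, 151)]
ratio = max(worst_density_ratio(2, 3, epsilon, grid),
            worst_density_ratio(3, 2, epsilon, grid))

# Composition: the privacy losses of separate analyses add up.
total_epsilon = sum([0.5, 0.25, 0.25])

print(f"released noisy count: {released:.2f}")
print(f"worst density ratio:  {ratio:.4f} (bound e^epsilon = {math.exp(epsilon):.4f})")
print(f"total epsilon over three analyses: {total_epsilon}")
```

The ratio check only samples a finite grid of outputs, so it illustrates the bound rather than proving it. The last line mirrors the composition statement on the slides: running several analyses spends the sum of their ε values.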

