Privacy-Preserving Data Mining
Rakesh Agrawal and Ramakrishnan Srikant
IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120
Published in: ACM SIGMOD International Conference on Management of Data, 2000.
Slides by: Adam Kaplan (for CS259, Fall '03)

What is Data Mining?
• Information extraction
  – Performed on databases
• Combines theory from
  – machine learning
  – statistical analysis
  – database technology
• Finds patterns/relationships in data
• Trains the system to predict future results
• Typical applications:
  – customer profiling
  – fraud detection
  – credit risk analysis
Definition borrowed from the Two Crows corporate website: http://www.twocrows.com

A Simple Example of Data Mining
TRAINING DATA:
  Person | Age | Salary | Credit Risk
  0      | 23  | 50K    | High
  1      | 17  | 30K    | High
  2      | 43  | 40K    | High
  3      | 68  | 50K    | Low
  4      | 32  | 70K    | Low
  5      | 20  | 20K    | High
CREDIT RISK decision tree learned from the training data:
  Age < 25?
  ├─ yes → High
  └─ no → Salary < 50K?
           ├─ yes → High
           └─ no → Low

Data Mining
• Recursively partition the training data into a decision-tree classifier
  – Non-leaf nodes = split points; each tests a specific data attribute
  – Leaf nodes = entirely or "mostly" represent data from the same class
• Previous well-known methods to automate classification
  – [Mehta, Agrawal, Rissanen EDBT'96] – the SLIQ paper
  – [Shafer, Agrawal, Mehta VLDB'96] – the SPRINT paper

Where does privacy fit in?
• Data mining is performed on databases
  – Many of them contain sensitive/personal data
    • e.g. Salary, Age, Address, Credit History
  – Much of data mining is concerned with aggregates
    • Statistical models of many records
    • May not require precise information from each record
• Is it possible to build a decision-tree classifier accurately
  – without accessing the actual fields of user records?
  – (...thus protecting the privacy of the user)

Preserving Data Privacy (1)
• Value-Class Membership
  – Discretization: the values of an attribute are discretized into intervals
    • Intervals need not be of equal width.
    • Computation uses the interval covering the data, rather than the data itself.
  – Example:
    • Perhaps Adam doesn't want people to know he makes $4,000/year.
    • Maybe he's more comfortable saying he makes between $0 and $20,000 per year.
  – The most often used method for hiding individual values.

Preserving Data Privacy (2)
• Value Distortion
  – Instead of using the actual data value xi,
  – use xi + r, where r is a random value drawn from a distribution.
• Uniform Distribution
  – r is uniformly distributed between [-α, +α]
  – The average of r is 0.
• Gaussian Distribution
  – r has a normal distribution
  – The mean μ of r is 0.
  – The standard deviation of r is σ.
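
A minimal sketch of the two value-distortion schemes, assuming Python/NumPy; the function names are illustrative, not from the paper or slides:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def perturb_uniform(x, alpha):
    """Distort each value as x_i + r, with r uniform over [-alpha, +alpha]."""
    return x + rng.uniform(-alpha, alpha, size=len(x))

def perturb_gaussian(x, sigma):
    """Distort each value as x_i + r, with r drawn from N(0, sigma^2)."""
    return x + rng.normal(0.0, sigma, size=len(x))

# Example: distort the Salary column of the training data above.
salary = np.array([50_000, 30_000, 40_000, 50_000, 70_000, 20_000], dtype=float)
print(perturb_uniform(salary, alpha=25_000))   # each value shifted by up to ±25K
print(perturb_gaussian(salary, sigma=10_000))  # noise with standard deviation 10K
```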

What do we mean by "private?"
• If we can estimate with c% confidence that
  – the value x lies within the interval [x1, x2],
  – then privacy = (x2 - x1), the size of the range.
• If we want very high privacy:
  – choose 2α > W (W = the width of the intervals used in discretization)
  – at higher confidence levels, the value-distortion methods (Uniform, Gaussian) provide more privacy than discretization.
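
A small worked sketch of this privacy measure (Python assumed; `scipy` supplies the normal quantile, and the flat-prior interval widths are my reading of the definition above, not formulas from the slides):

```python
from scipy.stats import norm

def privacy_uniform(alpha, c):
    """c-confidence interval width for uniform noise on [-alpha, +alpha]:
    x lies in an interval of width 2*alpha around the distorted value
    (flat prior assumed), so a c-confidence interval has width c * 2*alpha."""
    return c * 2 * alpha

def privacy_gaussian(sigma, c):
    """Width of the central c-confidence interval for N(0, sigma^2) noise."""
    z = norm.ppf(0.5 + c / 2.0)   # e.g. 1.96 for c = 0.95
    return 2 * z * sigma

def privacy_discretization(W, c):
    """Discretization reveals the enclosing interval itself, so the privacy
    is the interval width W at any confidence level."""
    return W

# With 2*alpha > W, uniform distortion beats discretization even at 95%:
W, alpha = 20_000.0, 12_500.0               # e.g. the $0-$20,000 salary bucket
print(privacy_uniform(alpha, 0.95))         # 23750.0 > 20000
print(privacy_discretization(W, 0.95))      # 20000.0
```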


Reconstructing Original Distribution From Distorted Values (1)
• Define:
  – Original data values: x1, x2, ..., xn
  – Random distortion values: y1, y2, ..., yn
  – Distorted samples: x1+y1, x2+y2, ..., xn+yn
  – FY: the cumulative distribution function (CDF) of the random distortion variables yi
  – FX: the CDF of the original data values xi

Reconstructing Original Distribution From Distorted Values (2)
• The Reconstruction Problem
  – Given
    • FY
    • the distorted samples (x1+y1, ..., xn+yn)
  – estimate FX.

Reconstruction Algorithm (1)
How it works (incremental refinement of the estimate of FX):
1. Initialize the density estimate f(x, 0) to the uniform distribution.
2. For j = 0 until the stopping criterion is met:
3.   Compute f(x, j+1) as a function of f(x, j) and FY.
4. When the loop stops, f(x) is the estimate of the original distribution FX.

Reconstruction Algorithm (2)
Stopping Criterion
• Compare successive estimates f(x, j).
• Stop when the difference between successive estimates is very small.
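
The slides do not spell out the update in step 3; the sketch below follows the iterative Bayesian update from the Agrawal–Srikant paper, discretized onto a grid (Python assumed; all names are illustrative):

```python
import numpy as np

def reconstruct(w, noise_pdf, grid, max_iter=500, tol=1e-6):
    """Estimate the original density f_X on `grid` from distorted samples
    w_i = x_i + y_i, given the noise density f_Y (a discretized sketch of
    the paper's iterative Bayesian update rule)."""
    dz = grid[1] - grid[0]
    f = np.full(len(grid), 1.0 / (len(grid) * dz))      # step 1: uniform init
    for _ in range(max_iter):                           # step 2
        # Step 3: average, over all samples, the posterior density of x
        # given w_i under the current estimate f.
        fy = noise_pdf(w[:, None] - grid[None, :])      # f_Y(w_i - a), shape (n, m)
        num = fy * f[None, :]
        denom = num.sum(axis=1, keepdims=True) * dz     # ≈ ∫ f_Y(w_i - z) f(z) dz
        f_next = (num / denom).mean(axis=0)
        if np.max(np.abs(f_next - f)) * dz < tol:       # stopping criterion
            return f_next
        f = f_next
    return f

# Example: recover a bimodal distribution distorted by uniform noise.
rng = np.random.default_rng(1)
alpha = 0.5
x = np.concatenate([rng.normal(0.2, 0.05, 500), rng.normal(0.8, 0.05, 500)])
w = x + rng.uniform(-alpha, alpha, size=len(x))
grid = np.linspace(-1.0, 2.0, 300)
uniform_pdf = lambda t: (np.abs(t) <= alpha) / (2 * alpha)
f_hat = reconstruct(w, uniform_pdf, grid)   # estimate of the density of X
```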

Distribution Reconstruction Results (1)
[Figure: Original = the original distribution; Randomized = the effect of randomization on the original distribution; Reconstructed = the reconstructed distribution.]

Distribution Reconstruction Results (2)
[Figure with the same three series as above.]

Summary of Reconstruction Experiments
• The authors are able to reconstruct
  – the original shape of the data
  – almost the same aggregate distribution
• This can be done even when the randomized data distribution looks nothing like the original.

Decision-Tree Classifiers w/ Perturbed Data
(The CREDIT RISK decision tree from the earlier example.)
When/how should the original distributions be recovered in order to build the tree?
• Global – for each attribute, reconstruct the original distribution before building the tree
• ByClass – for each attribute, split the training data by class and reconstruct the distributions separately for each class; then build the tree
• Local – like ByClass, reconstruct the distribution separately for each class, but perform the reconstruction while building the decision tree
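
A structural sketch of how the three strategies differ, assuming Python; `reconstruct_attribute` and `build_tree` are hypothetical stand-ins for the reconstruction procedure above and any standard tree-induction routine (e.g. SLIQ/SPRINT style):

```python
import numpy as np

def reconstruct_attribute(values):
    """Stand-in for distribution reconstruction of one perturbed column."""
    return values  # a real version would de-randomize, as sketched earlier

def build_tree(data, labels):
    """Stand-in for any decision-tree induction algorithm."""
    return ("tree", data.shape, np.unique(labels).tolist())

def train_global(data, labels):
    # Global: reconstruct each attribute once, over all records, then build.
    fixed = np.column_stack([reconstruct_attribute(data[:, j])
                             for j in range(data.shape[1])])
    return build_tree(fixed, labels)

def train_by_class(data, labels):
    # ByClass: reconstruct each attribute separately per class, then build.
    fixed = np.empty_like(data)
    for c in np.unique(labels):
        mask = labels == c
        for j in range(data.shape[1]):
            fixed[mask, j] = reconstruct_attribute(data[mask, j])
    return build_tree(fixed, labels)

# Local differs from ByClass only in *when* reconstruction happens: the same
# per-class reconstruction is repeated at each node, on the records reaching
# that node, as the tree is grown.
```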

Experimental Results – Classification w/ Perturbed Data
• Compare the Global, ByClass, and Local algorithms against two control series:
  – Original – the result of classification on the unperturbed training data
  – Randomized – the result of classification on the perturbed data with no correction
• Run on five classification functions, Fn1 through Fn5, which classify records into groups based on their attributes.

Results – Classification Accuracy (1)
[Figure: classification accuracy of the competing algorithms.]

Results – Classification Accuracy (2)
[Figure: classification accuracy of the competing algorithms, continued.]

Experimental Results – Varying Privacy
• Using the ByClass algorithm on each classification function (except Fn4)
  – Vary the privacy level from 10% to 200%
  – Show:
    • Original – unperturbed data
    • ByClass(G) – ByClass with Gaussian perturbation
    • ByClass(U) – ByClass with Uniform perturbation
    • Random(G) – uncorrected data with Gaussian perturbation
    • Random(U) – uncorrected data with Uniform perturbation

Results – Accuracy vs. Privacy (1)
[Figure: accuracy as a function of the privacy level.]