Berkeley COMPSCI 186 - Approximation Techniques for Data Management Systems - D577051

Home> Schools> University of California, Berkeley> Computer Science (COMPSCI) > COMPSCI 186> Approximation Techniques for Data Management Systems

Berkeley COMPSCI 186 - Approximation Techniques for Data Management Systems

School name University of California, Berkeley

Course Compsci 186- Introduction to Database Systems

Pages 27

Download Save

Unformatted text preview:

Approximation Techniques for Data Management SystemsTraditional Query ProcessingFast Approximate AnswersApproximate Query ProcessingSampling: BasicsSampling: Confidence IntervalsSampling from DatabasesOne-Pass Uniform SamplingHistograms1-D HistogramsAnswering Queries Using HistogramsHaar Wavelet SynopsesHaar Wavelet CoefficientsWavelet Data SynopsesMulti-dimensional Data SynopsesMulti-dimensional HistogramsData Synopses in Commercial DBMSsData-Stream ManagementNetworks Generate Massive Data StreamsReal-Time Data-Stream AnalysisData-Stream Processing ModelDistinct Value EstimationHash (aka FM) Sketches for Count DistinctHash (aka FM) Sketches for Count DistinctA Little Streaming Puzzle…In Summary: Not your parents’ DBMS!More details…Approximation Techniques Approximation Techniques for Data Management for Data Management SystemsSystems““We are drowning in data but We are drowning in data but starved for knowledge”starved for knowledge”John NaisbittJohn NaisbittCS 186Fall 20052Traditional Query Traditional Query ProcessingProcessing-Exact answers NOT always required–DSS applications usually exploratory: early feedback to help identify “interesting” regions–Aggregate queries: precision to “last decimal” not needed•e.g., “What percentage of the US sales are in NJ?” SQL QueryExact AnswerDecisionSupport Systems(DSS) Long Response Times!GB/TB3-Primarily for Aggregate queries-Goal is to quickly report the leading digits of answers–In seconds instead of minutes or hours–Most useful if can provide error guarantees E.g., Average salary $59,000 +/- $500 (with 95% confidence) in 10 seconds vs. $59,152.25 in 10 minutes-Achieved by answering the query based on compact synopsescompact synopses of the data-Speed-up obtained because synopses are orders of magnitude smaller than the original dataFast Approximate AnswersFast Approximate Answers4ApproximateApproximate Query Query ProcessingProcessing-How do you How do you build effective data synopses???build effective data synopses???SQL QueryExact AnswerDecisionSupport Systems(DSS) Long Response Times!GB/TBCompact Data Synopses“Transformed” QueryKB/MBApproximate AnswerFAST!!5Sampling: BasicsSampling: Basics-Idea: A small random sample S of the data often well-represents all the data–For a fast approx answer, apply the query to S & “scale” the result–E.g., R.a is {0,1}, S is a 20% sampleselect count(*) from R where R.a = 0select 5 * count(*) from S where S.a = 01 1 0 1 1 1 1 1 0 0 00 1 1 1 1 1 0 11 1 0 1 0 1 10 1 1 0Red = in SR.aR.aEst. count = 5*2 = 10, Exact count = 10Unbiased: For expressions involving count, sum, avg: the estimatoris unbiased, i.e., the expected value of the answer is the actual answer,even for (most) queries with predicates!• Leverage extensive literature on confidence intervals for samplingActual answer is within the interval [a,b] with a given probability E.g., 54,000 ± 600 with prob  90%6Sampling: Confidence IntervalsSampling: Confidence Intervals-If predicates, S above is subset of sample that satisfies the predicate-Quality of the estimate depends only on the variance in R & |S| after the predicate: So 10K sample may suffice for 10B row relation!–Advantage of larger samples: can handle more selective predicatesGuarantees?90% Confidence Interval (±)Methodas (S)  (R)3.16 * (S) / sqrt(|S|)Chebyshev (est. (R))always3.16 * (R) / sqrt(|S|)Chebyshev (known (R))always1.22 * (MAX-MIN) / sqrt(|S|)Hoeffdingas |S|   1.65 * (S) / sqrt(|S|)Central Limit TheoremConfidence intervals for Average: select avg(R.A) from R(Can replace R.A with any arithmetic expression on the attributes in R)(R) = standard deviation of the values of R.A; (S) = s.d. for S.A7Sampling from DatabasesSampling from Databases-Sampling disk-resident data is slow–Row-level sampling has high I/O cost: •must bring in entire disk block to get the row–Block-level sampling: rows may be highly correlated–Random access pattern, possibly via an index–Need to account for the variable number of rows in a page, children in an index node, etc.-Alternatives–Random physical clustering: destroys “natural” clustering–Precomputed samples: must incrementally maintain (at specified size)•Fast to use: packed in disk blocks, can sequentially scan, can store as relation and leverage full DBMS query support, can store in main memory8One-Pass Uniform SamplingOne-Pass Uniform Sampling-Best choice for incremental maintenance–Low overheads, no random data access-Reservoir Sampling [Vit85]: Maintains a sample S of a fixed-size MMaintains a sample S of a fixed-size M–Add each new item to S with probability M/N, where N is the current number of data items–If add an item, evict a random item from S–Instead of flipping a coin for each item, determine the number of items to skip before the next to be added to S9HistogramsHistograms-Partition attribute value(s) domain into a set of buckets-Issues: –How to partition–What to store for each bucket–How to estimate an answer using the histogram-Long history of use for selectivity estimation within a query optimizer-Recently explored as a tool for fast approximate query processing101-D Histograms1-D Histograms-Number of buckets B << domain size-Each bucket just stores a total count–Distributed uniformly across values in the bucket-Partition criteriaPartition criteria–Equi-width: equal number of domain values per bucket (bad!!)–Equi-depth/height : equal count (“mass”) per bucket–V-Optimal : minimize total variance of value counts in bucketsCount inbucketDomain values1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 2011Answering Queries Using HistogramsAnswering Queries Using Histograms-Answering queries from 1-D histograms (in general):–(Implicitly) map the histogram back to an approximate relation, & apply the query to the approximate relation-Inside each bucket: Uniformity AssumptionUniformity Assumption–Continuous value mapping1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 201 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20Need numberof distinct ineach bucket3 2 1 2 3 1Count spreadevenly amongbucket values- Uniform spread mapping4  R.A  1512Haar Wavelet SynopsesHaar Wavelet Synopses-WaveletsWavelets:: mathematical tool for hierarchical decomposition of functions/signals -Haar

View Full Document


School:
Email:
New Password:
Confirm Password:

Berkeley COMPSCI 186 - Approximation Techniques for Data Management Systems

Sign up for free to view:

Please select your school