Lecture 23: Query Optimization (3) Monday, November 22, 2010 Dan Suciu -- 444 Spring 2010 1Dan Suciu -- 444 Spring 2010 2 Outline • Search space • Algorithms for enumerating query plans • Estimating the cost of a query planDan Suciu -- 444 Spring 2010 3 Computing the Cost of a Plan • Collect statistical summaries of stored data • Estimate size in a bottom-up fashion • Estimate cost by using the estimated sizeDan Suciu -- 444 Spring 2010 4 Statistics on Base Data • Collected information for each relation – Number of tuples (cardinality) – Indexes, number of keys in the index – Number of physical pages, clustering info – Statistical information on attributes • Min value, max value, number distinct values • Histograms – Correlations between columns (hard) • Collection approach: periodic, using samplingSize Estimation Problem Dan Suciu -- 444 Spring 2010 5 S = SELECT list! FROM R1, …, Rn ! WHERE cond1 AND cond2 AND . . . AND condk"Given T(R1), T(R2), …, T(Rn) Estimate T(S) How can we do this ? Note: doesn’t have to be exact.Size Estimation Problem Dan Suciu -- 444 Spring 2010 6 S = SELECT list! FROM R1, …, Rn ! WHERE cond1 AND cond2 AND . . . AND condk"Remark: T(S) ≤ T(R1) × T(R2) × … × T(Rn)Selectivity Factor • Each condition cond reduces the size by some factor called selectivity factor • Assuming independence, multiply the selectivity factors Dan Suciu -- 444 Spring 2010 7Example Dan Suciu -- 444 Spring 2010 8 SELECT * FROM R, S, T WHERE R.B=S.B and S.C=T.C and R.A<40 R(A,B) S(B,C) T(C,D) T(R) = 30k, T(S) = 200k, T(T) = 10k Selectivity of R.B = S.B is 1/3 Selectivity of S.C = T.C is 1/10 Selectivity of R.A < 40 is ½ What is the estimated size of the query output ?Rule of Thumb • If selectivities are unknown, then: selectivity factor = 1/10 [System R, 1979] Dan Suciu -- 444 Spring 2010 910"Selectivities from Statistics • Condition is A = c /* value selection on R */ – Selectivity = 1/V(R,A) • Condition is A < c /* range selection on R */ – Selectivity = (c - Low(R, A))/(High(R,A) - Low(R,A))T(R) • Condition is A = B /* R ⨝A=B S */ – Selectivity = 1 / max(V(R,A),V(S,A)) – (will explain next) Dan Suciu -- 444 Spring 201011"Assumptions • Containment of values: if V(R,A) <= V(S,B), then the set of A values of R is included in the set of B values of S – Note: this indeed holds when A is a foreign key in R, and B is a key in S • Preservation of values: for any other attribute C, V(R ⨝A=B S, C) = V(R, C) (or V(S, C)) Dan Suciu -- 444 Spring 201012"Selectivity of R ⨝A=B S Assume V(R,A) <= V(S,B) • mmmmmmmhk,mmmbknmmmmmmmmmmmmktt • Each tuple t in R joins with T(S)/V(S,B) tuple(s) in S • Hence T(R ⨝A=B S) = T(R) T(S) / V(S,B) In general: T(R ⨝A=B S) = T(R) T(S) / max(V(R,A),V(S,B)) Dan Suciu -- 444 Spring 201013"Size Estimation for Join Example: • T(R) = 10000, T(S) = 20000 • V(R,A) = 100, V(S,B) = 200 • How large is R ⨝A=B S ? Dan Suciu -- 444 Spring 201014 Histograms • Statistics on data maintained by the RDBMS • Makes size estimation much more accurate (hence, cost estimations are more accurate) Dan Suciu -- 444 Spring 2010Histograms Dan Suciu -- 444 Spring 2010 15 Employee(ssn, name, age) T(Employee) = 25000, V(Empolyee, age) = 50 min(age) = 19, max(age) = 68 σage=48(Empolyee) = ? σage>28 and age<35(Empolyee) = ?Histograms Dan Suciu -- 444 Spring 2010 16 Employee(ssn, name, age) T(Employee) = 25000, V(Empolyee, age) = 50 min(age) = 19, max(age) = 68 σage=48(Empolyee) = ? Estimate = 25000 / 50 = 500 Estimate = 25000 * 6 / 60 = 2500 σage>28 and age<35(Empolyee) = ?Histograms Dan Suciu -- 444 Spring 2010 17 Age:" 0..20" 20..29" 30-39" 40-49" 50-59" > 60"Tuples" 200" 800" 5000" 12000" 6500" 500"Employee(ssn, name, age) T(Employee) = 25000, V(Empolyee, age) = 50 min(age) = 19, max(age) = 68 σage=48(Empolyee) = ? σage>28 and age<35(Empolyee) = ?Histograms 18 Employee(ssn, name, age) T(Employee) = 25000, V(Empolyee, age) = 50 min(age) = 19, max(age) = 68 σage=48(Empolyee) = ? Estimate = 1200 Estimate = 2*80 + 5*500 = 2660 σage>28 and age<35(Empolyee) = ? Age:" 0..20" 20..29" 30-39" 40-49" 50-59" > 60"Tuples" 200" 800" 5000" 12000" 6500" 500"Types of Histograms • How should we determine the bucket boundaries in a histogram ? Dan Suciu -- 444 Spring 2010 19Types of Histograms • How should we determine the bucket boundaries in a histogram ? • Eq-Width • Eq-Depth • Compressed Dan Suciu -- 444 Spring 2010 20Histograms Dan Suciu -- 444 Spring 2010 21 Age:" 0..20" 20..29" 30-39" 40-49" 50-59" > 60"Tuples" 200" 800" 5000" 12000" 6500" 500"Employee(ssn, name, age) Age:" 0..20" 20..29" 30-39" 40-49" 50-59" > 60"Tuples" 1800" 2000" 2100" 2200" 1900" 1800"Eq-width: Eq-depth: Compressed: store separately some highly frequent values: (48,1900)Difficult Questions on Histograms • Small number of buckets – Hundreds, or thousands, but not more – WHY ? • Not updated during database update, but recomputed periodically – WHY ? • Multidimensional histograms rarely used – WHY ? Dan Suciu -- 444 Spring 2010 22Summary of Query Optimization • Three parts: – search space, algorithms, size/cost estimation • Ideal goal: find optimal plan. But – Impossible to estimate accurately – Impossible to search the entire space • Goal of today’s optimizers: – Avoid very bad plans Dan Suciu -- 444 Spring 2010
View Full Document