Introduction to Database Systems CSE 444 Lecture 20 Overview of Query Optimization CSE 444 Summer 2010 1 Where We Are Back to how a DBMS executes a query What we learned so far How data is stored and indexed lectures 15 and 16 Logical query plans relational algebra lecture 17 Steps involved in processing a query lecture 18 Operator algorithms lecture 19 Today How to select logical physical query plans Chapter Ch t 16 iin th the b book k recommended d d nott required i d CSE 444 Summer 2010 2 Query Optimization Goal For a query There exists many logical and physical query plans Query optimizer needs to pick a good one CSE 444 Summer 2010 3 Query Optimization Algorithm Enumerate alternative plans Compute estimated cost of each plan Compute number of I Os Compute CPU cost Choose plan with lowest cost This is called cost based optimization CSE 444 Summer 2010 4 Outline Search space Algorithm for enumerating query plans Estimating the cost of a query plan CSE 444 Summer 2010 5 Relational Algebra Equivalences Selections Commutative c1 c2 R same as c2 c1 R Cascading c1 c2 R same as c2 c1 R Projections Cascading Joins Commutative R S same as S R Associative R S T same as R S T CSE 444 Summer 2010 6 Left Deep Left Deep Plans and Bushy Plans R2 R4 R3 R1 Left deep Left deep plan R3 R1 R2 R4 Bushy plan CSE 444 Summer 2010 7 Relational Algebra Equivalences Selects projects and joins We can commute and combine all three types of operators We just have to be careful that the fields we need are available when we apply pp y the operator p Relatively straightforward See book 16 2 CSE 444 Summer 2010 8 Search Space Challenges Search space is huge Many possible equivalent trees Many implementations for each operator Many access paths for each relation File Fil scan or index i d matching t hi selection l ti condition diti Cannot consider ALL plans Want search space that includes low cost plans CSE 444 Summer 2010 9 Outline Search space Algorithms for enumerating query plans Estimating the cost of a query plan CSE 444 Summer 2010 10 Key Decisions When selecting a plan some of the most important decisions include Logical plan Can we push selections down Can we push projections or aggregations down What order to use for joins Physical plan What join algorithms to use What access paths to use file scan or index CSE 444 Summer 2010 11 Plan Enumeration Algorithms Rule based vs cost based algorithms g p plans Logical Heuristic based algorithms Use size of intermediate results as cost measure Physical plans Top down algorithms or Bottom up dynamic programming approaches Also called Selinger style optimizers Use heuristics to limit search space CSE 444 Summer 2010 12 Outline Search space Algorithms for enumerating query plans Estimating the cost of a query plan CSE 444 Summer 2010 13 Computing the Cost of a Plan Collect statistical summaries of stored data Compute cost in a bottom up fashion For each operator compute Estimate cost of executing g the operation p Estimate statistical summary of the output data CSE 444 Summer 2010 14 Statistics on Base Data Collected information for each relation Number of tuples cardinality Indexes number of keys in the index Number of physical pages clustering info Statistical information on attributes Min value max value number distinct values Histograms g Correlations between columns hard Collection approach pp p periodic using g sampling p g CSE 444 Summer 2010 15 Retrieving data from Storage Access path a way to retrieve tuples from a table A file scan An index plus a matching selection condition Index matches selection condition if it can be used to retrieve just tuples that satisfy the condition Example Supplier sid sname scity sstate B tree index on scity sstate matches scity scity Seattle Seattle does not match sid 3 does not match sstate WA CSE 444 Summer 2010 16 Access Path Selection Supplier sid sname scity sstate Selection condition sid 300 scity Seattle Indexes I d B t B tree on sid id and d B B tree t on scity it Which access path should we use We should p pick the most selective access p path CSE 444 Summer 2010 17 Access Path Selectivity Access path selectivity is the number of pages retrieved if we use this access path Most selective retrieves fewest pages As we saw earlier for equality predicates Selection on equality a v R V R a of distinct values of attribute a 1 V R a is thus the reduction factor Cl stered inde Clustered index on a a cost B R V R B R V R a a Unclustered index on a cost T R V R a we are ignoring I O cost of index pages for simplicity CSE 444 Summer 2010 18 Selectivity for Range Predicates Selection on range a v R How to compute the selectivity Assume values are uniformly distributed R d ti ffactor Reduction t X X Max R a v Max R a Min R a Clustered index on a cost is B R X Unclustered index on a cost is T R X CSE 444 Summer 2010 19 Back to Our Example Selection condition sid 300 scity Seattle Index I1 B tree on sid clustered Index I2 B B tree tree on scity unclustered Let s assume V Supplier scity 20 Max Supplier sid 1000 Min Supplier sid 1 B Supplier 100 100 T Supplier 1000 Cost I1 B R Max v Max Min 100 700 999 70 Cost I2 T R 1 V Supplier scity 1 V Supplier scity 1000 20 50 CSE 444 Summer 2010 20 Selectivity with Multiple Conditions What if we have an index on multiple attributes Example selection a v1 b v2 R and index on a b How to compute the selectivity Assume A attributes tt ib t are independent i d d t X 1 V R a V R b Clustered index on a b cost B R X cost T R X Unclustered index on a b CSE 444 Summer 2010 21 Computing Cost of an Operator The cost of executing an operator depends On the operator implementation On the input data We learned how to compute this in the previous lecture so we do not repeat it here CSE 444 Summer 2010 22 Statistics on the Output Data Most important piece of information Size of operator result I e I e the number of output tuples Projection output size same as input size Selection multiply input size by reduction factor Si Similar il to t what h t we did for f estimating ti ti access path th selectivity l ti it Assume independence between conditions in the predicate use product of the reduction factors for the terms CSE 444 Summer 2010 23 Estimating Result Sizes For joins R S Take product of cardinalities of relations R and S Apply reduction factors for each term in join condition Terms are of the form column1 column2 Reduction 1 MAX V R column1 V S …
View Full Document