Lecture 9 Query Optimization Sept 26 2006 ChengXiang Zhai Most slides are adapted from Kevin Chang s lecture slides CS511 Advanced Database Management Systems 1 DBMS Architecture User Web Forms Applications DBA Today s lecture query transaction Query Parser Transaction Manager Query Rewriter Query Optimizer Lock Manager Logging Recovery Query Executor Files Access Methods Buffer Manager Buffers Main Memory Storage Manager Storage CS511 Advanced Database Management Systems Lock Tables Past lectures 2 SQL Constructs SECLECT DISTINCT list of columns FROM list of tables WHERE list of Boolean Factors GROUP BY list of columns HAVING list of Boolean Factors ORDER BY list of columns CS511 Advanced Database Management Systems 3 SQL Semantics Take Cartesian product of FROM tables Project only those referenced columns WHERE apply all filters in WHERE GROUP BY form groups on results HAVING apply filter to groups ORDER BY make sure results in right order DISTINCT remove duplicates Q Is this operational semantics efficient Different plans mainly different in the first three CS511 Advanced Database Management Systems 4 Optimization Different Strategies Optimal approach enumerate each possible plan measure its performance by running it pick the fastest one Heuristics approach fixed heuristics all the way through plan construction e g always nested loop joins indexed relation as inner e g order relations from smallest to biggest How does Selinger System R differ from them What are the main components in their approach CS511 Advanced Database Management Systems 5 New Paradigm Cost based Optimization Plan space what is the space of query plans Cost estimation how to estimate the cost without executing each Search algorithm how to search the space as guided by cost estimates CS511 Advanced Database Management Systems 6 Space of Query Plans What can you do differently Parameters to tune Selections Joins CS511 Advanced Database Management Systems 7 Space of Query Plans Selections algorithms sequential index scan ordering why this will matter Joins algorithms nested loop sort merge hash ordering Ordering Grouping can an interesting order be produced by join selections algorithm sorting hash based They interleave with each other CS511 Advanced Database Management Systems 8 Assumptions to Reduce the Space Projections pushed down to reduce of columns Selections Joins CS511 Advanced Database Management Systems 9 Huge Space Assumptions to Help Typical assumptions to help reduce the space Projections pushed down to reduce of columns Selections pushed down to reduce of rows Joins left deep joins what else can you do avoid Cartesian products delay it in the plan Q how to avoid Cartesian products May miss an optimal plan CS511 Advanced Database Management Systems 10 Cost Size Estimation Accurate relatively goal is to compare plans not to predict exact cost more of an art than an exact science Each operator input size cost output size estimate cost based on input size estimate output size for next operator or selectivity selectivity ratio of output to input CS511 Advanced Database Management Systems 11 Cost Estimation Selinger Style Input simple DB statistics of tuples disk pages of distinct values per column statistics updated periodically Assumption of attribute predicate independence When no estimate available use magic number New Alternative approaches sampling histogram of DB CS511 Advanced Database Management Systems 12 Selectivity Factors Point Selection column value dept CS input size NCARD Student 200 ICARD dept 10 output size selectivity CS511 Advanced Database Management Systems 13 Selectivity Factors Range Selection What is assumed in these formulas column value1 F maxValue value1 rangeOfValue column in value1 value2 F value2 value1 rangeOfValue column in set of values F as union of point selections CS511 Advanced Database Management Systems 14 Selectivity Factors Join Predicates column1 column2 when both columns are keys student sid employee sid one key the other foreign key student sid enrollment sid student NCARD 200 SID distinct values 200 enrollment NCARD 800 SID distinct values 100 800 200 200 800 none are keys user sid enrollement sid user NCARD 1000 SID distinct values 200 enrollment NCARD 800 SID distinct values 100 800 1000 200 4000 given formula covers all CS511 Advanced Database Management Systems 15 Goal Cost Estimate System R cost model Sum of I O and CPU PAGE W RSI CALLS why RSI calls model workload of RSS CS511 Advanced Database Management Systems 16 Search the Plan Space Baseline exhaustive search enumerate all combinations and compare their cost Search method parameters plan tree development construction bottom up top down modification improve a somehow constructed tree algorithms heuristic selections make choices based on heuristics branch and bound search bounded by the current best tree hill climbing find nearby plans with lowest cost dynamic programming construction by greedy selections where does System R approach fit in CS511 Advanced Database Management Systems 17 Plan Search System R Style AKA Selinger style optimization Bottom up start from ground relations in FROM work up the tree to form a plan Dynamic programming greedily prune subtrees that are obviously useless but not necessarily local best at every step why Many other approaches CS511 Advanced Database Management Systems 18 DP Local Optimal Global Optimal Think of optimizing T1 T2 T3 T4 T5 Sometime it does hold does plan T1 T2 T3 affect the rest of the plan Not always true example but hopefully at least give good plans CS511 Advanced Database Management Systems 19 System R Search Start from Relations Base relations access find all plans for accessing each base relations push down sargable arguments example of non sargable predicates choose good plans discard bad ones keep cheapest for unordered each interesting order Join ordering find all ways of joining a pair of base relations choose good plans discard bad ones CS511 Advanced Database Management Systems 20 WHERE Clause and Sargable Predicates WHERE conditions dept CS OR dept EE AND GPA 3 5 Normal forms DNF C OR E AND G CNF C OR E AND C OR G Difference in using the CNF DNF factors CS511 Advanced Database Management Systems 21 System R Search Left Deep Join Plans Join ordering consider only left deep trees n ordering for n tables basis find all ways of joining a pair of base relations choose good plans discard bad ones induction until you have a full plan k to k 1
View Full Document