11/4/08 1 CMSC424: Database Design Instructor: Amol Deshpande [email protected] Data$Models$ Conceptual$representa1on$of$the$data$ Data$Retrieval$ How$to$ask$ques1ons$of$the$database$ How$to$answer$those$ques1ons$ Data$Storage$ How/where$to$store$data,$how$to$access$it$ Data$Integrity$ Manage$crashes,$concurrency$ Manage$seman1c$inconsistencies$Databases11/4/08 2 Query Processing Overview Selection operation Join operators Sorting Other operators Putting it all together… Overview User select * from R, S where … R, B+Tree on R.a S, Hash Index on S.a … Results Query Parser Resolve the references, Syntax errors etc. Converts the query to an internal format relational algebra like Query Optimizer Find the best way to evaluate the query Which index to use ? What join method to use ? … Query Processor Read the data from the files Do the query processing joins, selections, aggregates …11/4/08 3 “Cost” Complicated to compute We will focus on disk: Number of I/Os ? Not sufficient Number of seeks matters a lot… why ? tT – time to transfer one block tS – time for one seek Cost for b block transfers plus S seeks b * tT + S * tS Measured in seconds Query Processing Overview Selection operation Join operators Sorting Other operators Putting it all together…11/4/08 4 Selection Operation select * from person where SSN = “123” Option 1: Sequential Scan Read the relation start to end and look for “123” Can always be used (not true for the other options) Cost ? Let br = Number of relation blocks Then: 1 seek and br block transfers So: tS + br * tT sec Improvements: If SSN is a key, then can stop when found So on average, br/2 blocks accessed Selection Operation select * from person where SSN = “123” Option 2 : Binary Search: Pre-condition: The relation is sorted on SSN Selection condition is an equality E.g. can’t apply to “Name like ‘%424%’” Do binary search Cost of finding the first tuple that matches log2(br) * (tT + tS) All I/Os are random, so need a seek for all The last few are closeby, but we ignore such small effects Not quite: What if 10000 tuples match the condition ? Incurs additional cost11/4/08 5 Selection Operation select * from person where SSN = “123” Option 3 : Use Index Pre-condition: An appropriate index must exist Use the index Find the first leaf page that contains the search key Retrieve all the tuples that match by following the pointers If primary index, the relation is sorted by the search key Go to the relation and read blocks sequentially If secondary index, must follow all pointers using the index Selection w/ B+-Tree Indexes n * (tT + tS) n = number of records that match This can be bad hi * (tT + tS) secondary index, not a key, equality 1 * (tT + tS) hi * (tT + tS) secondary index, candidate key, equality 1 * (tT + tS) + (b – 1) * tT Note: primary == sorted b = number of pages that contain the matches hi * (tT + tS) primary index, not a key, equality 1 * (tT + tS) hi * (tT + tS) primary index, candidate key, equality cost of retrieving the tuples cost of finding the first leaf hi = height of the index11/4/08 6 Selection Operation Selections involving ranges select * from accounts where balance > 100000 select * from matches where matchdate between ’10/20/06’ and ’10/30/06’ Option 1: Sequential scan Option 2: Using an appropriate index Can’t use hash indexes for this purpose Cost formulas: Range queries == “equality” on “non-key” attributes So rows 3 and 5 in the preceding page Selection Operation Complex selections Conjunctive: select * from accounts where balance > 100000 and SSN = “123” Disjunctive: select * from accounts where balance > 100000 or SSN = “123” Option 1: Sequential scan Option 2 (Conjunctive only): Using an appropriate index on one of the conditions E.g. Use SSN index to evaluate SSN = “123”. Apply the second condtion to the tuples that match Or do the other way around (if index on balance exists) Which is better ? Option 3 (Conjunctive only) : Choose a multi-key index Not commonly available11/4/08 7 Selection Operation Complex selections Conjunctive: select * from accounts where balance > 100000 and SSN = “123” Disjunctive: select * from accounts where balance > 100000 or SSN = “123” Option 4: Conjunction or disjunction of record identifiers Use indexes to find all RIDs that match each of the conditions Do an intersection (for conjunction) or a union (for disjunction) Sort the records and fetch them in one shot Called “Index-ANDing” or “Index-ORing” Heavily used in commercial systems Query Processing Overview Selection operation Join operators Sorting Other operators Putting it all together…11/4/08 8 Join select * from R, S where R.a = S.a Called an “equi-join” select * from R, S where |R.a – S.a | < 0.5 Not an “equi-join” Option 1: Nested-loops for each tuple r in R for each tuple s in S check if r.a = s.a (or whether |r.a – s.a| < 0.5) Can be used for any join condition As opposed to some algorithms we will see later R called outer relation S called inner relation Nested-loops Join Cost ? Depends on the actual values of parameters, especially memory br, bs Number of blocks of R and S nr, ns Number of tuples of R and S Case 1: Minimum memory required = 3 blocks One to hold the current R block, one for current S block, one for the result being produced Blocks transferred: Must scan R tuples once: br For each R tuple, must scan S: nr * bs Seeks ? nr + br11/4/08 9 Nested-loops Join Case 1: Minimum memory required = 3 blocks Blocks transferred: nr * bs +
View Full Document