Slide 1DatabasesQuery ProcessingOverview“Cost”Query ProcessingSelection OperationSelection OperationSelection OperationSelection w/ B+-Tree IndexesSelection OperationSelection OperationSelection OperationQuery ProcessingJoinNested-loops JoinNested-loops JoinNested-loops JoinBlock Nested-loops JoinBlock Nested-loops JoinBlock Nested-loops JoinIndex Nested-loops JoinIndex Nested-loops JoinIndex Nested-loops JoinSo far…Hash JoinHash JoinHash JoinHash JoinHash JoinHash JoinHash Join: IssuesHash Join: IssuesHash Join: IssuesHybrid Hash JoinHybrid Hash JoinSo far…Merge-Join (Sort-merge join)Merge-Join (Sort-merge join)Joins: SummaryQuery ProcessingSortingExternal sort-mergeExternal sort-mergeExample: External Sorting Using Sort-MergeExternal Merge Sort (Cont.)Query ProcessingGroup By and AggregationGroup By and AggregationGroup By and AggregationDuplicate EliminationSet operationsQuery ProcessingEvaluation of ExpressionsEvaluation of ExpressionsMaterializationPipeliningPipeliningPipelining: Demand-drivenHash-Join Iterator InterfaceHash-Join Iterator InterfacePipelining (Cont.)Recap: Query ProcessingCMSC424: Database DesignInstructor: Amol Deshpande [email protected]Data ModelsConceptual representation of the dataData RetrievalHow to ask questions of the databaseHow to answer those questionsData StorageHow/where to store data, how to access itData IntegrityManage crashes, concurrencyManage semantic inconsistenciesDatabasesQuery ProcessingOverviewSelection operation Join operatorsSortingOther operatorsPutting it all together…OverviewUserselect *from R, Swhere …R, B+Tree on R.aS, Hash Index on S.a…ResultsQuery ParserResolve the references,Syntax errors etc.Converts the query to an internal format relational algebra like Query OptimizerFind the best way to evaluate the query Which index to use ? What join method to use ? … Query ProcessorRead the data from the filesDo the query processing joins, selections, aggregates …“Cost”Complicated to computeWe will focus on disk:Number of I/Os ?Not sufficientNumber of seeks matters a lot… why ?tT – time to transfer one blocktS – time for one seekCost for b block transfers plus S seeks b * tT + S * tS Measured in secondsQuery ProcessingOverviewSelection operation Join operatorsSortingOther operatorsPutting it all together…Selection Operationselect * from person where SSN = “123”Option 1: Sequential ScanRead the relation start to end and look for “123”Can always be used (not true for the other options)Cost ?Let br = Number of relation blocksThen:1 seek and br block transfersSo:tS + br * tT secImprovements:If SSN is a key, then can stop when foundSo on average, br/2 blocks accessedSelection Operationselect * from person where SSN = “123”Option 2 : Binary Search:Pre-condition:The relation is sorted on SSNSelection condition is an equalityE.g. can’t apply to “Name like ‘%424%’”Do binary searchCost of finding the first tuple that matcheslog2(br) * (tT + tS)All I/Os are random, so need a seek for allThe last few are closeby, but we ignore such small effectsNot quite: What if 10000 tuples match the condition ?Incurs additional costSelection Operationselect * from person where SSN = “123”Option 3 : Use IndexPre-condition:An appropriate index must existUse the indexFind the first leaf page that contains the search keyRetrieve all the tuples that match by following the pointersIf primary index, the relation is sorted by the search keyGo to the relation and read blocks sequentiallyIf secondary index, must follow all pointers using the indexSelection w/ B+-Tree Indexesn * (tT + tS)n = number of records that matchThis can be badhi * (tT + tS)secondary index, not a key, equality1 * (tT + tS)hi * (tT + tS)secondary index, candidate key, equality1 * (tT + tS) + (b – 1) * tTNote: primary == sortedb = number of pages that contain the matcheshi * (tT + tS)primary index, not a key, equality1 * (tT + tS)hi * (tT + tS)primary index, candidate key, equalitycost of retrieving the tuplescost of finding the first leafhi = height of the indexSelection OperationSelections involving rangesselect * from accounts where balance > 100000select * from matches where matchdate between ’10/20/06’ and ’10/30/06’Option 1: Sequential scanOption 2: Using an appropriate indexCan’t use hash indexes for this purposeCost formulas:Range queries == “equality” on “non-key” attributesSo rows 3 and 5 in the preceding pageSelection OperationComplex selectionsConjunctive: select * from accounts where balance > 100000 and SSN = “123”Disjunctive: select * from accounts where balance > 100000 or SSN = “123”Option 1: Sequential scanOption 2 (Conjunctive only): Using an appropriate index on one of the conditionsE.g. Use SSN index to evaluate SSN = “123”. Apply the second condtion to the tuples that matchOr do the other way around (if index on balance exists)Which is better ?Option 3 (Conjunctive only) : Choose a multi-key index Not commonly availableSelection OperationComplex selectionsConjunctive: select * from accounts where balance > 100000 and SSN = “123”Disjunctive: select * from accounts where balance > 100000 or SSN = “123”Option 4: Conjunction or disjunction of record identifiersUse indexes to find all RIDs that match each of the conditionsDo an intersection (for conjunction) or a union (for disjunction)Sort the records and fetch them in one shotCalled “Index-ANDing” or “Index-ORing”Heavily used in commercial systemsQuery ProcessingOverviewSelection operation Join operatorsSortingOther operatorsPutting it all together…Joinselect * from R, S where R.a = S.aCalled an “equi-join”select * from R, S where |R.a – S.a | < 0.5Not an “equi-join”Option 1: Nested-loops for each tuple r in R for each tuple s in S check if r.a = s.a (or whether |r.a – s.a| < 0.5)Can be used for any join conditionAs opposed to some algorithms we will see laterR called outer relationS called inner relationNested-loops JoinCost ? Depends on the actual values of parameters, especially memorybr, bs Number of blocks of R and Snr, ns Number of
View Full Document