Gordon CPS 352 - Query Processing Optimization

CS352 Lecture - Query Processing / Optimization
Last revised 11/6/06

Objectives:
1. To understand why the choice of strategy for performing a query can have a huge impact on processing time
2. To be familiar with various strategies for performing selection operations
3. To be familiar with various strategies for performing join operations
4. To be familiar with how statistical information can be used for evaluating strategies

Materials:
1. Projectables of pseudo-code for join strategies
2. Projectables of equivalence rules for queries (authors' powerpoints)

I. Introduction

A. Given a query, the DBMS must interpret it and "plan" a strategy for carrying it out. For all but the simplest queries, there will probably be several ways of doing so - with total processing effort possibly varying by several orders of magnitude.

B. Some of the issues have to do with the way that data is physically stored on disk. Recall that, for most queries, the cost of accesses to data on disk far exceeds any other cost, and becomes the most significant factor in determining the time needed to process most queries.

C. Some of the issues have to do with the fact that several different, but equivalent, formulations of a query may be possible - perhaps with vastly different execution costs.

Example: Given the schemes:

BookAuthor(callNo, authorName)
BookTitle(callNo, title)

1. The question "Give me the titles of all books written by Korth" can be answered by either of the following two relational algebra queries:

a) π_title (σ_authorName = 'Korth' (BookTitle |X| BookAuthor))

b) π_title (BookTitle |X| σ_authorName = 'Korth' (BookAuthor))

2. Because relational algebra is a procedural language, each of these queries implies a particular sequence of operations:

a) The first query suggests that we should first do a natural join of BookAuthor and BookTitle, then select from it the tuples with authorName = 'Korth', then project out the title field.

b) The second query suggests that we should first select only those tuples from BookAuthor where authorName = 'Korth', then join these with BookTitle, then project out the title field.

3. Suppose our database held 10,000 BookTitle tuples and 20,000 BookAuthor tuples (an average of two authors per book). Suppose, further, that only 2 BookAuthor tuples contained 'Korth'. How would the cost of these two strategies compare?

a) The first strategy would process all 10,000 BookTitle tuples at least once, and likewise all 20,000 BookAuthor tuples. Further, it would involve creating a temporary relation with 3 attributes (callNo, title, authorName) and (presumably) 20,000 tuples. Each of these would need to be examined to see if it pertains to 'Korth'. Thus, a total of 50,000 tuples would need to be processed - minimum. (And the minimum could only be achieved if the join is facilitated by an index on callNo for at least one of the two original schemes.)

b) On the other hand, the second strategy might involve processing only 2 BookAuthor tuples and the corresponding 2 BookTitle tuples (if appropriate indices exist) - for a total of 4 tuples. This is an effort ratio of 50,000:4 = 12,500:1.

D. A low-performance DBMS might put the burden on the user to formulate his/her queries in such a way as to allow the most efficient processing of them. A good DBMS, however, will transform a given query into a more efficient equivalent one whenever possible.

Example: If the first query above were given to a simple DBMS, it would perform much less efficiently than if it were given the second.
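The cost comparison above can be sketched in Python. This is an illustrative in-memory simulation, not how a real DBMS executes plans: the relation contents are made up to match the lecture's figures (10,000 titles, 20,000 authorships, exactly 2 'Korth' tuples), the "tuples touched" counter is an assumption standing in for disk accesses, and the dictionaries stand in for pre-existing indices (so building them is not counted).

```python
# Toy relations as lists of dicts; contents are illustrative, not from a real catalog.
book_title = [{"callNo": c, "title": f"Book {c}"} for c in range(10_000)]
book_author = (
    [{"callNo": c, "authorName": f"Author {c % 5_000}"} for c in range(10_000)]
    + [{"callNo": c, "authorName": f"Author {(c + 1) % 5_000}"} for c in range(10_000)]
)
# Exactly two authorship tuples match 'Korth' (for books 0 and 1).
book_author[0]["authorName"] = "Korth"
book_author[10_001]["authorName"] = "Korth"

# Stand-ins for indices assumed to already exist; building them is not counted.
titles_by_call = {t["callNo"]: t["title"] for t in book_title}
authors_by_name = {}
for a in book_author:
    authors_by_name.setdefault(a["authorName"], []).append(a)

def plan_join_then_select():
    """Strategy (a): natural join first, then select, then project."""
    touched = 0
    joined = []
    touched += len(book_title)            # every BookTitle tuple read once
    for a in book_author:                 # every BookAuthor tuple joined
        touched += 1
        joined.append((a["callNo"], titles_by_call[a["callNo"]], a["authorName"]))
    result = []
    for (_, title, author) in joined:     # scan the 20,000-tuple intermediate
        touched += 1
        if author == "Korth":
            result.append(title)
    return result, touched

def plan_select_then_join():
    """Strategy (b): select on authorName first, then join via the callNo index."""
    touched = 0
    result = []
    for a in authors_by_name.get("Korth", []):  # only the 2 matching tuples
        touched += 1
        result.append(titles_by_call[a["callNo"]])
        touched += 1                            # one BookTitle lookup each
    return result, touched
```

Running both plans returns the same titles, but the first touches 50,000 tuples while the second touches 4 - the 12,500:1 ratio from the lecture.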
However, a sophisticated DBMS would transform the first query into something like the second before processing it, even if the first were the form it was given.

E. In the remainder of this series of lectures, we want to explore the following topics:

1. Strategies for performing selection

2. Strategies for performing joins

Both of these, in turn, are influenced by issues such as the way the data is stored physically on the disk, and the availability of indices.

3. Rules of equivalence for transforming queries

4. The use of statistical information to help evaluate query-processing strategies.

II. Selection Strategies

A. Consider a selection expression like

σ_SomeCondition (SomeTable)

We consider several possibilities for the condition:

1. It involves searching for an exact match for the value of some attribute (e.g. "borrowerID = '12345'"), and the attribute is a candidate key for the table, so we expect at most one match.

2. It involves searching for an exact match for the value of some attribute, but the attribute is not a candidate key for the table, so we can have any number of matches.

3. It involves searching for a value of some attribute lying in some range (e.g. "where age between 21 and 65" or "where age < 21").

4. It involves some more complex condition - perhaps involving a compound condition ("and" or "or") or the values of two or more attributes.

B. One way to handle any selection condition is to use linear search - scan through all the tuples in the table, and find the ones that match.

1. In principle, this would seem to require one disk access for each tuple in the table.

2. However, if the tuples are blocked, then it suffices to perform one disk access per block, and then scan the buffered copies of the records in turn.

Example: a table with 10,000 tuples - blocked 20 per block - would require 500 disk accesses - which would take on the order of 500 * 10 ms = 5 seconds.

C. If the selection involves an exact match, and the tuples are stored in order of the attribute in question (e.g. the table has a primary index based on the attribute), then an exact match search can be done using binary search, with a number of accesses = log2(# of blocks).

1. If we are looking for a single exact match, this works just fine.

2. If we are looking for an exact match but expect to find multiple tuples, we may have to look at blocks before and/or after the one binary search finds, since binary search will get us to a block containing the value we want, but not necessarily the first, and there might be matches in more than one successive block.

3. If we are performing a range query, binary search can get us to the start of the range, and then we can scan sequentially until we get past the end of the range.

4. Binary search is of no value when dealing with a complex selection condition.

5. Binary search is of no value if the tuples are not stored in physical order of the attribute the search is based on.

D. Exact match and range
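The exact-match and range cases in C.2 and C.3 can be sketched with Python's `bisect` module over an in-memory sorted list - a stand-in for a table stored in physical key order (the table contents and names here are made up for illustration; real binary search would operate on disk blocks, not list elements):

```python
import bisect

# Toy table stored in physical order of the search key `age`.
rows = [("p1", 15), ("p2", 18), ("p3", 21), ("p4", 21), ("p5", 21),
        ("p6", 30), ("p7", 44), ("p8", 65), ("p9", 70)]
ages = [age for (_, age) in rows]   # the (already sorted) search-key column

def exact_match(key):
    """All tuples with age == key: binary-search to the FIRST match, then
    scan forward, since duplicates may continue into following blocks."""
    i = bisect.bisect_left(ages, key)
    out = []
    while i < len(rows) and rows[i][1] == key:
        out.append(rows[i])
        i += 1
    return out

def range_query(lo, hi):
    """Tuples with lo <= age <= hi: binary-search to the start of the range,
    then scan sequentially until past the end."""
    i = bisect.bisect_left(ages, lo)
    out = []
    while i < len(rows) and rows[i][1] <= hi:
        out.append(rows[i])
        i += 1
    return out
```

Using `bisect_left` (rather than an arbitrary hit) is what handles point C.2: it lands on the first tuple with the target value, so a single forward scan collects all duplicates. Note that, as points C.4 and C.5 say, this only works because the data is stored in key order and the condition is on that one key.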

