Solution Sketches
Qualifying Exam Part 2
COSC 6340 (Data Management)
May 10, 2002

Your Name:
Your SSN:
Your Mailing Address:

Problem 1 [15]: Implementing Joins
Problem 2 [10]: B+-trees
Problem 3 [18]: Query Optimization
Problem 4 [9]: Association Rules
Problem 5 [21]: Multidimensional Database Technology
Grade:

The exam is "open books/notes" and you have 105 minutes to complete the exam.

1) Implementing Joins [15]

Assume we have to implement a natural join between two relations R1(A, B, C) and R2(A, D). A is the primary key of R2, and R1.A is a foreign key referencing R2 (R1[A] ⊆ R2[A]). R1 and R2 are stored as unordered files (blocks are assumed to be 100% full); R1 contains 200000 tuples that are stored in 500 blocks, and R2 has 50000 tuples that are stored in 1000 blocks (that is, every R2.A value occurs on average four times as a value of R1.A). Moreover, assume that only a small buffer of 8 blocks is available. Based on these assumptions, answer the following questions (indicate whether your computations assume that the output relation is written back to disk or not):

a. How many tuples will the output relation R = R1 ⋈ R2 contain? [1]
200000

b. What is the cost of implementing the natural join using the block-nested loop join? [2]
M + (M*N)/(B-2) = 500 + 500*1000/6 = …

c. Now assume that either a hashed index on A of R1 or a hashed index on A of R2 is available (assume that there are no overflow pages). Compute the cost of the index nested loops join using the index on R1.A and of the index nested loops join using the index on R2.A (2 computations). [4]
Using the index on R2.A (R1 as the outer relation): 500 + 200000*(1+1) = …
Using the index on R1.A (R2 as the outer relation): 1000 + 50000*(1+4) = …

d. Is it possible to apply the hash join in this case (explain your answer!)? [2]
No; the hash join needs roughly B > sqrt(500) ≈ 22 buffer blocks, and only 8 buffer blocks are available.

e. Which of the 4 proposed implementations is the best? [1]
b
Defining an index on

f. Now assume that the response time of the methods you evaluated is not satisfactory. What else could be done to speed things up? [5]
Change the database design: create a new relation R = R2 OUTERJOIN R1. The cost reduces to … (answer omitted)
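The cost arithmetic in parts b, c, and d can be re-evaluated with a short Python sketch. The block counts, buffer size, and formulas are the ones from the solution above; the ceiling-based variant of the block-nested loop cost is an added detail, not part of the original sketch.

import math

M, N, B = 500, 1000, 8            # blocks of R1, blocks of R2, buffer blocks
tuples_R1, tuples_R2 = 200000, 50000

# b) Block-nested loop join: scan the outer once, scan the inner once per
#    (B-2)-block chunk of the outer.
bnl = M + (M * N) / (B - 2)                        # 500 + 500*1000/6 ≈ 83833
bnl_whole_chunks = M + math.ceil(M / (B - 2)) * N  # 500 + 84*1000 = 84500

# c) Index nested loops join with a hash index on R2.A:
#    scan R1; each R1 tuple costs one bucket access plus one matching R2 tuple.
inl_idx_on_R2 = M + tuples_R1 * (1 + 1)            # 400500
#    Index nested loops join with a hash index on R1.A:
#    scan R2; each R2 tuple costs one bucket access plus its 4 matching R1 tuples.
inl_idx_on_R1 = N + tuples_R2 * (1 + 4)            # 251000

# d) A two-pass hash join needs roughly B > sqrt(blocks of the smaller relation).
hash_join_feasible = B > math.sqrt(min(M, N))      # False: 8 < sqrt(500) ≈ 22.4

print(bnl, bnl_whole_chunks, inl_idx_on_R2, inl_idx_on_R1, hash_join_feasible)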
2) B+-Tree Index Structures [10]

Assume a relation R(A, B, C, D) is given; R is stored as an unordered file and contains 1000000 (1 million) tuples. Attributes A, B, C, D need 4 bytes of storage each, and blocks have a size of 4096 bytes. Moreover, the following query is given:
Q1: SELECT A, C, D FROM R WHERE B < 12222 AND B > 1300 (returns 30000 answers)

a) What is the cost of implementing Q1 assuming no index structures are available? [2]
Scan R; the cost is 3907 block accesses.

b) Now assume a B+-tree index using attribute B as the search key is created. What parameters p (maximum number of pointers in an intermediate node) and m (number of entries in a leaf node) would be chosen for the B+-tree, and what will be the height of the B+-tree (assuming that each node of the tree is filled 60%)? You can assume that B+-tree node pointers require 4 bytes of storage and index pointers also require 4 bytes of storage. Give reasons for your answers! [4]
p = m = 512; p*m*(0.6)^2 ≈ 94000 < 1000000 and p*p*m*(0.6)^3 ≈ 29000000 > 1000000; therefore 3 levels.

c) Based on your answers to b), compute the cost for Q1 assuming the B+-tree index is used. Explain your computations! [4]
2 (accesses to B+-tree intermediate nodes) + 98 (retrieve the blocks of leaf nodes that satisfy the selection range of the query) + 30000 (accesses to blocks of relation R) = 30100

3) Query Optimization [18]

Assume 4 relations R1(A,B), R2(A,C), R3(A,D) and R4(A,E), and the following SQL query is given:
SELECT B, C, D FROM R1, R2, R3, R4 WHERE R1.A=R2.A AND R2.A=R3.A AND R3.A=R4.A

a) Give a query execution plan (Chaudhuri calls those physical operator trees) that implements the above query (it does not need to be the most efficient one). [3]

b) The System R SQL optimizer only considers linear plans for query execution. Give an example of a plan for the above query that would not be considered by the SQL optimizer. [2]
Same as a)

c) Using the above query as an example, explain the principle of optimality (explained on page 2 of the Chaudhuri paper). How does it help in reducing the number of plans that need to be evaluated by the query optimizer? [5]
Principle of optimality: for queries that involve k subexpressions it suffices to find an optimal plan for a single subexpression, and then extend this plan by finding optimal plans for the remaining subexpressions (one by one).
Example (see page 2 of the paper).
The principle of optimality reduces the number of plans to be considered from O(n!) to O(n*2^(n-1)).

d) Another critical problem in query optimization is the propagation of statistical information. Explain what this problem is by using the query plan you generated for subproblem a) as an example. [4]
In order to evaluate the cost of the plan ((R1 NJOIN R2) NJOIN (R3 NJOIN R4)) it is necessary to predict the size and statistical properties of the two relations that are obtained by joining R1 and R2, and by joining R3 and R4. The size and statistical properties of those intermediate relations have to be predicted by "propagating the statistics of R1, R2, R3, R4"; this information is then used to determine the cost of the third join operation.

e) Chaudhuri states that "a desirable optimizer is one where … the enumeration algorithm is efficient". What does this requirement mean? [4]
The enumeration algorithm searches the implementation plan space of a query for a "good" or optimal solution; to be efficient it needs the following characteristics:
a. it constructs the plans to be considered quickly
b. it employs heuristics that give preference to considering "good" plans before "bad" plans
c. it is complete in the sense that all "promising" plans are considered

4) Association Rule Mining [9]

Compare the approach to association rule mining that relies on support, confidence, and APRIORI-style algorithms with the approach that is advocated by the Cohen/Datar paper. What are the main differences between the two approaches? What are the similarities of the two approaches? Do you think the two approaches are complementary? Limit your answer to 8-12 sentences!

Solution Sketch
1) Both are interested in mining association rules.
2) APRIORI centers on high-support, high-confidence rules; COHEN/DATAR centers on high-confidence, high-correlation rules.
3) Both need to find efficient ways to compute item sets that satisfy minimum confidence requirements.
4) Phase 3 in COHEN/DATAR (scanning the
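Referring back to Problem 2 above, the node parameters, tree height, and the cost of Q1 follow from the 4096-byte blocks, 4-byte keys and pointers, and the 60% fill factor given in the exam; the Python sketch below only spells out that arithmetic (the next-leaf sibling pointer is ignored, as in the solution sketch).

import math

BLOCK, KEY, PTR = 4096, 4, 4      # block size, search-key size, pointer size (bytes)
TUPLES, FILL = 1000000, 0.6       # tuples in R, assumed fill factor of every node

# b) Intermediate node: p child pointers and p-1 keys must fit into one block.
p = (BLOCK + KEY) // (KEY + PTR)                 # 512
#    Leaf node: m (key, record-pointer) entries per block.
m = BLOCK // (KEY + PTR)                         # 512

#    Height: smallest number of levels whose 60%-filled capacity covers all tuples.
levels = 1
while m * FILL * (p * FILL) ** (levels - 1) < TUPLES:
    levels += 1                                  # ends with levels = 3

# c) Cost of Q1 with the index: non-leaf node accesses + qualifying leaf blocks
#    + one block access per qualifying tuple of the unordered file.
answers = 30000
leaf_blocks = math.ceil(answers / (m * FILL))    # 98
cost = (levels - 1) + leaf_blocks + answers      # 2 + 98 + 30000 = 30100
print(p, m, levels, cost)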


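The principle of optimality from Problem 3 can be illustrated with a small dynamic-programming search over left-deep join orders. The cardinalities, the join selectivity, and the cost model (sum of intermediate result sizes) are invented for the illustration and are not part of the exam; the point is only that the cheapest plan per subset of relations is computed once and then extended, instead of costing all n! complete orders.

from itertools import permutations

# Made-up cardinalities for R1..R4 and a single made-up equi-join selectivity on A.
card = {"R1": 1000, "R2": 500, "R3": 2000, "R4": 100}
SEL = 0.001

def join_size(left_size, rel):
    return left_size * card[rel] * SEL

def cost_of_order(order):
    # Toy cost model: sum of the sizes of all intermediate join results.
    size, cost = card[order[0]], 0.0
    for rel in order[1:]:
        size = join_size(size, rel)
        cost += size
    return cost

# Exhaustive search: cost every one of the n! left-deep orders.
best_exhaustive = min(permutations(card), key=cost_of_order)

# Principle of optimality: keep only the cheapest (cost, size, order) per subset of
# relations and extend each subset by one relation at a time; every subset is solved once.
best = {frozenset([r]): (0.0, card[r], (r,)) for r in card}
for _ in range(len(card) - 1):
    for subset, (cost, size, order) in list(best.items()):
        for rel in card:
            if rel in subset:
                continue
            out = join_size(size, rel)
            key = subset | {rel}
            if key not in best or cost + out < best[key][0]:
                best[key] = (cost + out, out, order + (rel,))

# Both searches find a cheapest left-deep order for this toy instance.
print("exhaustive         :", best_exhaustive)
print("dynamic programming:", best[frozenset(card)][2])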

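As a small illustration of the distinction drawn in point 2 of the Problem 4 sketch, the toy rule below has low support but perfect confidence; the transactions are invented for the example. A support-threshold-driven APRIORI run would prune such a rule, which is exactly the kind of rule a confidence/correlation-oriented approach of the Cohen/Datar kind targets.

# Invented toy transactions; support and confidence computed in the usual way.
transactions = [
    {"milk", "bread"}, {"milk", "bread"}, {"bread", "eggs"},
    {"bread", "eggs"}, {"milk", "eggs"}, {"bread"},
    {"milk", "bread", "eggs"}, {"bread"}, {"milk"},
    {"caviar", "champagne"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    return support(lhs | rhs) / support(lhs)

# Rule {caviar} -> {champagne}
print(support({"caviar", "champagne"}))        # 0.1: pruned under, e.g., a 25% support threshold
print(confidence({"caviar"}, {"champagne"}))   # 1.0: perfect confidence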
