Solution Sketches
Qualifying Exam Part 2
COSC 6340 (Data Management)
May 10, 2002

Your Name:
Your SSN:
Your Mailing Address:

Problem 1 [15]: Implementing Joins
Problem 2 [10]: B+-trees
Problem 3 [18]: Query Optimization
Problem 4 [9]: Association Rules
Problem 5 [21]: Multidimensional Database Technology
Grade:

The exam is "open books/notes" and you have 105 minutes to complete the exam.

1) Implementing Joins [15]

Assume we have to implement a natural join between two relations R1(A, B, C) and R2(A, D). A is the primary key of R2, and R1.A is a foreign key referencing R2 (R1[A] ⊆ R2[A]). R1 and R2 are stored as unordered files (blocks are assumed to be 100% full); R1 contains 200000 tuples that are stored in 500 blocks, and R2 has 50000 tuples that are stored in 1000 blocks (that is, every R2.A value occurs on average four times as a value of R1.A). Moreover, assume that only a small buffer of 8 blocks is available. Based on these assumptions, answer the following questions (indicate whether your computations assume that the output relation is written back to disk or not):

a. How many tuples will the output relation R = R1 ⋈ R2 contain? [1]
200000

b. What is the cost of implementing the natural join using the block-nested loop join? [2]
M + (M*N)/(B-2) = 500 + 500*1000/6 = …

c. Now assume that either a hashed index on A of R1 or a hashed index on A of R2 is available (assume that there are no overflow pages). Compute the cost of the index nested loops join using the index on R1.A and of the index nested loops join using the index on R2.A (2 computations). [4]
Using the index on R2.A (R1 as the outer relation): 500 + 200000*(1+1) = …
Using the index on R1.A (R2 as the outer relation): 1000 + 50000*(1+4) = …

d. Is it possible to apply the hash join in this case (explain your answer!)? [2]
No; the hash join needs roughly B > sqrt(500) ≈ 22 buffer blocks, and only 8 buffer blocks are available.

e. Which of the 4 proposed implementations is the best? [1]
b
Defining an index on

f. Now assume that the response time of the methods you evaluated is not satisfactory. What else could be done to speed things up? [5]
Change the database design: create a new relation R = R2 OUTERJOIN R1. The cost reduces to … (answer omitted)
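The cost arithmetic in parts b, c, and d can be re-evaluated with a short Python sketch. The block counts, buffer size, and formulas are the ones from the solution above; the ceiling-based variant of the block-nested loop cost is an added detail, not part of the original sketch.

import math

M, N, B = 500, 1000, 8            # blocks of R1, blocks of R2, buffer blocks
tuples_R1, tuples_R2 = 200000, 50000

# b) Block-nested loop join: scan the outer once, scan the inner once per
#    (B-2)-block chunk of the outer.
bnl = M + (M * N) / (B - 2)                        # 500 + 500*1000/6 ≈ 83833
bnl_whole_chunks = M + math.ceil(M / (B - 2)) * N  # 500 + 84*1000 = 84500

# c) Index nested loops join with a hash index on R2.A:
#    scan R1; each R1 tuple costs one bucket access plus one matching R2 tuple.
inl_idx_on_R2 = M + tuples_R1 * (1 + 1)            # 400500
#    Index nested loops join with a hash index on R1.A:
#    scan R2; each R2 tuple costs one bucket access plus its 4 matching R1 tuples.
inl_idx_on_R1 = N + tuples_R2 * (1 + 4)            # 251000

# d) A two-pass hash join needs roughly B > sqrt(blocks of the smaller relation).
hash_join_feasible = B > math.sqrt(min(M, N))      # False: 8 < sqrt(500) ≈ 22.4

print(bnl, bnl_whole_chunks, inl_idx_on_R2, inl_idx_on_R1, hash_join_feasible)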
2) B+-Tree Index Structures [10]

Assume a relation R(A, B, C, D) is given; R is stored as an unordered file and contains 1000000 (1 million) tuples. Attributes A, B, C, D need 4 bytes of storage each, and blocks have a size of 4096 bytes. Moreover, the following query is given:
Q1: SELECT A, C, D FROM R WHERE B < 12222 AND B > 1300 (returns 30000 answers)

a) What is the cost of implementing Q1 assuming no index structures are available? [2]
Scan R; the cost is 3907 block accesses.

b) Now assume a B+-tree index using attribute B as the search key is created. What parameters p (maximum number of pointers in an intermediate node) and m (number of entries in a leaf node) would be chosen for the B+-tree, and what will be the height of the B+-tree (assuming that each node of the tree is filled 60%)? You can assume that B+-tree node pointers require 4 bytes of storage and index pointers also require 4 bytes of storage. Give reasons for your answers! [4]
p = m = 512; p*m*(0.6)^2 ≈ 94000 < 1000000 and p*p*m*(0.6)^3 ≈ 29000000 > 1000000; therefore 3 levels.

c) Based on your answers to b), compute the cost for Q1 assuming the B+-tree index is used. Explain your computations! [4]
2 (accesses to B+-tree intermediate nodes) + 98 (retrieve the blocks of leaf nodes that satisfy the selection range of the query) + 30000 (accesses to blocks of relation R) = 30100

3) Query Optimization [18]

Assume 4 relations R1(A,B), R2(A,C), R3(A,D) and R4(A,E), and the following SQL query is given:
SELECT B, C, D FROM R1, R2, R3, R4 WHERE R1.A=R2.A AND R2.A=R3.A AND R3.A=R4.A

a) Give a query execution plan (Chaudhuri calls those physical operator trees) that implements the above query (it does not need to be the most efficient one). [3]

b) The System R SQL optimizer only considers linear plans for query execution. Give an example of a plan for the above query that would not be considered by the SQL optimizer. [2]
Same as a)

c) Using the above query as an example, explain the principle of optimality (explained on page 2 of the Chaudhuri paper). How does it help in reducing the number of plans that need to be evaluated by the query optimizer? [5]
Principle of optimality: for queries that involve k subexpressions it suffices to find an optimal plan for a single subexpression, and then extend this plan by finding optimal plans for the remaining subexpressions (one by one).
Example (see page 2 of the paper).
The principle of optimality reduces the number of plans to be considered from O(n!) to O(n*2^(n-1)).

d) Another critical problem in query optimization is the propagation of statistical information. Explain what this problem is by using the query plan you generated for subproblem a) as an example. [4]
In order to evaluate the cost of the plan ((R1 NJOIN R2) NJOIN (R3 NJOIN R4)) it is necessary to predict the size and statistical properties of the two relations that are obtained by joining R1 and R2, and by joining R3 and R4. The size and statistical properties of those intermediate relations have to be predicted by "propagating the statistics of R1, R2, R3, R4"; this information is then used to determine the cost of the third join operation.

e) Chaudhuri states that "a desirable optimizer is one where … the enumeration algorithm is efficient". What does this requirement mean? [4]
The enumeration algorithm searches the implementation plan space of a query for a "good" or optimal solution; to be efficient it needs the following characteristics:
a. it constructs the plans to be considered quickly
b. it employs heuristics that give preference to considering "good" plans before "bad" plans
c. it is complete in the sense that all "promising" plans are considered

4) Association Rule Mining [9]

Compare the approach to association rule mining that relies on support, confidence, and APRIORI-style algorithms with the approach that is advocated by the Cohen/Datar paper. What are the main differences between the two approaches? What are the similarities of the two approaches? Do you think the two approaches are complementary? Limit your answer to 8-12 sentences!

Solution Sketch
1) Both are interested in mining association rules.
2) APRIORI centers on high-support, high-confidence rules; COHEN/DATAR centers on high-confidence, high-correlation rules.
3) Both need to find efficient ways to compute item sets that satisfy minimum confidence requirements.
4) Phase 3 in COHEN/DATAR (scanning the
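Referring back to Problem 2 above, the node parameters, tree height, and the cost of Q1 follow from the 4096-byte blocks, 4-byte keys and pointers, and the 60% fill factor given in the exam; the Python sketch below only spells out that arithmetic (the next-leaf sibling pointer is ignored, as in the solution sketch).

import math

BLOCK, KEY, PTR = 4096, 4, 4      # block size, search-key size, pointer size (bytes)
TUPLES, FILL = 1000000, 0.6       # tuples in R, assumed fill factor of every node

# b) Intermediate node: p child pointers and p-1 keys must fit into one block.
p = (BLOCK + KEY) // (KEY + PTR)                 # 512
#    Leaf node: m (key, record-pointer) entries per block.
m = BLOCK // (KEY + PTR)                         # 512

#    Height: smallest number of levels whose 60%-filled capacity covers all tuples.
levels = 1
while m * FILL * (p * FILL) ** (levels - 1) < TUPLES:
    levels += 1                                  # ends with levels = 3

# c) Cost of Q1 with the index: non-leaf node accesses + qualifying leaf blocks
#    + one block access per qualifying tuple of the unordered file.
answers = 30000
leaf_blocks = math.ceil(answers / (m * FILL))    # 98
cost = (levels - 1) + leaf_blocks + answers      # 2 + 98 + 30000 = 30100
print(p, m, levels, cost)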


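The principle of optimality from Problem 3 can be illustrated with a small dynamic-programming search over left-deep join orders. The cardinalities, the join selectivity, and the cost model (sum of intermediate result sizes) are invented for the illustration and are not part of the exam; the point is only that the cheapest plan per subset of relations is computed once and then extended, instead of costing all n! complete orders.

from itertools import permutations

# Made-up cardinalities for R1..R4 and a single made-up equi-join selectivity on A.
card = {"R1": 1000, "R2": 500, "R3": 2000, "R4": 100}
SEL = 0.001

def join_size(left_size, rel):
    return left_size * card[rel] * SEL

def cost_of_order(order):
    # Toy cost model: sum of the sizes of all intermediate join results.
    size, cost = card[order[0]], 0.0
    for rel in order[1:]:
        size = join_size(size, rel)
        cost += size
    return cost

# Exhaustive search: cost every one of the n! left-deep orders.
best_exhaustive = min(permutations(card), key=cost_of_order)

# Principle of optimality: keep only the cheapest (cost, size, order) per subset of
# relations and extend each subset by one relation at a time; every subset is solved once.
best = {frozenset([r]): (0.0, card[r], (r,)) for r in card}
for _ in range(len(card) - 1):
    for subset, (cost, size, order) in list(best.items()):
        for rel in card:
            if rel in subset:
                continue
            out = join_size(size, rel)
            key = subset | {rel}
            if key not in best or cost + out < best[key][0]:
                best[key] = (cost + out, out, order + (rel,))

# Both searches find a cheapest left-deep order for this toy instance.
print("exhaustive         :", best_exhaustive)
print("dynamic programming:", best[frozenset(card)][2])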

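As a small illustration of the distinction drawn in point 2 of the Problem 4 sketch, the toy rule below has low support but perfect confidence; the transactions are invented for the example. A support-threshold-driven APRIORI run would prune such a rule, which is exactly the kind of rule a confidence/correlation-oriented approach of the Cohen/Datar kind targets.

# Invented toy transactions; support and confidence computed in the usual way.
transactions = [
    {"milk", "bread"}, {"milk", "bread"}, {"bread", "eggs"},
    {"bread", "eggs"}, {"milk", "eggs"}, {"bread"},
    {"milk", "bread", "eggs"}, {"bread"}, {"milk"},
    {"caviar", "champagne"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    return support(lhs | rhs) / support(lhs)

# Rule {caviar} -> {champagne}
print(support({"caviar", "champagne"}))        # 0.1: pruned under, e.g., a 25% support threshold
print(confidence({"caviar"}, {"champagne"}))   # 1.0: perfect confidence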
