Contents

Overview: Hybrid Hashing for Grouped Aggregates
    Administrivia
    Tasks
Hash Grouped Aggregate Description
Code Walkthrough
    Obtain, Compile, and Test PostgreSQL
    Code Walkthrough
    Hashtable Initialization
    Determining the Number of Partitions
    Temporary files
    Managing Memory with Contexts
    Allocating a per-run context
Performance study: Sorting vs Hashing
Resources
What to submit

PostgreSQL Query Processor: Homework Assignment 2
CS186 Introduction to Database Systems
UC Berkeley
February 23, 2006
Due: March 14, 2006

1  Overview: Hybrid Hashing for Grouped Aggregates

In Project 1, you studied how to change the page replacement policy of the PostgreSQL buffer manager. In this project you will move to a higher level in the system and add functionality to the PostgreSQL query executor. We will restrict our focus to grouped aggregates. This project will be considerably more complex than Project 1, especially in terms of the amount of existing code you need to understand. The major parts of the project are:

1. A "big picture" understanding of a query executor
2. An understanding of different aggregation strategies
3. Examining, understanding, and enhancing existing code

In lecture, we will see how grouped aggregates can be implemented with sorting as well as with hashing. We are providing you with a PostgreSQL version that supports sorting and in-memory hashing. Your job is to understand the hashing grouped aggregate code that we provide, and to enhance it to deal with insufficient memory by spilling data to disk.

1.1  Administrivia

As with Project 1, we will provide you with a leaner PostgreSQL source tree. Due to space constraints, you will need to move your Homework 1 work off your student class account (though we recommend saving it somewhere in case there are grading problems). Even then, your PostgreSQL Homework 2 tree and the databases you need will not fit in your account. To save some space, when you decompress the project files into your account (see Section 3.1), you will notice that all source files you do not need to modify are symbolically linked to a central source distribution. Use the mkhometmpdir command (see /home/tmp/README.txt for more information) and put your PostgreSQL data directory (e.g. pgdata) in your /home/tmp/<username> directory. Then, edit your .cshrc and add the following line:

    setenv PGDATA /home/tmp/<username>/pgdata

If this directory exists from your work on Project 1, empty it.

Make sure your code runs smoothly on the EECS instructional machines prior to submission, since ALL grading will be done on those machines. The code for the instructional machines will only run correctly on the Solaris x86 boxes that run version 9; right now, these boxes are pentagon, rhombus, and cube. As before, you may find it easier to develop your code at home or on other machines you have access to. We have a special distribution for those who want to develop on machines other than the instructional machines. If you choose this option, please make sure your modified source files work on the instructional machines with the distribution meant for those machines. You should also run your performance experiments on the instructional machines, as you may see different performance elsewhere due to different system libraries, etc. Note that the instructions provided are designed to work with the EECS instructional machines and may not work as smoothly elsewhere. The TAs and Professor will only support these machines.

1.2  Tasks

The following table lists the tasks you will have to complete while working on this assignment.

Task  Name                                                                        Credit
1     Compile and test our version of PostgreSQL.                                 A donut
2     Study and understand the hash-based aggregation implementation provided.    A donut
3     Enhance hashed aggregation to spill to disk when required.                  75%
4     Compare the performance of sorted and hashed aggregation.                   25%

Please remember that you only have 3 weeks. This project will take a fair amount of time. Spend some time reading this document completely before beginning work. Start early and avoid a last-minute scramble!

Since understanding the code is an important part of this project, the TAs and Professor will not assist you in understanding the existing code beyond what is discussed here.

This handout is organized as follows. First, in Section 2 we describe the hashing grouped aggregation algorithm and the spilling algorithm that you need to implement. In Section 3, we give a brief walkthrough of the code you have, and Section 4 describes the performance study you have to complete. Section 5 gives some additional resources that you might find interesting, and Section 6 tells you what (and how) you need to submit.

2  Hash Grouped Aggregate Description

In this section, we first describe the hash grouped aggregate algorithm in the source we give you. We then describe the algorithm that you need to implement, which is a hash algorithm that spills to disk when the hash table memory size is exceeded. These algorithms will be covered in class sometime in the next week, and the notes should soon be available.

    Initialize the hash table
    while (tuple ← get next() ≠ NULL) do
        (hashKey, exists) ← lookup hash table(tuple)
        if exists then
            update trans-value in hash
        else
            initialize new trans-value
            insert (hashKey, trans-value) in hash
        endif
    endwhile
    for each (group-key, trans-value) in hash do
        append finalize(trans-value) to output stream
    endfor

The pseudocode for the hashing algorithm provided in the code is shown above (we detail later how this is implemented in the actual source). The aggregate algorithm runs in an operator that takes tuples from the operator below it in the query plan using the get next() call. For each input tuple, it performs a hash table lookup using the tuple's group-by field(s). If an entry exists, this entry is a transition value, which is a set of information about the aggregates of the tuples seen so far for a given group. For a maximum aggregate, for instance, the transition value would keep the maximum value of the tuples seen so far for that group. If such a transition value exists for the tuple, it is updated with the tuple. Otherwise, a new transition value is created and inserted into the hash table. Once all tuples are processed, the algorithm finalizes each transition value in the hash table and outputs it.
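To make this concrete, the following is a minimal, self-contained C sketch of the in-memory algorithm, computing MAX(value) per group over a hard-coded tuple stream. It is only an illustration of the technique: every name in it (agg_entry, lookup_or_insert, NBUCKETS, and the fixed tuples array standing in for get next()) is hypothetical and does not come from the PostgreSQL sources you will be modifying.

/*
 * Illustrative sketch of in-memory hash aggregation with transition values.
 * All names here are made up for the example; this is NOT the PostgreSQL
 * executor code, which the code walkthrough section covers.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

#define NBUCKETS 16            /* tiny fixed table just for the example */

typedef struct agg_entry {
    int   group_key;            /* the GROUP BY value                  */
    int   trans_value;          /* running MAX for this group so far   */
    struct agg_entry *next;     /* chain for hash collisions           */
} agg_entry;

static agg_entry *buckets[NBUCKETS];

/* Look up the group's transition value; create an entry if none exists. */
static agg_entry *lookup_or_insert(int group_key, bool *found)
{
    unsigned h = (unsigned) group_key % NBUCKETS;
    for (agg_entry *e = buckets[h]; e != NULL; e = e->next)
        if (e->group_key == group_key) { *found = true; return e; }

    agg_entry *e = malloc(sizeof(agg_entry));
    e->group_key = group_key;
    e->next = buckets[h];
    buckets[h] = e;
    *found = false;
    return e;
}

int main(void)
{
    /* Stand-in for get next(): a fixed stream of (group_key, value) tuples. */
    int tuples[][2] = { {1, 5}, {2, 7}, {1, 9}, {3, 2}, {2, 3} };
    int ntuples = sizeof(tuples) / sizeof(tuples[0]);

    for (int i = 0; i < ntuples; i++) {
        bool found;
        agg_entry *e = lookup_or_insert(tuples[i][0], &found);
        if (found) {
            if (tuples[i][1] > e->trans_value)   /* advance the transition value */
                e->trans_value = tuples[i][1];
        } else {
            e->trans_value = tuples[i][1];       /* initialize a new transition value */
        }
    }

    /* "Finalize" each transition value; for MAX the finalizer is the identity. */
    for (int h = 0; h < NBUCKETS; h++)
        for (agg_entry *e = buckets[h]; e != NULL; e = e->next)
            printf("group %d -> MAX = %d\n", e->group_key, e->trans_value);

    return 0;
}

Note that this sketch never measures how much memory the table consumes; deciding what to do when the table no longer fits in its memory budget is precisely the part you have to add in this assignment.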
The in-memory algorithm works fine as long as the hash table is small enough to fit in the memory allocated to it (call it HashMem). If not, the system must use another algorithm that uses the disk. We show such an algorithm (which you have to implement) below.

    // Initial pass
    Initialize the

