MIT 6 830 - Study Notes

Join Processing in Database Systems with Large Main Memories

LEONARD D. SHAPIRO, North Dakota State University

We study algorithms for computing the equijoin of two relations in a system with a standard architecture but with large amounts of main memory. Our algorithms are especially efficient when the main memory available is a significant fraction of the size of one of the relations to be joined; but they can be applied whenever there is memory equal to approximately the square root of the size of one relation. We present a new algorithm which is a hybrid of two hash-based algorithms and which dominates the other algorithms we present, including sort-merge. Even in a virtual memory environment, the hybrid algorithm dominates all the others we study. Finally, we describe how three popular tools to increase the efficiency of joins, namely filters, Babb arrays, and semijoins, can be grafted onto any of our algorithms.

Categories and Subject Descriptors: H.2.0 [Database Management]: General; H.2.4 [Database Management]: Systems - query processing; H.2.6 [Database Management]: Database Machines

General Terms: Algorithms, Performance

Additional Key Words and Phrases: Hash join, join processing, large main memory, sort-merge join

1. INTRODUCTION

Database systems are gaining in popularity owing to features such as data independence, high-level interfaces, concurrency control, crash recovery, and so on. However, the greatest drawback to database management systems (other than cost) is the inefficiency of full-function database systems compared to customized programs, and one of the most costly operations in database processing is the join. Traditionally the most effective algorithm for executing a join (if there are no indices) has been sort-merge [4]. In [S] it was suggested that the existence of increasingly inexpensive main memory makes it possible to use hashing techniques to execute joins more efficiently than sort-merge. Here we extend these results.
Some of the first research on joins using hashing [14, 21] concerned multiprocessor architectures. Our model assumes a "vanilla" computer architecture, that is, a uniprocessor system available in the market today. Although the lack of parallel processing in such systems deprives us of much of the potential speed of join processing on multiprocessor systems, our algorithms can be implemented on current systems, and they avoid the complex synchronization problems of some of the more sophisticated multiprocessor algorithms.

Our algorithms require significant amounts of main memory to execute most efficiently. We assume it is not unreasonable to expect that the database system can assign several megabytes of buffer space to executing a join. (Current VAX systems can support 32 megabytes of real memory with 64K chips; and it is argued in [10] that a system can be built with existing technology that will support tens of gigabytes of main memory, and which appears to a programmer to have a standard architecture.) We will see that our algorithms are most effective when the amount of real memory available to the process is close to the size of one of the relations. The word "large" in the title refers to a memory size large enough that it is not uncommon for all, or a significant fraction, of one of the relations to be joined to fit in main memory.

Author's address: Department of Computer Science, North Dakota State University, Fargo, ND 58105.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1986 ACM 0362-5915/86/0900-0239 $00.75
This is because the minimum amount of memory required to implement our algorithms is approximately the square root of the size of one of the relations (measured in physical blocks). This allows us to process rather large relations in main memories which by today's standards might not be called large. For example, using our system parameters and 4 megabytes of real memory as buffer space, we can join two relations, using our most efficient algorithm, if the smaller of the two relations is at most 325 megabytes.

We show that with sufficient main memory, and for sufficiently large relations, the most efficient algorithms are hash-based. We present two classes of hash-based algorithms, one (simple hash) that is most efficient when most of one relation fits in main memory and another (GRACE) that is most efficient when much less of the smaller relation fits. We then describe a new algorithm, which is a hybrid of simple and GRACE, that is the most efficient of those we study, even in a virtual memory environment. This is in contrast to current commercial database systems, which find sort-merge join to be most efficient in many situations and do not implement hash joins.

In Section 2 we present four algorithms for computing an equijoin, along with their cost formulas. The first algorithm is sort-merge, modified to take advantage of large main memory. The next is a very simple use of hashing, and another is based on GRACE, the Japanese fifth-generation project's database machine [14]. The last is a hybrid of the simple and GRACE algorithms. We show in Section 3 that for sufficiently large relations the hybrid algorithm is the most efficient, including sort-merge, and we present some analytic modeling results. All our hashing algorithms are based on the idea of partitioning: we partition each relation into subsets which can (on the average) fit into main memory.
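The partitioning idea common to the hash-based algorithms can be sketched as follows. This is a simplified in-memory illustration, not the paper's exact algorithms: both relations are hashed on the join key into partitions, so matching tuples land in partitions with the same index, and each partition pair is then joined with an in-memory hash table. The `nparts` parameter is illustrative; the paper derives the number of partitions from the available memory, which is why roughly the square root of the smaller relation's size (in blocks) suffices.

```python
def partitioned_hash_join(R, S, key, nparts=4):
    """Equijoin of R and S on `key`, via hash partitioning."""
    # Phase 1: partition both relations with the same hash function,
    # so tuples with equal join keys fall into same-index partitions.
    r_parts = [[] for _ in range(nparts)]
    s_parts = [[] for _ in range(nparts)]
    for r in R:
        r_parts[hash(key(r)) % nparts].append(r)
    for s in S:
        s_parts[hash(key(s)) % nparts].append(s)

    # Phase 2: join each partition pair independently — build a hash
    # table on the R partition, probe it with the S partition. On disk,
    # each partition is sized to fit in main memory for this phase.
    out = []
    for rp, sp in zip(r_parts, s_parts):
        table = {}
        for r in rp:
            table.setdefault(key(r), []).append(r)
        for s in sp:
            for r in table.get(key(s), []):
                out.append((r, s))
    return out
```

Simple hash, GRACE, and the hybrid differ in how much of this work is done in memory versus spooled to disk partitions: simple hash keeps as much of one relation resident as possible, GRACE partitions both relations fully first, and the hybrid keeps the first partition resident while spooling the rest.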
In Section 3 we assume that all the partitions can fit into main memory, and in Section 4 discuss how to deal with the problem of partition overflow. In Section 5 we describe the effect of

