Join Processing in Database Systems with Large Main Memories LEONARD D SHAPIRO North Dakota State University We study algorithms for computing the equijoin of two relations in B system with a standard architecturehut with largeamountsof main memory Our algorithmsare especiallyefficientwhen the mainmemoryavailableis B significantfractionof the sizeof oneof the relationsto hejoined hut theycanbeappliedwheneverthereis memoryequalto approximatelythe 8qumeroot of the size of one relation We presenta newalgorithmwhich is B hybrid of two hash based algorithmsand whichdominatesthe other algorithmawepresent includingsort merge Evenin B virtual memory environment the hybridalgorithmdominatesall the otherswestudy Finally wedescribehowthreepopulartoolsto increasethe efficiencyofjoins namelyfilters Babb arrays andsemijoins canhegraftedontoany of our algorithms Categoriesand SubjectDescriptors H 2 0 Database Management General H 2 4 Database Management Systems queryprocessing H 2 6 Database Management Database Machines GeneralTerms Algorithms Performance AdditionalKeyWordsand Phrases Hashjoin join processing largemainmemory sort merge join 1 INTRODUCTION Database systems are gaining in popularity owing to features such as data independence high level interfaces concurrency control crash recovery and so on However the greatest drawback to database management systems other than cost is the inefficiency of full function database systems compared to customized programs and one of the most costly operations in database processing is the join Traditionally the most effective algorithm for executing a join if there are no indices has been sort merge 4 In S it wae suggested that the existence of increasingly inexpensive main memory makes it possible to use hashing techniques to execute joins more efficiently than sort merge Here we extend these results Some of the first research on joins using hashing 14 211 concerned multiprocessor architectures Our model assumes a vanilla computer architecture that is a uniprocessor system available in the market today Although the lack of parallel processing in such systems deprives us of much of the potential speed of Author saddress Departmentof ComputerScience North DakotaStateUniversity Fargo ND 58105 Permissionto copywithoutfeeall or part of this materialis grantedprovidedthat the copiesarenot madeor distributedfor directcommercialadvantage the ACM copyrightnoticeandthe title of the publicationandits dateappear andnoticeis giventhat copyingis hy permissionof the Association for ComputingMachinery To copy otherwise or to republish require a fee and or specific permission 0 986 ACM 0362 5915 86 0900 0239 00 75 240 Leonard D Shapiro join processing on multiprocessor systems our algorithms can he implemented on current systems and they avoid the complex synchronization problems of some of the more sophisticated multiprocessor algorithms Our algorithms require significant amounts of main memory to execute most efficiently We assume it is not unreasonable to expect that the database system can assign several megabytes of buffer space to executing a join Current VAX systems can support 32 megabytes of real memory with 64K chips a and it is argued in lo that a system can he built with existing technology that will support tens of gigabytes of main memory and which appears to a programmer to have a standard architecture We will see that our algorithms are most effective when the amount of real memory available to the process is close to the size of one of the relations The word large in the title refers to a memory size large enough that it is not uncommon for all or a significant fraction of one of the relations to be joined to fit in main memory This is because the minimum amount of memory required to implement our algorithms is approximately the square root of the size of one of the relations measured in physical blocks This allows us to process rather large relations in main memories which by today s standards might not he called large For example using our system parameters and 4 megabytes of real memory as buffer space we can join two relations using our most efficient algorithm if the smaller of the two relations is at most 325 megabytes We show that with sufficient main memory and for sufficiently large relations the most efficient algorithms are hash based We present two classes of hashbased algorithms one simple hash that is most efficient when most of one relation fits in main memory and another GRACE that is most efficient when much less of the smaller relation fits We then describe a new algorithm which is a hybrid of simple and GRACE that is the most efficient of those we study even in a virtual memory environment This is in contrast to current commercial database systems which find sort merge join to be most efficient in many situations and do not implement hash joins In Section 2 we present four algorithms for computing an equijoin along with their cost formulas The first algorithm is sort merge modified to take advantage of large main memory The next is a very simple use of hashing and another is based on GRACE the Japanese fifth generation project s database machine 14 The last is a hybrid of the simple and GRACE algorithms We show in Section 3 that for sufficiently large relations the hybrid algorithm is the most efficient inclusive of sort merge and we present some analytic modeling results All our hashing algorithms are based on the idea of partitioning we partition each relation into subsets which can on the average fit into main memory In Section 3 we assume that all the partitions can fit into main memory and in Section 4 discuss how to deal with the problem of partition overflow In Section 5 we describe the effect of using virtual memory in place of some main memory In Section 6 we discuss how to include in our algorithms other tools that have become popular in database systems namely selection filters semijoin strategies and Babb arrays A description of these algorithms similar to that in Section 2 and analytic modeling results similar to those in the second half of Section 3 appeared in 6 ACMTrsnsactions onDatabase System Vu 11 No 3 September 1986 Join Processing in Database Systems 241 In 151 it is shown that hashing is preferable to nested loop and sort merge algorithms for a variety of relational algebra operations results consistent with those we present In 7 the results of 6 are extended to the multiprocessor environment and
View Full Document