Berkeley COMPSCI 186 - Implementation of Relational Operations

Unformatted text preview:

Implementation ofRelational Operations(Part 2) R&G - Chapters 12 and 14An Alternative to Sorting: Hashing!• Idea:– Many of the things we use sort for don’t exploit theorder of the sorted data– e.g.: removing duplicates in DISTINCT– e.g.: finding matches in JOIN• Often good enough to match all tuples with equalvalues• Hashing does this!– And may be cheaper than sorting! (Hmmm…!)– But how to do it for data sets bigger than memory??General Idea• Two phases:– Partition: use a hash function h to split tuples intopartitions on disk.• Key property: all matches live in the same partition.– ReHash: for each partition on disk, build a main-memory hash table using a hash function h2Two Phases• Partition:• Rehash:PartitionsHash table for partitionRi (<= B pages)B main memory buffersDiskResulthashfnh2B main memory buffersDiskDiskOriginal RelationOUTPUT2INPUT1hashfunctionhB-1Partitions12B-1. . .DuplicateEliminationusing Hashing• read one bucketat a time• for each group ofidentical tuples,output onePartitionsHash table for partitionRi (<= B pages)B main memory buffersDiskResulthashfnh2B main memory buffersDiskDiskOriginal RelationOUTPUT2INPUT1hashfunctionhB-1Partitions12B-1. . .Cost of External Hashingcost = 4*[R] IO’sMemory Requirement• How big of a table can we hash in two passes?– B-1 “partitions” result from Phase 0– Each should be no more than B pages in size– Answer: B(B-1).Said differently: We can hash a table of size N pages in about space–Note: assumes hash function distributes records evenly!• Have a bigger table? Recursive partitioning!! NHow does this compare withexternal sorting?SortingCost of Externalcost = 4*[R] IO’sHashingSortingCost of Externalcost = 4*[R] IO’sMemory Requirement forExternal Sorting• How big of a table can we sort in two passes?– Each “sorted run” after Phase 0 is of size B– Can merge up to B-1 sorted runs in Phase 1– Answer: B(B-1).Said differently: We can sort a table of size N pages in about space• Have a bigger table? Additional merge passes!! NSo which is better ??• Based on our simple analysis:– Same memory requirement for 2 passes– Same IO cost• Digging deeper …• Sorting pros:– Great if input already sorted (or almost sorted)– Great if need output to be sorted anyway– Not sensitive to “data skew” or “bad” hash functions• Hashing pros:– Highly parallelizable (will discuss later in semester)• So is sorting, with some work– Can exploit extra memory to reduce # IOs (stay tuned…)Q: Can we use hashing for JOIN ?before we optimize hashing further …Hash JoinPartitionsof R & SInput bufferfor SiHash table for partitionRi (B-2 pages)B main memory buffersDiskOutput bufferDiskJoin Resulthashfnh2h2B main memory buffersDiskDiskOriginal RelationOUTPUT2INPUT1hashfunctionhB-1Partitions12B-1. . .Cost of Hash Join• Partitioning phase: read+write both relations⇒ 2([R]+[S]) I/Os• Matching phase: read both relations, write output⇒ [R]+[S] + [output] I/Os• Total cost of 2-pass hash join = 3([R]+[S])+[output]Q: what is cost of 2-pass sort join?Q: how much memory needed for 2-pass sort join?Q: how much memory needed for 2-pass hash join?• Have B memory buffers• Want to hash relation of size NAn important optimization to hashingN# passesBB212If B < N < B2, will have unused memory …cost N3N1Hybrid Hashing• Idea: keep one of the hash buckets in memory!B main memory buffersDiskDiskOriginal RelationOUTPUT3INPUT2hB-kPartitions23B-k. . .h3k-buffer hashtableQ: how do we choose the value of k?Cost reduction due to hybrid hashing• Now:N# passesBB212cost N3NSummary: Hashing vs. Sorting• Sorting pros:– Good if input already sorted, or need output sorted– Not sensitive to data skew or bad hash functions• Hashing pros:– Often cheaper due to hybrid hashing– For join: # passes depends on size of smaller relation– Highly


View Full Document

Berkeley COMPSCI 186 - Implementation of Relational Operations

Documents in this Course
Load more
Download Implementation of Relational Operations
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Implementation of Relational Operations and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Implementation of Relational Operations 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?