Quick Review of material covered Apr 8 B Tree Overview and some definitions balanced tree multi level reorganizes itself on insertion and deletion built so each node fits on a single disk page Examined mechanics of B tree Insertion and Deletion looked at several examples We ll finish up B trees with two more concepts B tree File Organization B tree index files B tree File Organization B Tree Indices solve the problem of index file degradation The original data file will still degrade upon a stream of insert delete operations Solve data file degradation by using a B tree file organization Leaf nodes in a B tree file organization store records not pointers into a separate original datafile since records are larger than pointers the maximum number of recrods that can be stored in a leaf node is less than the number of pointers in a non leaf node leaf nodes must still be maintained at least half full insert and delete are handled in the same was as insert and delete for entries in a B tree index B tree File Organization Example Records are much bigger than pointers so good space usage is important To improve space usage involve more sibling nodes in redistribution during splits and merges to avoid split merge when possible involving one sibling guarantees 50 space use involving two guarantees at least 2 3 space use etc B tree Index Files B trees are similar to B trees but search key values appear only once in the index eliminates redundant storage of key values search keys in non leaf nodes don t appear in the leaf nodes so an additional pointer field for each search key in a non leaf node must be stored to point to the bucket or record for that key value leaf nodes look like B tree leaf nodes P1 K1 P2 K2 Pn non leaf nodes look like so P1 B1 K1 P2 B2 K2 Pn where the Bi are pointers to buckets or file records B tree Index File Example B tree and B tree B tree Index Files cont Advantages of B tree Indices vs B trees May use less tree nodes than a B tree on the same data Sometimes possible to find a specific key value before reaching a leaf node Disadvantages of B tree Indices Only a small fraction of key values are found early Non leaf nodes are larger so fanout is reduced and B trees may be slightly taller than B trees on the same data Insertion and deletion are more complicated than on B trees Implementation is more difficult than B trees In general advantages don t outweigh disadvantages Hashing We ve examined Ordered Indices design based upon sorting or ordering search key values the other type of major indexing technique is Hashing Underlying concept is very simple observation small files don t require indices or complicated search methods use some clever method based upon the search key to split a large file into a lot of little buckets each bucket is sufficiently small use the same method to find the bucket for a given search key Hashing Basics A bucket is a unit of storage containing one or more records typically a bucket is one disk block in size In a hash file organization we find the bucket for a record directly from its search key value using a hash function A hash function is a function that maps from the set of all searchkey values K to the set of all bucket addresses B The hash function is used to locate records for access insertion and deletion Records with different search key values may be mapped to the same bucket the entire bucket must be searched to find a record buckets are designed to be small so this task is usually not onerous Hashed File Example So we divide the set of disk blocks that make up the file into buckets devise a hash function that maps each key value into a bucket V set of key values B number of buckets H hashing function H V 0 1 2 3 B 1 Example V 9 digit SS B 1000 H key modulo 1000 Hash Functions To search insert delete modify a key do compute H k to get the bucket number search sequentially in the bucket heap organization within each bucket Choosing H almost any function that generates random numbers in the range 0 B 1 try to distribute the keys evenly into the B buckets one rule of thumb when using MOD use a prime number Hash Functions 2 Collision is when two or more key values go to the same bucket too many collisions increases search time and degrades performance no or few collisions means that each bucket has only one or very few key s Worst case hash functions map all search keys to the same bucket Hash Functions 3 Ideal hash functions are uniform each bucket is assigned the same number of search key values from the set of all possible values Ideal hash functions are random each bucket has approximately the same number of records assigned to it irrespective of the actual distribution of search key values in the file Finding a good hash function is not always easy Examples of Hash Functions Given 26 buckets and a string valued search key consider the following possible hash functions Hash based upon the first letter of the string Hash based upon the last letter of the string Hash based upon the middle letter of the string Hash based upon the most common letter in the string Hash based upon the average letter in the string the sum of the letters using A 0 B 1 etc divided by the number of letters Hash based upon the length of the string modulo 26 Typical hash functions perform computation on the internal binary representation of the search key example searching on a string value hash based upon the binary sum of the characters in the string modulo the number of buckets
View Full Document
Unlocking...