3/4/20081Introduction to Algorithms6.046J/18.401JLecture 8 - HashingProf. Manolis Kellis3/4/2008 L8.2© Leiserson, Indyk, KellisHashing lecture outlineIntro and definitionHashing in practiceUniversal hashingPerfect hashingOpen Addressing (optional)3/4/2008 L8.3© Leiserson, Indyk, KellisData Structures• Role of data structures:– Encapsulate data– Support certain operations (e.g., INSERT, DELETE, SEARCH)• Our focus: efficiency of the operations• Algorithms vs. data structures3/4/2008 L8.4© Leiserson, Indyk, KellisSymbol-table problemSymbol table T holding n records:key[x]key[x]recordxOther fields containing satellite dataOperations on T:• INSERT(T, x)• DELETE(T, x)• SEARCH(T, k)How should the data structure T be organized?3/4/2008 L8.5© Leiserson, Indyk, KellisDirect-access tableIDEA: Suppose that the set of keys is K ⊆ {0, 1, …, m–1}, and keys are distinct. Set up an array T[0 . . m–1]: T[k] =x if k ∈ K and key[x] = k,NIL otherwise.Then, operations take Θ(1) time.Problem: The range of keys can be large. Eg.:• 64-bit numbers (which represent 18,446,744,073,709,551,616 different keys)• Character strings (even larger!).3/4/2008 L8.6© Leiserson, Indyk, KellisAs each key is inserted, h maps it to a slot of T.Hash functionsSolution: Use a hash function h to map the universe U of all keys into{0, 1, …, m–1}:UKk1k2k3k4k50m–1h(k1)h(k4)h(k2)h(k3)When a record to be inserted maps to an already occupied slot in T, a collision occurs.T= h(k5)3/4/200823/4/2008 L8.7© Leiserson, Indyk, KellisResolving collisions by chaining• Records in the same slot are linked into a list.h(49) = h(86) = h(52) = iT494986865252i• Run time analysis requires key assumption:3/4/2008 L8.8© Leiserson, Indyk, KellisSimple uniform hashingWe make the assumption of simple uniform hashing:• Each key k ∈ K of keys is equally likely to be hashed to any slot of table T, independent of where other keys are hashed.Let n be the number of keys in the table, and let m be the number of slots.Define the load factor of T to beα = n/m= average number of keys per slot.3/4/2008 L8.9© Leiserson, Indyk, KellisSearch cost for chaining under simple uniform hashingExpected time to search for a record with a given key = Θ(1 + α).apply hash function and access slotsearch the listExpected search time = Θ(1) if α = O(1), or equivalently, if n = O(m).3/4/2008 L8.10© Leiserson, Indyk, KellisIntro and definitionHashing in practiceUniversal hashingPerfect hashingOpen Addressing (optional)3/4/2008 L8.11© Leiserson, Indyk, KellisChoosing a hash functionThe assumption of simple uniform hashing is hard to guarantee, but several common techniques tend to work well in practice as long as their deficiencies can be avoided.Desirata:• A good hash function should distribute the keys uniformly into the slots of the table.• Regularity in the key distribution should not affect this uniformity.3/4/2008 L8.12© Leiserson, Indyk, Kellish(k)1. Division methodAssume all keys are integers, and defineh(k) = k mod m.Extreme deficiency: If m = 2r, then the hash doesn’t even depend on all the bits of k:• If k = 10110001110110102and r = 6, then h(k) = 0110102 .Deficiency: Don’t pick an m that has a small divisor d. A preponderance of keys that are congruent modulo d can adversely affect uniformity.3/4/200833/4/2008 L8.13© Leiserson, Indyk, Kellis1. Division method (continued)h(k) = k mod m.Pick m to be a prime not too close to a power of 2 or 10 and not otherwise used prominently in the computing environment.Annoyance:• Sometimes, making the table size a prime is inconvenient.But, this method is popular, although the next method we’ll see is usually superior.3/4/2008 L8.14© Leiserson, Indyk, Kellis2. Multiplication methodAssume that all keys are integers, m = 2r, and our computer has w-bit words. Define h(k) = (A·k mod 2w) rsh (w – r),where rsh is the “bit-wise right-shift” operator and A is an odd integer in the range 2w–1< A < 2w.• Don’t pick A too close to 2w.• Multiplication modulo 2wis fast.• The rsh operator is fast.3/4/2008 L8.15© Leiserson, Indyk, Kellis40352617Modular wheel2. Multiplication method exampleh(k) = (A·k mod 2w) rsh (w – r)Suppose that m = 8 = 23and that our computer has w = 7-bit words:1 0 1 1 0 0 1× 1 1 0 1 0 1 11 0 0 1 0 1 0 0 1 1 0 0 1 1= A= kh(k)A..2A..3A..Don’t pick A too close to 2w(explore more of the wheel space)3/4/2008 L8.16© Leiserson, Indyk, Kellis2. Multiplication method example1101011110101111010111101011110101111010111101011001100111010011 011 001h(k)A=k=h(k) = (A·k mod 2w) rsh (w – r)wNever computed(mult. mod 2w)rDon’t pick A too close to 2w(have balance of 1s and 0s)3/4/2008 L8.17© Leiserson, Indyk, Kellis3. Dot-product methodRandomized strategy:•Let m be prime. • Decompose k into r+1digits, each in {0,1,..m-1}. • Pick random vector a, similarly decomposed (each aichosen randomly from {0,1,..m-1} ). • Calculate dot product k·a, each multiplication mod m• Excellent in practice, but expensive to compute.mkakhriiiamod)(0∑==…krkr-1k2k1k0…arar-1a2a1a0…arkr…a2k2a1k1a0k0Each value 0 ≤ ki< m, with m prime.·m·m·m·m·m3/4/2008 L8.18© Leiserson, Indyk, KellisIntro and definitionHashing in practiceUniversal hashingPerfect hashingOpen Addressing (optional)3/4/200843/4/2008 L8.19© Leiserson, Indyk, KellisA weakness of hashingProblem: For any hash function h, a set of keys exists that can cause the average access time of a hash table to skyrocket.IDEA: Choose the hash function at random, independently of the keys.• Even if an adversary can see your code, they cannot find a bad set of keys, since they don’t know exactly which hash function will be chosen.• Adversary can pick all keys st. {k∈U:h(k)=i}for some slot i (e.g. denial-of-service attacks).3/4/2008 L8.20© Leiserson, Indyk, KellisUniversal hashing: Definition• Let U be a universe of keys. • Let H be a finite collection of hash functions, each mapping U to {0, 1, …, m–1}.• H is universal if for all x, y ∈ U, where x ≠ y, we have |{h ∈H: h(x) = h(y)}| = |H|/m, i.e. only 1/m of hash functions in H result in x,y collision. • Thus, if we choose hrandomly from H, the chance of a collision between x and y is 1/mH{h : h(x) = h(y)}|H|m3/4/2008 L8.21© Leiserson, Indyk, KellisUniversality is goodTheorem. Let h be a hash function chosen (uniformly) at random from a universal set Hof hash functions. Suppose h is used to hash n
or
We will never post anything without your permission.
Don't have an account? Sign up