U.C. Berkeley — CS170: Intro to CS Theory                         Handout N9
Professor Luca Trevisan                                    September 27, 2001

Notes for Lecture 9

1  Hashing

We assume that all the basics about hash tables have been covered in 61B.

We will make the simplifying assumption that the keys that we want to hash have been encoded as integers, and that such integers are in the range 1, . . . , M. We also assume that collisions are handled using linked lists.

Suppose that we are using a table of size m, that we have selected a hash function h : {1, . . . , M} → {0, . . . , m − 1}, and that, at some point, the keys y1, . . . , yn have been inserted in the data structure, and that we want to find, insert, or delete the key x. The running time of such an operation will be big-Oh of the number of elements yi such that h(yi) = h(x).

No matter what h does, if M > m(n + 1), there will always be a worst-case situation where y1, . . . , yn, x are all mapped by h into the same bucket, and then the running time of find, insert, and delete will be Θ(n). However, in practice, hash tables work well. In order to explain the real behavior of hash tables we need to go beyond a worst-case analysis and do a probabilistic analysis.

A simple analysis can be done assuming that the keys to be inserted in the table come from the uniform distribution over {1, . . . , M}. It is not hard to come up with a function h : {1, . . . , M} → {0, . . . , m − 1} with the property that if y1, . . . , yn, x are uniformly and independently distributed over {1, . . . , M}, then h(y1), . . . , h(yn), h(x) are uniformly (or almost uniformly) distributed over {0, . . . , m − 1}. One can then argue that, on average, the number of yi such that h(x) = h(yi) is about n/m, and so an operation on x can be performed in O(1) time assuming that, say, m = 2n.

In practice, however, inputs do not come from a uniform distribution, and choices of h that make the above proof work may or may not work well if the inputs come from different distributions.

A much sounder approach is to consider the behavior of the data structure on arbitrary (worst-case) inputs, but to let randomness come into the definition of the hash function h. Then we will look at the average running time of find, insert, and delete operations, but this time the average will be over the randomness used by the algorithm, and there will be no assumption on the way the input is distributed. This kind of analysis is the subject of today's lecture.

2  Universal Hashing

We want to consider hash functions whose definition involves random choices. Equivalently, we consider families of functions, and consider the randomized process of selecting at random a function from the family.

A collection H of hash functions h : {1, . . . , M} → {0, . . . , m − 1} is said to be 2-universal if for every two different x, y ∈ {1, . . . , M} we have

    Pr_{h ∈ H}[h(x) = h(y)] ≤ 1/m.

(Note that the CLR/CLRS definition has '=' instead of '≤'.)

Consider the following construction. Fix a prime p > M. It is known that there is at least one prime (in fact, there are several) between M and 2M. A procedure for finding such a p could be to generate at random a number between M and 2M and then test it for primality. (There are efficient randomized algorithms for testing primality, but we will probably not see them in this course.)

Then define, for every a ∈ {1, . . . , p − 1} and every b ∈ {0, . . . , p − 1}, the function

    g_{a,b}(x) = ax + b (mod p).

For each such g we define a hash function

    h_{a,b}(x) = g_{a,b}(x) (mod m).

We will prove that this construction is 2-universal.

Lemma 1
For every two fixed distinct values x, y ∈ {1, . . . , M}, the number of pairs (a, b) such that h_{a,b}(x) = h_{a,b}(y) is at most p(p − 1)/m.

Proof: We will consider all possible s, t ∈ {0, . . . , p − 1} with s ≠ t and s = t (mod m), and then, for each such pair (s, t), we will count how many pairs (a, b) there are such that g_{a,b}(x) = s and g_{a,b}(y) = t. (We may restrict to s ≠ t: since a ≠ 0 and x ≠ y, we always have g_{a,b}(x) ≠ g_{a,b}(y).)

The number of choices for the pair (s, t) is at most p(⌈p/m⌉ − 1) < p(p − 1)/m, since once s is fixed there are at most ⌈p/m⌉ − 1 values t ≠ s with t = s (mod m).

For each pair (s, t) we are now asking how many values a ∈ {1, . . . , p − 1} and b ∈ {0, . . . , p − 1} satisfy the system

    ax + b = s (mod p)
    ay + b = t (mod p)

Since p is prime and x ≠ y, the solution is unique: subtracting the two equations gives a(x − y) = s − t (mod p), which determines a (and a ≠ 0 because s ≠ t), and then b = s − ax (mod p). So the number of pairs (a, b) for which h_{a,b}(x) = h_{a,b}(y) is at most p(p − 1)/m. □

Since there are p(p − 1) functions in our family, the probability that h_{a,b}(x) = h_{a,b}(y) is at most 1/m, and so our family is indeed 2-universal.

3  Hashing with 2-Universal Families

Suppose now that we pick at random h from a family of 2-universal hash functions, and we build a hash table by inserting elements y1, . . . , yn. Then we are given a key x that we want to find, insert, or delete from the table. What will the expected number of collisions between x and y1, . . . , yn be?

Lemma 2
Let H be a 2-universal family of hash functions mapping {1, . . . , M} into {0, . . . , m − 1}, and let x, y1, . . . , yn be elements of {1, . . . , M}, with x different from each yi. If we pick h at random from H, then the average number of elements yi such that h(x) = h(yi) is at most n/m; in symbols,

    E[|{i : h(x) = h(yi)}|] ≤ n/m.

Proof: Call C the random variable (depending on the choice of h) that counts the number of collisions between h(x) and h(y1), . . . , h(yn), i.e., C = |{i : h(yi) = h(x)}|. For a key y, call C_y the random variable (depending on the choice of h) that is 1 if h(x) = h(y), and 0 otherwise. Then for every y ≠ x,

    E[C_y] = 0 · Pr[h(x) ≠ h(y)] + 1 · Pr[h(x) = h(y)] = Pr[h(x) = h(y)] ≤ 1/m.

Since C = C_{y1} + · · · + C_{yn}, by linearity of expectation E[C] = E[C_{y1}] + · · · + E[C_{yn}] ≤ n/m. □

So, if we choose m = Θ(n), each operation has O(1) average running time.
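The construction of Section 2 can be sketched in code. The following Python sketch is the editor's illustration, not part of the original notes: the names random_hash, find_prime_above, and is_prime are made up, and trial-division primality testing stands in for the efficient randomized test the notes allude to.

```python
import random

def is_prime(n):
    """Trial division -- adequate for the small M used in this sketch."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def find_prime_above(M):
    """Pick random candidates in (M, 2M] until one is prime; Bertrand's
    postulate guarantees at least one prime exists in that range."""
    while True:
        p = random.randint(M + 1, 2 * M)
        if is_prime(p):
            return p

def random_hash(M, m):
    """Draw h_{a,b}(x) = ((a*x + b) mod p) mod m uniformly at random from
    the 2-universal family of Section 2, with a != 0."""
    p = find_prime_above(M)
    a = random.randint(1, p - 1)
    b = random.randint(0, p - 1)
    return lambda x: ((a * x + b) % p) % m
```

Sampling many functions from the family and checking two fixed keys x ≠ y should show a collision rate of at most about 1/m, matching the 2-universality bound.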
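The uniqueness step in the proof of Lemma 1 can be made concrete: eliminating b from the system ax + b = s, ay + b = t (mod p) pins down a, and then b. A minimal sketch (the function name solve_ab is the editor's, for illustration):

```python
def solve_ab(x, y, s, t, p):
    """Return the unique pair (a, b) with a*x + b = s and a*y + b = t
    (mod p), for p prime and x != y.  Subtracting the two equations gives
    a*(x - y) = s - t (mod p); since p is prime, (x - y) is invertible
    mod p, so a is determined, and then b = s - a*x (mod p)."""
    inv = pow(x - y, -1, p)  # modular inverse (three-argument pow, Python 3.8+)
    a = ((s - t) * inv) % p
    b = (s - a * x) % p
    return a, b
```

Note that s ≠ t forces a ≠ 0, which is why the proof only needs to count pairs (s, t) with s ≠ t.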
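Lemma 2 can also be checked empirically. The sketch below is a self-contained illustration by the editor; the parameters p = 101, m = 10 and the key set used in the usage note are arbitrary choices, and the estimate is a sample average over random draws of (a, b).

```python
import random

def avg_collisions(p, m, keys, x, trials=3000):
    """Estimate E[|{i : h(x) = h(y_i)}|] for h drawn at random from the
    family h_{a,b}(z) = ((a*z + b) mod p) mod m.  By Lemma 2 this
    expectation is at most n/m, where n = len(keys)."""
    total = 0
    for _ in range(trials):
        a = random.randint(1, p - 1)
        b = random.randint(0, p - 1)
        # Bind a, b as defaults so each trial uses its own hash function.
        h = lambda z, a=a, b=b: ((a * z + b) % p) % m
        total += sum(1 for y in keys if h(y) == h(x))
    return total / trials
```

For example, with keys 1, . . . , 50, x = 51, p = 101 and m = 10 buckets, the estimate should come out near (and, up to sampling noise, at most) n/m = 5.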

