Princeton University • COS 226 • Algorithms and Data Structures • Spring 2004 • Kevin Wayne • http://www.Princeton.EDU/~cos226HashingReference: Chapter 14, Algorithms in Java, 3rdEdition, Robert Sedgewick.Hash functionsSeparate chainingLinear probingDouble hashing2Optimize JudiciouslyReference: Effective Javaby Joshua Bloch."More computing sins are committed in the name of efficiency (without necessarily achieving it) than for any other single reason - including blind stupidity." - William A. Wulf"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." - Donald E. Knuth"We follow two rules in the matter of optimization:Rule 1: Don't do it.Rule 2 (for experts only). Don't do it yet - that is, not until you havea perfectly clear and unoptimized solution."- M. A. Jackson3Hashing: Basic Plan.Save items in a key-indexed table. Index is a function of the key.Hash function. Method for computing table index from key.Collision resolution strategy. Algorithm and data structure to handle two keys that hash to the same index.Classic space-time tradeoff.n No space limitation: trivial hash function with key as address.n No time limitation: trivial collision resolution = sequential search.n Limitations on both time and space: hashing (the real world)4Choosing a Good Hash FunctionGoal: scramble the keys.n Each table position equally likely for each key.Ex: Social Security numbers.n Bad: first three digits.n Better: last three digits.Ex: date of birth.n Bad: first three digits of birth year.n Better: birthday.Ex: phone numbers.n Bad: first three digits.n Better: last three digits.573 = California, 574 = Alaskaassigned in chronological order within a given geographic regionthoroughly researched problem198 for all of youless collisions even with only366 possible values5Hash Function: String KeysStrings hash functions.n Java 1.1: calculation involving only 16 characters.n Java 1.2: calculation involving all characters.n Equivalent to h = 31N-1s0+ . . . + 312s2+ 31s1+ sN-1.n Can we use h % M as index for table of size M?Work to hash a string of length W.n W add, W multiply, 1 mod.n Note: reference Java implementation caches String hash codes.public int hashCode() {int hash = 0;for (int i = 0; i < length(); i++)hash = (31 * hash)+charAt(i);return hash;}String.javas = "call";h = s.hashCode();hash = h % M; 304598271218191Horner's method6CollisionsCollision = two keys hashing to same value.n Essentially unavoidable.n Birthday problem: how many people will have to enter a room until two have the same birthday? 23n With M hash values, expect a collision after sqrt(p M/2) insertions.Conclusion: can't avoid collisions unlessyou have a ridiculous amount of memory.Challenge: efficiently cope with collisions.7Collision Resolution.Two main approaches.Separate chaining.n M much smaller than N.n ~N / M keys per table position.n Put keys that collide in a list.n Need to search lists.Open addressing.n M much larger than N.n plenty of empty table slots.n When a new key collides, find nextempty slot and put it there.n Complex collision patterns.jocularly seriouslylistenbrowsingst[0]st[1]st[2]st[8190]suburban untravelledst[3]consideratingnullM = 8191, N = 15000 listenbrowsingst[2]st[3]st[4]st[8190]suburbanst[5]nulljocularlyseriouslyst[0]st[1]nullM = 30001, N = 15000 8Separate ChainingSeparate chaining: array of M linked lists.n Hash: map key to integer i between 0 and M-1.n Insert: put at front of ith chain.n Search: only need to search ithchain.jocularly seriouslylistenbrowsingst[0]st[1]st[2]st[8190]3untravelled3suburban5017ishmael0seriously.. . .34807121hashmecallkeysuburban untravelledst[3] consideratingnullM = 8191constant timeproportional to length of chain9Symbol Table: Hash Table Implementationpublic class SymbolTable {private int M = 8191;private List[] st = new List[M];private class List { AS BEFORE }public static int hash(String s, int M) {return (s.hashCode() & 0x7fffffff)% M;}void put(String k, Object val) {int i = hash(k, M);st[i]=new List(k, val, st[i]);} Object get(String k) {int i = hash(k, M);for (List x = st[i]; x != null; x = x.next)if (k.equals(x.key)) return x.value;return null;}}number of chains (8191 is prime)insert at front of ithchainexhaustively search ithchain for keybetween 0 and M-1hex10Hash Table Implementation: PerformanceAdvantages: fast insertion, fast search.Disadvantage: hash table has fixed size.Hash tables improves ALL symbol table clients.n Makes difference between practical solution and no solution.n Ex: Moby Dick now takes a few seconds instead of hours.% java DeDup < mobydick.txtmobydickhermanmelvillecallmeishmaelsomeyearsago. . .210,028 words16,834 distinctcorrected by doubling the size of the array and rehashing all of the key-value pairs11Separate Chaining PerformanceSeparate chaining performance.n Search cost is proportional to length of chain.n Trivial: average length = N / M.n Worst case: all keys hash to same chain.Theorem. Let a = N / M > 1 be average length of list. For any t > 1, probability that list length > t a is exponentially small in t.Parameters.n M too large Þ too many empty chains.n M too small Þ chains too long.n Typical choice: a = N / M ~ 10 Þ constant-time search/insert.depends on hash map being random map12Symbol Table: Implementations Cost SummarySorted arrayImplementationUnsorted listlog NSearchNNInsert1log NSearchN / 2N / 2Insert1N / 2Delete1Worst Case Average CaseNDelete1Hashing N 1 1* 1* 1*N* assumes hash function is random13Linear ProbingLinear probing: array of size M.n Hash: map key to integer i between 0 and M-1.n Insert: put in slot i if free, if not try i+1, i+2, etc.n Search: search slot i, if occupied but no match, try i+1, i+2, etc.Cluster.n Contiguous block of items.n Search through cluster using elementary algorithm for arrays.typically twice as many slots as elements14Linear Probing PerformanceLinear probing performance.n Insert and search cost depend on length of cluster.n Trivial: average length of cluster = a = N / M.n Worst case: all keys hash to same cluster.Theorem (Knuth, 1962). Let a = N / M < 1 be average length of list.Parameters.n M too large Þ too many empty array entries.n M too small Þ clusters coalesce.n Typical choice: M ~ 2N Þ constant-time search/insert.depends on hash map being random mapbut elements more likely to hash to big clusters15Double
View Full Document