Copyright © 2007 by Robert Sedgewick and Kevin Wayne.1HashingReferences: Algorithms in Java, Chapter 14Intro to Algs and Data Structs, Section 4.2•hash functions•collision resolution•applicationsSummary of symbol-table implementationsCan we do better?2guaranteeaverage caseorderediteration?implementationsearchinsertdeletesearchinsertdeleteunordered arrayNNNN/2N/2N/2noordered arraylg NNNlg NN/2N/2yesunordered listNNNN/2NN/2noordered listNNNN/2N/2N/2yesBSTNNN1.39 lg N1.39 lg N?yesrandomized BST7 lg N7 lg N7 lg N1.39 lg N1.39 lg N1.39 lg Nyesred-black tree2 lg N2 lg N2 lg Nlg Nlg Nlg Nyes3Optimize JudiciouslyReference: Effective Java by Joshua Bloch.More computing sins are committed in the name of efficiency(without necessarily achieving it) than for any other single reason - including blind stupidity. - William A. WulfWe should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. - Donald E. KnuthWe follow two rules in the matter of optimization: Rule 1: Don't do it. Rule 2 (for experts only). Don't do it yet - that is, not until you have a perfectly clear and unoptimized solution. - M. A. Jackson4Hashing: basic planSave items in a key-indexed table. Index is a function of the key.Hash function. Method for computing table index from key.Collision resolution strategy. Algorithm and data structure to handle two keys that hash to the same index.Equality test. Method for checking whether two keys are equal.Classic space-time tradeoff.!No space limitation: trivial hash function with key as address.!No time limitation: trivial collision resolution with sequential search.!Limitations on both time and space: hashing (the real world).5hash functionscollision resolutionapplications6Implementing a good hash functionIdealistic goal: scramble the keys uniformly.!Efficiently computable.!Each table position equally likely for each key.Practical challenge: need different approach for each type of keyEx: Social Security numbers.!Bad: first three digits.!Better: last three digits.Ex: date of birth.!Bad: birth year.!Better: birthday.Ex: phone numbers.!Bad: first three digits.!Better: last three digits.573 = California, 574 = Alaskaassigned in chronological order within a given geographic regionthoroughly researched problem,still problematic in practical applications7Hash Codes and Hash FunctionsJava convention: all classes implement hashcode()hashcode() returns a 32-bit int (between -2147483648 and 2147483647)Hash function. An int between 0 and M-1 (for use as an array index)First try:Bug. Don't use (code % M) as array index 1-in-a billion bug. Don't use (Math.abs(code) % M) as array index. OK. Safe to use ((code & 0x7fffffff) % M) as array index.String s = "call";int code = s.hashCode();int hash = code % M; 7121 8191hex literal30459828Hash Codes and Hash FunctionsJava convention: all classes implement hashcode()hashcode() returns a 32-bit int (between -2147483648 and 2147483647)Hash function. An int between 0 and M-1 (for use as an array index)First try:Bug. Don't use (code % M) as array index [could be negative].1-in-a billion bug. Don't use (Math.abs(code) % M) as array index. [code could be 2147483648]OK. Safe to use ((code & 0x7fffffff) % M) as array index.String s = "call";int code = s.hashCode();int hash = code % M; 7121 8191hex literal30459829Implementing hashCode() in JavaTheoretical advantages of hashCode() convention!Ensures hashing can be used for every type of object!Allows expert implementations suited to each typeAPI for hashCode().!Return an int.!If x.equals(y) then x and y must have the same hash code.!Repeated calls to x.hashCode() must return the same value.Practical realities of hashCode() convention!Cost is an important consideration!True randomness is hard to achieveDefault implementation. Memory address of x.Customized implementations. String, URL, Integer, Date.User-defined implementations. Tricky to get right, black art.inherited from Object10A typical typeAssumption when using hashing in Java: Key type has reasonable implementation of hashCode() and equals()Ex. Phone numbers: (609) 867-5309.Problem: Need a theorem for each type of data to ensure reliability.sufficiently random?exchangeextensionpublic final class PhoneNumber{ private final int area, exch, ext; public PhoneNumber(int area, int exch, int ext) { this.area = area; this.exch = exch; this.ext = ext; } public boolean equals(Object y) { // as before } public int hashCode() { return 10007 * (area + 1009 * exch) + ext; }}11A decent hash code designJava 1.5 string library [see also Program 14.2 in Algs in Java].!Equivalent to h = 31L-1·s0 + … + 312·sL-3 + 31·sL-2 + sL-1.!Horner's method to hash string of length L: L multiplies/addsEx. Provably random? Well, no. String s = "call";int code = s.hashCode();3045982 = 99·313 + 97·312 + 108·311 + 108·310 = 108 + 31·(108 + 31·(99 + 31·(97)))ith character of sUnicodechar… …'a' 97'b' 98'c' 99… …public int hashCode(){ int hash = 0; for (int i = 0; i < length(); i++) hash = s[i] + (31 * hash); return hash;}12A poor hash code designJava 1.1 string library.!For long strings: only examines 8-9 evenly spaced characters.!Saves time in performing arithmetic…but great potential for bad collision patterns.Basic rule: need to use the whole key.http://www.cs.princeton.edu/introcs/13loop/Hello.javahttp://www.cs.princeton.edu/introcs/13loop/Hello.classhttp://www.cs.princeton.edu/introcs/13loop/Hello.htmlhttp://www.cs.princeton.edu/introcs/13loop/index.htmlhttp://www.cs.princeton.edu/introcs/12type/index.htmlpublic int hashCode(){ int hash = 0; int skip = Math.max(1, length() / 8); for (int i = 0; i < length(); i += skip) hash = (37 * hash) + s[i]; return hash;}Digression: using a hash function for data miningUse content to characterize documents.Applications!Search documents on the web for documents similar to a given one.!Determine whether a new document belongs in one set or anotherApproach!Fix order k and dimension d!Compute hashcode() % d for allk-grams in the document!Result: d-dimensional vectorprofile of each document!To compare documents:Consider angle ! separating vectors–cos ! close to 0: not similar–cos ! close to 1: similarEffective for literature, genomes, Java code, art, music, data, video13cos ! = a ! b / ⎜a⎜⎜b⎜ab!Digression: using a hash function for data
View Full Document