Searching, Maps,Tries (hashing)From Google to MapsInterface at work: MapDemo.javaAccessing values in a map (e.g., print)External IteratorHashing: Log (10100) is a big numberHashing detailsChaining with hashingHashing problemsWhat about hash functionsTrie: efficient search words/suffixesTrie picture and code (see Trie.java)Guy L. Steele, Jr.CPS 10011.1Searching, Maps,Tries (hashing)Searching is a fundamentally important operationWe want to search quickly, very very quicklyConsider searching using Google, ACES, issues?In general we want to search in a collection for a keyWe've searched using trees and arraysTree implementation was quick: O(log n) worst/average?Arrays: access is O(1), search is slowerIf we compare keys, log n is best for searching n elementsLower bound is (log n), provableHashing is O(1) on average, not a contradiction, why?Tries are O(1) worst-case!! (ignoring length of key)CPS 10011.2From Google to MapsIf we wanted to write a search engine we’d need to access lots of pages and keep lots of dataGiven a word, on what pages does it appear?This is a map of words->web pagesIn general a map associates a key with a valueLook up the key in the map, get the valueGoogle: key is word/words, value is list of web pagesAnagram: key is string, value is words that are anagramsInterface issuesLookup a key, return boolean: in map or value: associated with the key (what if key not in map?)Insert a key/value pair into the mapCPS 10011.3Interface at work: MapDemo.javaKey is a string, Value is # occurrences Interface in code below shows how Map class works while (scanner.hasNext()) { String s = (String) scanner.next();. Counter c = (Counter) map.get(s); if (c != null) c.increment(); else m.put(s, new Counter()); }What clues are there for prototype of map.get and map.put?What if a key is not in map, what value returned?What kind of objects can be put in a map?CPS 10011.4Accessing values in a map (e.g., print)Access every key in the map, then get the corresponding valueGet an iterator of the set of keys: keySet().iterator()For each key returned by this iterator call map.get(key) …Get an iterator over (key,value) pairs, there's a nested class called Map.Entry that the iterator returns, accessing the key and the value separately is then possibleTo see all the pairs use entrySet().iterator()CPS 10011.5External IteratorThe Iterator interface access elementsSource of iterator makes a difference: cast required? Iterator it = map.keySet().iterator(); while (it.hasHasNext()){ Object value = map.get(it.next()); } Iterator it2 = map.entrySet().iterator(); while (it2.hasNext()){ Map.Entry me = (Map.Entry) it.next(); Object value = me.getValue(); }CPS 10011.6Hashing: Log (10100) is a big numberComparison based searches are too slow for lots of dataHow many comparisons needed for a billion elements?What if one billion web-pages indexed?Hashing is a search method: average case O(1) searchWorst case is very bad, but in practice hashing is goodAssociate a number with every key, use the number to store the key•Like catalog in library, given book title, find the bookA hash function generates the number from the keyGoal: Efficient to calculateGoal: Distributes keys evenly in hash tableCPS 10011.7Hashing detailsThere will be collisions, two keys will hash to the same valueWe must handle collisions, still have efficient searchWhat about birthday “paradox”: using birthday as hash function, will there be collisions in a room of 25 people? Several ways to handle collisions, in general array/vector usedLinear probing, look in next spot if not found•Hash to index h, try h+1, h+2, …, wrap at end•Clustering problems, deletion problems, growing problemsQuadratic probing•Hash to index h, try h+12, h+22 , h+32 , …, wrap at end•Fewer clustering problemsDouble hashing•Hash to index h, with another hash function to j•Try h, h+j, h+2j, …0 1 2 3 n-1CPS 10011.8Chaining with hashingWith n buckets each bucket stores linked listCompute hash value h, look up key in linked list table[h]Hopefully linked lists are short, searching is fastUnsuccessful searches often faster than successful•Empty linked lists searched more quickly than non-emptyPotential problems?Hash table detailsSize of hash table should be a prime numberKeep load factor small: number of keys/size of tableOn average, with reasonable load factor, search is O(1)What if load factor gets too high? Rehash or other methodCPS 10011.9Hashing problemsLinear probing, hash(x) = x, (mod tablesize)Insert 24, 12, 45, 14, delete 24, insert 23 (where?)Same numbers, use quadratic probing (clustering better?)What about chaining, what happens?0 1 2 3 4 5 6 7 8 9 100 1 2 3 4 5 6 7 8 9 10241245 1424124514CPS 10011.10What about hash functionsHashing often done on strings, consider two alternativespublic static int hash(String s){ int k, total = 0; for(k=0; k < s.length(); k++){ total += s.charAt(k); } return total;}Consider total += (k+1)*s.charAt(k), why might this be better?Other functions used, always mod result by table sizeWhat about hashing other objects?Need conversion of key to index, not always simpleEver object contains hashCode()!CPS 10011.11Trie: efficient search words/suffixesA trie (from retrieval, but pronounced “try”) supportsInsertion: put string into trie (delete and look up)These operations are O(size of string) regardless of how many strings are stored in the trie! Guaranteed!In some ways a trie is like a 128 (or 26 or alphabet-size) tree, one branch/edge for each character/letterNode stores branches to other nodesNode stores whether it ends the string from root to itExtremely useful in DNA/string processingVery useful for matching suffixes: suffix treeCPS 10011.12Trie picture and code (see Trie.java)To add stringStart at root, for each char create node as needed, go down tree, mark last nodeTo find stringStart at root, follow links•If null, not foundCheck word flag at endTo print all nodesVisit every node, build string as nodes traversedWhat about union and intersection?acprnsaraha os tc dIndicates word ends hereCPS 10011.13Guy L. Steele, Jr. Co-invented/developed Scheme,
View Full Document