DOC PREVIEW
CMU CS 15122 - Lecture Notes on Hash Tables

This preview shows page 1-2 out of 5 pages.

Save
View full document
Premium Document
Do you want full access? Go Premium and unlock all 5 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Lecture Notes on Hash Tables 15 122 Principles of Imperative Computation Frank Pfenning Lecture 11 September 28 2010 1 Introduction In this lecture we introduce so called associative arrays that is data structures that are similar to arrays but are not indexed by integers but other forms of data such as strings One popular data structures for the implementation of associative arrays are hash tables To analyze the asymptotic efficiency of hash tables we have to explore a new point of view that of average case complexity Another computational thinking concept that we see for the first time is randomness In order for hash tables to work efficiently in practice we need hash functions whose behavior is predictable deterministic but has some aspects of randomness 2 Associative Arrays Arrays can be seen as a mapping associating with every integer in a given interval some data item It is finitary because its domain and therefore also its range is finite There are many situations when we want to index elements differently than just by integers Common examples are strings for dictionaries phone books menus data base records or structs for dates or names together with other identifying information They are so common that they are primitive in some languages such as PHP Python or Perl and perhaps account for some of the popularity of these languages In many applications associative arrays are implemented as hash tables because of their performance characteristics We will develop them incrementally to understand the motivation underlying their design L ECTURE N OTES S EPTEMBER 28 2010 Hash Tables 3 L11 2 Chains In many applications requiring associative arrays we are storing complex data values and want to access them by a key which is derived from the data A typical example of keys are strings which are appropriate for many scenarios For example the key might be a student id and the data entry might be a collection of grades perhaps another associative array where the key is the name of assignment or exam and the data is a score We make the assumption that keys are unique in the sense that in an associative array there is at most one data item associated with a given key In some applications we may need to complicate the structure of keys to achieve this uniqueness This is consistent with ordinary arrays which have a unique value for every valid index A first idea to explore is to implement the associative array as a linked list called a chain If we have a key k and look for it in the chain we just traverse it compute the intrisic key for each data entry and compare it with k If they are equal we have found our entry if not we continue the search If we reach the end of the chain and do not find an entry with key k then no entry with the given key exists If we keep the chain unsorted this gives us O n worst case complexity for finding a key in a chain of length n assuming that computing and comparing keys is constant time Given what we have seen so far in our search data structures this seems very poor behavior but if we know our data collections will always be small it may in fact be reasonable on occasion Can we do better One idea goes back to binary search If keys are ordered we may be able to arrange the elements in an array or in the form of a tree and then cut the search space roughly in half every time we make a comparison This approch will occupy us for a few lectures after the first midterm Designing such data structures is a rich and interesting subject The best we can hope for with this approach is O log n where n is the number of entries We have seen that this function grows very slowly so this is quite a practical approach Nevertheless the challenge arises if we can do better than O log n say constant time O 1 to find an entry with a given key We know that it can done be for arrays indexed by integers which allow constant time access Can we also do it for example for strings L ECTURE N OTES S EPTEMBER 28 2010 Hash Tables 4 L11 3 Hashing The first idea behind hash tables is to exploit the efficiency of arrays So to map a key to an entry we first map a key to an integer and then use the integer to index an array A The first map is called a hash function We write it as hash Given a key k our access could then simply be A hash k There is an immediate problems with this approach there are only 231 positive integers so we would need a huge array negating any possible performance advantages But even if we were willing to allocate such a huge array there are many more strings than int s so there cannot be any hash function that always gives us different int s for different strings The solution is to allocate an array of smaller size say m and then look up the result of the hash function modulo m for example A hash k m This creates a new problem it is inevitable that multiple strings will map to the same array index For example if the array has size m then if we have more then m elements at least two must map to the same index In practice this will happen much sooner that this If two hash functions map a key to the same integer value modulo m we say we have a collision In general we would like to avoid collisions because some additional operations will be required to deal with them slowing down operations and taking more space We analyze the cost of collisions more below 5 Separate Chaining How do we deal with collisions of hash values The simplest is a technique called separate chaining Assume we have hash k1 m i hash k2 m where k1 and k2 are the distinct keys for two data entries e1 and e2 we want to store in the table In this case we just arrange e1 and e2 into a chain implemented as a linked list and store this list in A i In general each element A i in the array will either be NULL or a chain of entries All of these must have the same hash value for their key modulo m namely i As an exercise you might consider other data structures here instead of chains and weigh their merits how about sorted lists Or queues Or doubly linked lists Or another hash table We stick with chains because they are simple and fast provided the chains don t become too long This technique is called separate chaining because the chains are stored seperately not directly in the array Another technique which we do not discuss is linear probing where we continue by L ECTURE N OTES S EPTEMBER 28 2010 Hash Tables L11 4 searching …


View Full Document

CMU CS 15122 - Lecture Notes on Hash Tables

Loading Unlocking...
Login

Join to view Lecture Notes on Hash Tables and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Lecture Notes on Hash Tables and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?