CMU CS 15122 - Lecture Notes on Hash Tables

Lecture Notes on Hash Tables
15-122: Principles of Imperative Computation
Frank Pfenning
Lecture 11, September 28, 2010

Contents: Introduction, Associative Arrays, Chains, Hashing, Separate Chaining, Average Case Analysis, Randomness, Implementation

1 Introduction

In this lecture we introduce so-called associative arrays, that is, data structures that are similar to arrays but are indexed not by integers but by other forms of data such as strings. One popular data structure for implementing associative arrays is the hash table. To analyze the asymptotic efficiency of hash tables we have to explore a new point of view, that of average case complexity. Another computational thinking concept that we see for the first time is randomness. In order for hash tables to work efficiently in practice we need hash functions whose behavior is predictable (deterministic) but has some aspects of randomness.

2 Associative Arrays

Arrays can be seen as a mapping, associating with every integer in a given interval some data item. An array is finitary, because its domain, and therefore also its range, is finite. There are many situations when we want to index elements differently than just by integers. Common examples are strings (for dictionaries, phone books, menus, database records) or structs (for dates, or for names together with other identifying information). Associative arrays are so common that they are primitive in some languages such as PHP, Python, or Perl, and perhaps account for some of the popularity of these languages. In many applications, associative arrays are implemented as hash tables because of their performance characteristics. We will develop them incrementally to understand the motivation underlying their design.

3 Chains

In many applications requiring associative arrays, we are storing complex data values and want to access them by a key which is derived from the data. A typical example of keys are strings, which are appropriate for many scenarios.
For example, the key might be a student id and the data entry might be a collection of grades, perhaps another associative array where the key is the name of an assignment or exam and the data is a score. We make the assumption that keys are unique, in the sense that in an associative array there is at most one data item associated with a given key. In some applications we may need to complicate the structure of keys to achieve this uniqueness. This is consistent with ordinary arrays, which have a unique value for every valid index.

A first idea to explore is to implement the associative array as a linked list, called a chain. If we have a key k and look for it in the chain, we just traverse it, compute the intrinsic key for each data entry, and compare it with k. If they are equal, we have found our entry; if not, we continue the search. If we reach the end of the chain without finding an entry with key k, then no entry with the given key exists. If we keep the chain unsorted, this gives us O(n) worst case complexity for finding a key in a chain of length n, assuming that computing and comparing keys takes constant time.

Given what we have seen so far in our search data structures, this seems like very poor behavior, but if we know our data collections will always be small, it may in fact be reasonable on occasion.

Can we do better? One idea goes back to binary search. If keys are ordered, we may be able to arrange the elements in an array or in the form of a tree and then cut the search space roughly in half every time we make a comparison. This approach will occupy us for a few lectures after the first midterm. Designing such data structures is a rich and interesting subject. The best we can hope for with this approach is O(log(n)), where n is the number of entries. We have seen that this function grows very slowly, so this is quite a practical approach.

Nevertheless, the question arises whether we can do better than O(log(n)), say, constant time O(1) to find an entry with a given key.
We know that it can be done for arrays, indexed by integers, which allow constant-time access. Can we also do it, for example, for strings?

4 Hashing

The first idea behind hash tables is to exploit the efficiency of arrays. So: to map a key to an entry, we first map the key to an integer and then use the integer to index an array A. The first map is called a hash function, which we write as hash(k) for a key k. Our access could then simply be A[hash(k)].

There is an immediate problem with this approach: there are only 2^31 positive integers, so we would need a huge array, negating any possible performance advantage. But even if we were willing to allocate such a huge array, there are many more strings than ints, so there cannot be any hash function that always gives us different ints for different strings.

The solution is to allocate an array of smaller size, say m, and then look up the result of the hash function modulo m, for example, A[hash(k)%m]. This creates a new problem: it is inevitable that multiple strings will map to the same array index. For example, if the array has size m, then as soon as we have more than m elements, at least two must map to the same index. In practice, this will happen much sooner than that.

If two keys map to the same integer value (modulo m), we say we have a collision. In general, we would like to avoid collisions, because some additional operations will be required to deal with them, slowing down operations and taking more space. We analyze the cost of collisions in more detail below.

5 Separate Chaining

How do we deal with collisions of hash values? The simplest technique is called separate chaining. Assume we have hash(k1)%m = i = hash(k2)%m, where k1 and k2 are the distinct keys for two data entries e1 and e2 we want to store in the table.
In this case we just arrange e1 and e2 into a chain (implemented as a linked list) and store this list in A[i].

In general, each element A[i] in the array will either be NULL or a chain of entries. All of these must have the same hash value for their key (modulo m), namely i. As an exercise, you might consider other data structures here instead of chains and weigh their merits: how about sorted lists? Or queues? Or doubly-linked lists? Or another hash table?

We stick with chains because they are simple and fast, provided the chains don't become too long. This technique is called separate chaining because the chains are stored separately, not directly in the array. Another technique, which we do not discuss, is linear probing.

