MIT OpenCourseWare http ocw mit edu 6 006 Introduction to Algorithms Spring 2008 For information about citing these materials or our Terms of Use visit http ocw mit edu terms Lecture 5 Hashing I Chaining Hash Functions 6 006 Spring 2008 Lecture 5 Hashing I Chaining Hash Functions Lecture Overview Dictionaries and Python Motivation Hash functions Chaining Simple uniform hashing Good hash functions Readings CLRS Chapter 11 1 11 2 11 3 Dictionary Problem Abstract Data Type ADT maintains a set of items each with a key subject to insert item add item to set delete item remove item from set search key return item with key if it exists assume items have distinct keys or that inserting new one clobbers old balanced BSTs solve in O lg n time per op in addition to inexact searches like nextlargest goal O 1 time per operation Python Dictionaries Items are key value pairs e g d algorithms 5 cool 42 d items d cool d 42 cool in d 42 in d algorithms 5 cool 5 42 KeyError True False Python set is really dict where items are keys 1 Lecture 5 Hashing I Chaining Hash Functions 6 006 Spring 2008 Motivation Document Distance already used in def count frequency word list D for word in word list if word in D D word 1 else D word 1 new docdist7 uses dictionaries instead of sorting def inner product D1 D2 sum for key in D1 if key in D2 sum D1 key D2 key optimal n document distance assuming dictionary ops take O 1 time PS2 How close is chimp DNA to human DNA Longest common substring of two strings e g ALGORITHM vs ARITHMETIC Dictionaries help speed algorithms e g put all substrings into set looking for duplicates n2 operations 2 Lecture 5 Hashing I Chaining Hash Functions 6 006 Spring 2008 How do we solve the dictionary problem A simple approach would be a direct access table This means items would need to be stored in an array indexed by key 1 2 key item key item key item Figure 1 Direct access table Problems 1 keys must be nonnegative integers or using two arrays integers 2 large key range large space e g one key of 2256 is bad news 2 Solutions Solution 1 map key space to integers In Python hash object where object is a number string tuple etc or object implementing hash Misnomer should be called prehash Ideally x y hash x hash y Python applies some heuristics e g hash B 64 hash C Object s key should not change while in table else cannot nd it anymore No mutable objects like lists 3 Lecture 5 Hashing I Chaining Hash Functions 6 006 Spring 2008 Solution 2 hashing verb from hache hatchet Germanic Reduce universe U of all keys say integers down to reasonable size m for table idea m n n k k keys in dictionary hash function h U 1 m 1 T k1 U k k k k k 1 k3 1 h k 1 1 2 4 3 k2 m 1 Figure 2 Mapping keys to a table two keys ki kj K collide if h ki h kj How do we deal with collisions There are two ways 1 Chaining TODAY 2 Open addressing NEXT LECTURE 4 Lecture 5 Hashing I Chaining Hash Functions 6 006 Spring 2008 Chaining Linked list of colliding elements in each slot of table U k k k k k1 2 4 k1 k4 k2 3 k3 h k 1 h k 2 h k 4 Figure 3 Chaining in a Hash Table Search must go through whole list T h key Worst case all keys in k hash to same slot n per operation Simple Uniform Hashing an Assumption Each key is equally likely to be hashed to any slot of table independent of where other keys are hashed let n keys stored in table m slots in table load factor n m average keys per slot Expected performance of chaining assuming simple uniform hashing The performance is likely to be O 1 the 1 comes from applying the hash function and access slot whereas the comes from searching the list It is actually 1 even for successful search see CLRS Therefore the performance is O 1 if O 1 i e m n 5 Lecture 5 Hashing I Chaining Hash Functions 6 006 Spring 2008 Hash Functions Division Method h k k mod m k1 and k2 collide when k1 k2 mod m i e when m divides k1 k2 ne if keys you store are uniform random but if keys are x 2x 3x regularity and x and m have common divisor d then use only 1 d of table This is likely if m has a small divisor e g 2 if m 2r then only look at r bits of key Good Practice A good practice to avoid common regularities in keys is to make m a prime number that is not close to power of 2 or 10 Key Lesson It is inconvenient to nd a prime number division slow Multiplication Method h k a k mod 2w w r where m 2r and w bit machine words and a odd integer between 2 w 1 and 2w Good Practise a not too close to 2 w 1 or 2w Key Lesson Multiplication and bit extraction are faster than division w k a x r Figure 4 Multiplication Method 6
View Full Document
Unlocking...