EWU CSCD 300 - Hashing Background - D2636091

Home> Schools> Eastern Washington University> Computer Science (CSCD) > CSCD 300> Hashing Background

EWU CSCD 300 - Hashing Background

School name Eastern Washington University

Pages 21

Download Save

Unformatted text preview:

CSCD 326 Data Structures I HashingHashing BackgroundHashing - Basic IdeasSimple Example of hashingGoals for Hashing FunctionsHash Function Construction MethodsHash Function Construction Methods (2)Hash Function Construction Methods (3)Hash Function Construction Methods (4)Hash Function Construction Methods (5)Hash Function Construction Methods (6)Hash Function Construction Methods (7)Linked Collision ProcessingLinked Collision Processing (2)Linked Collision Processing (3)Linked Collision Processing (4)Linear Collision ProcessingSlide 18Analysis of Linear ProbingRehashingQuadratic Probing1CSCD 326 Data Structures IHashing2Hashing BackgroundGoal: provide a constant time complexity method of searching for stored dataThe best traditional searching time complexity available is O(log2n) for binary searchBinary search requires that data be stored in sorted order.Hashing approach to data storage and retrieval:Contiguous memory is not used and memory is sacrificed for speed.Often used for symbol table management in compilers, assemblers, and linker/loaders.3Hashing - Basic IdeasData storage - hashing relies primarily on arrays for data storage but not on contiguous storage within the arrayData storage/retrieval method: use a math function which, when given the key or data value to be stored, returns an array index in which to store the value.This is referred to as a "hash function."The same function will be used to retrieve the value later on.4Simple Example of hashingEmployee data is to be stored using employee number as a key.Employee numbers are unique and run from 10,000 to 19,999.Storage: use an array of size 10,000.Hash function: Emp. Number - 10000 provides a unique index into the array and that array location is used to store/retrieve information for this employee.Problem: key values (in other situations) are often not unique or do not fall into a range which allows a reasonable size array.5Goals for Hashing FunctionsThe same key value (value used for insertion) should always return the same index.If it does not - data can't be retrieved later.As much as possible - different key values should not hash to the same index.This is done by mixing things up with the hash function so that common patterns in key values do not hash to the same locations.This can never be prevented however - so collision handling becomes an issue.6Hash Function Construction MethodsUsing numeric ASCII values of characters:Example key: JUNKAdd ASCII values of characters (74 + 85 + 78 + 75) to produce a single integer (312).This may suffice but the integer produced is not unique to "JUNK".7Hash Function Construction Methods (2)Concatenation of ASCII values:Represent A - Z as integers 0-25 and concatenate these values.So JUNK becomes: 20 13 10901001101000110101010= 3158182521021532768=3231024=32232=321and so the concatenation can be expressed as: 9 * 323 + 20 * 322 + 13 * 321 + 10 = 3158188Hash Function Construction Methods (3)Using the mod operator:Allows reduction of large values into the range of actual hash table indices.in the example above if the table is an array of size 10000 --315818 % 10000 = 5818.Note here that the mod operator simply removes the first two digits and this makes the hashed value less unique to the string used to generate it.9Hash Function Construction Methods (4)Using the mod operator:Problems with use of mod operator - choice of exact table size is very important - if there are a large number of common factors - many collisions can be generated.e.g. table size 15 Key values 10, 20, 30, 40, 50, 60, 70 - here 7 values hash to three indices - 30,60 to 0 - 20,50 to 5 and 10,40,70 to 10Solution - use an array size which is prime - thus it can't have any common factors with key values.10Hash Function Construction Methods (5)Using pseudo-random number generators:Given the same starting seed pseudo-random number generators always produce the same sequence of values.Here use a number generated from the key string as a seed and use the first resulting pseudo-random sequence value to generate the hash table index.11Hash Function Construction Methods (6)FoldingScrambles numeric values to remove the effects of recurring patterns- e.g. add the numeric values.Boundary FoldingBreaks numbers into segments and adds digits in the segments.e.g. social security numbers: 534-65-9234 - breaks at dashes - hash value is 534 + 65 + 9234Fan Folding Like boundary but reverses the digits in every other value.12Hash Function Construction Methods (7)Digit or character extractionAnother way to scramble similar patterns in multiple keys - can be used in two ways:1) Simply remove characters likely to be similar in many keys (or use dissimilar characters).2) Mid-Square techniqueRepresent key as a number.Square the number.Extract from the middle of the squared value enough bits to form an array index.13Linked Collision ProcessingLinked method of collision overflow handling divides memory into two parts:One part for primary storage (the hash table itself)A separate secondary part for collision overflow (may be either dynamically allocated or a separate fixed allocation area).14Linked Collision Processing (2)Linked collision overflow handling:Assume the hash table is composed of an array of objects which contain an instance variable which is a reference to an object of the same type.On collision: dynamically allocate a new node and place data into it. link the new node through the reference.overflowed items are stored in a linked list off the original table item.15Linked Collision Processing (3)Primary Memory(Hash Table)Secondary Memory(Overflow)16Linked Collision Processing (4)Search time with linked overflowIf there have been many collisions - the search is no longer constant time complexity since a sequential search must be done through the linked list.Thus the time complexity becomes O(n) where n is the number of collisions.17Linear Collision ProcessingAlso called Linear Probing - no primary and secondary memory - original array holds both.When a collision occurs: Start at hashed location (site of first collision)Proceed sequentially through the array until available storage is found - store at this locationThe array must be treated circularly since a probe could reach the end and need to start again at beginning.18Linear Collision ProcessingProblem with linear probing: clustering If the hash function produces one value more than others - parts of the table will quickly fill up while others are empty.Clustering

View Full Document


School:
Email:
New Password:
Confirm Password:

EWU CSCD 300 - Hashing Background

Sign up for free to view:

Please select your school