6.006 Introduction to Algorithms
Lecture 7
Prof. Manolis Kellis
CLRS: 11.4, 17
Unit #2 – Genomes, Hashing, and Dictionaries

Unit #2: Hashing
• Last Tues: Genomes, Dictionaries, Hashing
 – Intro, basic operations, collisions and chaining
 – Simple uniform hashing assumption
• Last Thur: Faster hashing, hash functions
 – Hash functions in practice: div/mult/Python
 – Faster hashing: rolling hash: O(n²) → O(n lg n)
 – Faster comparison: signatures: mismatch time
• Today: Space issues
 – Dynamic resizing and amortized analysis
 – Open addressing, deletions, and probing
 – Advanced topics: universal hashing, fingerprints

Today: Hashing III: Space issues
• Review: hash functions, chaining, SUHA, rolling hashes, signatures
• Dynamic dictionaries: resizing hash tables
 – When to resize: insertions, deletions
 – Resizing operations, amortized analysis
• Open addressing: doing away with linked lists
 – Operations: insertion, deletion
 – Probing: linear probing, double hashing
 – Performance analysis: UHA, open addressing vs. chaining
• Advanced topics: randomized algorithms (shh!)
 – Universal hashing, perfect hashing
 – Fingerprinting, file signatures, false positives

Remember Hashing I and II
• Hashing and hash functions
 – Humongous universe of keys → itty bitty little space
• Hash table as dictionary
 – Insert / Search / Delete
• Collisions by chaining
 – Build a linked list in each bucket
 – Operation time is the length of the list
• Simple Uniform Hashing Assumption (SUHA)
 – Every item hashes to a uniformly random bucket
 – n items in a size-m table → average list length n/m = α
• Speeding up hashing
 – Rolling hash: fast sequence of hashes
 – Signatures: fast comparison, avoid frequent mismatches
• Comparing genomes
 – O(n⁴) → binary search O(n³ lg n) → hashing O(n² lg n) → rolling/signatures O(n lg n)
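The chaining recap above can be sketched in a few lines of Python. This is a minimal illustration, not lecture code: the class and method names are assumptions, a Python list stands in for each bucket's linked list, and the bucket index uses the division method `hash(key) % m`. Under SUHA the expected chain length is the load factor α = n/m.

```python
class ChainedHashTable:
    """Toy dictionary with collisions resolved by chaining."""

    def __init__(self, m=8):
        self.m = m                              # number of buckets
        self.buckets = [[] for _ in range(m)]   # each list plays the linked list
        self.n = 0                              # number of stored items

    def _bucket(self, key):
        return self.buckets[hash(key) % self.m]  # division-method hash

    def insert(self, key, value):
        chain = self._bucket(key)
        for i, (k, _) in enumerate(chain):
            if k == key:                # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))
        self.n += 1

    def search(self, key):
        # Cost is the chain length: expected O(1 + alpha) under SUHA.
        for k, v in self._bucket(key):
            if k == key:
                return v
        return None

    def delete(self, key):
        chain = self._bucket(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain.pop(i)
                self.n -= 1
                return
```

Note that all three operations walk only one chain, which is why the table's performance degrades gracefully as the load factor α grows.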
Dynamic Dictionaries
• In the substring application, we inserted everything at once, then scanned
• More generally: an arbitrary sequence of insert, delete, find
• How do we know how big the table will get?
• What if we guess wrong?
 – Too small → load factor high, operations too slow
 – Too large → high initialization cost, wasted space, potentially more cache misses
• Want m = Θ(n) at all times

Solution: Resize when needed
• Start the table at a small constant size
• When the table gets too full, make it bigger
• When the table gets too empty, make it smaller
• How? Build a whole new hash table and insert the items
 – Pick a new hash 'seed', recompute all hashes
 – Recreate the new linked lists
• Time spent to rebuild: Θ(new size) + #hashes × Θ(HashTime)

When to resize?
• Approach 1: whenever n > m, rebuild the table to new size n
 – Sequence of n inserts: each increases n past m and causes a rebuild
 – Total work: Θ(1 + 2 + … + n) = Θ(n²)  [a factor of Θ(HashTime) is suppressed here]
• Approach 2: whenever n ≥ 2m, rebuild the table to new size n
 – Costly inserts: the inserts reaching n = 2^i for each i; these cost Θ(1 + 2 + 4 + … + n) = Θ(n)
 – All other inserts take O(1) time (why? they trigger no rebuild)
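Approach 2's geometric series can be checked empirically. The sketch below (illustrative names, not lecture code) doubles the table whenever n ≥ 2m and counts one unit of rebuild work per re-inserted key; the total stays below 2n, confirming Θ(n) rather than Θ(n²).

```python
rebuild_work = 0  # total keys re-inserted across all rebuilds

def insert_all(num_keys):
    """Insert the keys 0..num_keys-1, doubling the table when n >= 2m."""
    global rebuild_work
    m = 1
    table = [[] for _ in range(m)]
    n = 0
    for key in range(num_keys):
        n += 1
        if n >= 2 * m:                       # too full: rebuild bigger
            m *= 2                           # keeps m a power of 2
            new_table = [[] for _ in range(m)]
            for chain in table:              # recompute every hash: Theta(n) work
                for k in chain:
                    new_table[hash(k) % m].append(k)
                    rebuild_work += 1
            table = new_table
        table[hash(key) % m].append(key)
    return m

final_m = insert_all(1000)
# Rebuilds fire at n = 2, 4, 8, ..., 512, re-inserting 1 + 3 + 7 + ... + 511
# = 1013 keys in total: well under 2n = 2000, i.e. Theta(n).
```

The "costly inserts" are exactly the ones landing on a power of two; every other insert appends to one chain in O(1).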
• Inserting n items takes O(n) total time
• Keeps m a power of 2: good for the mod operation

Amortized Analysis
• If a sequence of n operations takes time T, then each operation has amortized cost T/n
 – Like amortizing a loan: a fixed payment per month
• Rebuilding when n ≥ 2m: some operations are very slow
 – Θ(n) for the insertion that causes the last resize
• But on average, fast: O(1) amortized cost per operation
• Often we only care about the total runtime, so averaging is fine

Insertions + Deletions?
• Rebuild the table to a new size when n < m? Same as the bad insert policy: O(n²) work
• Rebuild when n < m/2? Makes a sequence of deletes fast
• What about an arbitrary sequence of inserts and deletes? Suppose we have just rebuilt, so m = n
 – If the next rebuild is a grow: at least m more inserts are needed before growing the table
  • Amortized cost O(2m / m) = O(1): the rebuild cost is paid for by the m insertions
 – If the next rebuild is a shrink: at least m/2 more deletes are needed before shrinking
  • Amortized cost O((m/2) / (m/2)) = O(1): the rebuild cost is paid for by the m/2 deletions

Putting the two together
• Algorithm
 – Keep m a power of 2 (good for mod)
 – Grow (double m) when n ≥ m
 – Shrink (halve m) when n ≤ m/4
• Analysis
 – Just after a rebuild: n = m/2
 – If the next rebuild is a grow: at least m/2 more inserts
  • Amortized cost O(2m / (m/2)) = O(1)
 – If the next rebuild is a shrink: at least m/4 more deletes
  • Amortized cost O((m/2) / (m/4)) = O(1)

Summary
• Arbitrary sequence of insert/delete/find
• O(1) amortized time per operation
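The combined grow/shrink policy from the slide can be sketched as follows. This is an illustrative toy (class and method names are assumptions, keys are assumed distinct): grow (double m) when n ≥ m, shrink (halve m) when n ≤ m/4, so that just after any rebuild n = m/2 and m stays Θ(n).

```python
class ResizableTable:
    """Toy chained hash set with the grow/shrink resizing policy."""

    def __init__(self):
        self.m = 2
        self.table = [[] for _ in range(self.m)]
        self.n = 0

    def _rebuild(self, new_m):
        old = self.table
        self.m = new_m
        self.table = [[] for _ in range(new_m)]
        for chain in old:                        # recompute all hashes
            for k in chain:
                self.table[hash(k) % new_m].append(k)

    def insert(self, key):
        self.n += 1
        if self.n >= self.m:                     # too full: double m
            self._rebuild(2 * self.m)
        self.table[hash(key) % self.m].append(key)

    def delete(self, key):
        chain = self.table[hash(key) % self.m]
        if key in chain:
            chain.remove(key)
            self.n -= 1
            if self.m > 2 and self.n <= self.m // 4:  # too empty: halve m
                self._rebuild(self.m // 2)

    def contains(self, key):
        return key in self.table[hash(key) % self.m]
```

After a grow, n equals the old m, which is half the new m; after a shrink, n equals a quarter of the old m, which is again half the new m. Either way the next rebuild is at least m/4 operations away, giving the O(1) amortized bound.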