MIT 6 006 - Lecture 6: Hashing II: Table Doubling, Rolling Hash, Karp-Rabin


Lecture 6 Hashing II: Table Doubling, Rolling Hash, Karp-Rabin, 6.006 Fall 2009

Lecture Overview
• Table Resizing
• Amortization
• DNA Comparison, Karp-Rabin Rolling Hash

Readings
CLRS Chapter 17 and 32.2.

Recall:
• Hashing with Chaining:

Figure 1: Chaining in a Hash Table (labels: U, the universe of all possible keys; K, the set of actual keys, not known in advance, with |K| = n; a table of m slots, m ≈ n; h, the hash function, with h(k2) = h(k4) = h(k5) sharing one slot; expected chain length α = n/m)

• Simple Uniform Hashing Assumption (silently used in all of this lecture):
Each key is equally likely to be hashed to any slot of the table, independent of where other keys are hashed.
⇒ average # keys per slot is α = n/m
⇒ expected time to search, insert, delete = O(1 + α).

Caution: The above bound assumes that applying the hash function h(·) takes O(1) time. Sometimes this is not the case, e.g. if the keys are strings. Then the keys need to be processed into numbers and then hashed to {0, 1, ..., m − 1}.
Then the above bound is scaled by the time needed for applying h.

• Good Hash Functions:
– Division Method: $h(k) = k \bmod m$
Good Practice: m is a prime number and not close to a power of 2 or 10.
– Multiplication Method: $h(k) = [(a \cdot k) \bmod 2^w] \gg (w - r)$, where
  - $\gg$ denotes the “shift right” operator,
  - $2^r$ is the table size (= m),
  - w is the bit-length of the machine words,
  - and a is chosen to be an odd integer between $2^{w-1}$ and $2^w$.
Good Practice: a not too close to $2^{w-1}$ or $2^w$.

Figure 2: Multiplication Method (labels: w-bit words k and a; of the product, r bits are kept and the remaining bits are ignored; product as sum, lots of mixing)

How Large should Table be?
• want m = Θ(n) at all times
• Why?
m too small ⇒ slow (recall operations take expected time Θ(1 + α), where α = n/m);
m too big ⇒ wasteful

Challenge: Don’t know how large n will get at creation.
Idea: Start small (constant) and grow (or shrink) as necessary.

Table Resizing with Rehashing:
To change m, build a new hash table from scratch:
Allocate table of size m;
For each item in old table:
    insert into new table
⇒ Θ(n + m) time = Θ(n), if m = Θ(n)

How fast to grow?
When n reaches m, say
• m += 1?
⇒ rebuild every step
⇒ n inserts cost $\Theta(1 + 2 + \cdots + n) = \Theta(n^2)$
• m *= 2? m = Θ(n) still
⇒ rebuild at insertion $2^i$, pay $2^{i+1}$ (see Figure 3)
⇒ n inserts cost $\Theta(1 + 2 + 4 + 8 + \cdots + n) = \Theta(n)$, where n is really the next power of 2
• a few inserts cost linear time, but Θ(1) “on average”.

Amortized Analysis
This is a common technique in data structures, like paying rent: $1500/month ≈ $50/day
• if a sequence of n operations has total cost ≤ n · T(n), then each operation has amortized cost T(n)
• “T(n) amortized” roughly means T(n) “on average”, but averaged over all ops.
• e.g. inserting into a hash table (with doubling) takes O(1) amortized time.

Figure 3: Resizing by Doubling (x-axis: item count). Costly inserts: item # 1, 2, 4, 8, 16, ...
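To make the doubling argument concrete, here is a minimal Python sketch of a chained hash table that doubles m whenever n reaches m and counts how many items the rebuilds move. The class name `ChainedHashTable`, the `moves` counter, and the helper `total_moves` are illustrative assumptions and not part of the lecture; the slot index uses the division method on top of Python's built-in `hash`.

```python
class ChainedHashTable:
    """Hash table with chaining that doubles m when n reaches m (illustrative)."""

    def __init__(self):
        self.m = 1                      # number of slots; start small (constant)
        self.n = 0                      # number of stored items
        self.slots = [[] for _ in range(self.m)]
        self.moves = 0                  # items re-inserted during rebuilds

    def _index(self, key, m):
        return hash(key) % m            # division method on Python's built-in hash

    def _rebuild(self, new_m):
        # Build a new table from scratch: Theta(n + m) work.
        new_slots = [[] for _ in range(new_m)]
        for chain in self.slots:
            for key, value in chain:
                new_slots[self._index(key, new_m)].append((key, value))
                self.moves += 1
        self.m, self.slots = new_m, new_slots

    def insert(self, key, value):
        chain = self.slots[self._index(key, self.m)]
        for i, (k, _) in enumerate(chain):
            if k == key:                # overwrite an existing key
                chain[i] = (key, value)
                return
        chain.append((key, value))
        self.n += 1
        if self.n == self.m:            # n reached m: double the table
            self._rebuild(2 * self.m)

    def search(self, key):
        for k, v in self.slots[self._index(key, self.m)]:
            if k == key:
                return v
        return None


def total_moves(n):
    """Insert n distinct keys and return how many items the rebuilds moved."""
    table = ChainedHashTable()
    for i in range(n):
        table.insert(i, str(i))
    return table.moves
```

In this sketch the rebuilds happen at item counts 1, 2, 4, 8, ..., so the total number of moved items is a geometric sum that stays below 2n. That is the Θ(n) total, and hence O(1) amortized cost per insert, claimed above.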
Resizing penalty is constant on average.

Back to Hashing:
Maintain m = Θ(n), so also support search in O(1) expected time, assuming simple uniform hashing.

Delete:
Also O(1) expected time
• space can get big with respect to n, e.g. n× insert, n× delete
• solution: when n decreases to m/4, shrink table to m/2
⇒ O(1) amortized cost for both insert and delete; the analysis is trickier (see CLRS 17.4).

Rolling Hash: Human vs Chimp
Given two strings S and T, find the longest common substring of the two strings.
Naive algorithm: $\Theta(n^4)$.
Naive + binary search: $\Theta(n^3 \log n)$.
Winner algorithm from last lecture runs in time $\Theta(n^2 \log n)$, using hash tables:

For all possible lengths $\ell$:
• (Step 1) Insert all substrings s of S of length $\ell$ into a hash table using some hash function h;
• (Step 2) For all substrings t of T of length $\ell$, check if position h(t) of the dictionary is occupied. If yes (say by substring s of string S), compare s and t. If the strings agree, return s and t and exit.

Analysis:
Outer Loop: using binary search on the length of the longest common substring, only O(log n) iterations are needed!
For every $\ell$:
Step 1:
• there are $n - \ell + 1$ substrings s of S of length $\ell$;
• need to convert each of them into an integer; how?
think of $s := S[i : i + \ell]$ as a multi-digit number base b, where b is larger than the alphabet size:
$s \mapsto S[i] \cdot b^{\ell-1} + S[i+1] \cdot b^{\ell-2} + \cdots + S[i+\ell-1]$
• use hash function $h(s) = s \bmod m$
• How to compute h(s) without writing down the above expression?
Mod arithmetic magic!
Claim: For all integers a, b:
$(a + b) \bmod m = ((a \bmod m) + (b \bmod m)) \bmod m$
$(a \cdot b) \bmod m = ((a \bmod m) \cdot (b \bmod m)) \bmod m$
• hence, computation of hash takes time $O(\ell)$ for each substring s.
⇒ total time for Step 1: $O(\ell \cdot (n - \ell + 1)) = O(n^2)$.

Step 2:
• there are $n - \ell + 1$ substrings t of T of length $\ell$;
• every hash operation takes $O(\ell)$, plus another potential $O(\ell)$ for comparing strings if hashes match.
⇒ total time for Step 2: $O(\ell \cdot (n - \ell + 1)) = O(n^2)$.

OVERALL time is $O(n^2 \log n)$.

OUR GOAL: Drop time down to $O(n^2)$ initialization time + $O(n \log n)$ execution time, by making both Step 1 and Step 2 take O(n).

Idea: use $h_1 := h(S[i : i + \ell])$ to compute $h_2 := h(S[i+1 : i + \ell + 1])$ in O(1) time.
How?
Go from
$h_1 = (S[i] \cdot b^{\ell-1} + S[i+1] \cdot b^{\ell-2} + \cdots + S[i+\ell-1]) \bmod m$
to
$h_2 = (S[i+1] \cdot b^{\ell-1} + S[i+2] \cdot b^{\ell-2} + \cdots + S[i+\ell]) \bmod m$
without recomputing $h_2$ from scratch.

Magic again, called Rolling Hash, and introduced by Karp-Rabin:
$h_2 = (S[i+1] \cdot b^{\ell-1} + S[i+2] \cdot b^{\ell-2} + \cdots + S[i+\ell]) \bmod m$
$= ((S[i] \cdot b^{\ell-1} + S[i+1] \cdot b^{\ell-2} + \cdots + S[i+\ell-1]) \cdot b + S[i+\ell] - S[i] \cdot b^{\ell}) \bmod m$
$= (h_1 \cdot b + S[i+\ell] - S[i] \cdot (b^{\ell} \bmod m)) \bmod m.$

Now Step 1 takes time O(n) overall. What about Step 2? Also time O(n), except if there are too many spurious matches of hash values that do not actually result in matching substrings: each comparison takes $O(\ell)$, and if many unsuccessful comparisons are made this could increase the time for Step 2 to $O(n\ell)$.
But: For each substring of T the probability of a false-positive hash-value match
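The rolling-hash update derived above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the lecture's own code: the function names (`direct_hash`, `rolling_hashes`, `common_substring_of_length`) are hypothetical, characters are mapped to integers with `ord`, and the base `b = 256` and modulus `m = 1_000_003` are arbitrary illustrative choices (b should exceed the alphabet size, so this assumes character codes below 256).

```python
def direct_hash(s, b, m):
    """Hash s as a base-b number mod m in O(len(s)), reducing mod m at each step."""
    h = 0
    for ch in s:
        h = (h * b + ord(ch)) % m
    return h


def rolling_hashes(S, ell, b=256, m=1_000_003):
    """Return the hash of every length-ell window of S, with O(1) work per shift."""
    n = len(S)
    if ell <= 0 or ell > n:
        return []
    b_ell = pow(b, ell, m)              # b^ell mod m, computed once
    h = direct_hash(S[:ell], b, m)      # first window from scratch: O(ell)
    hashes = [h]
    for i in range(n - ell):
        # h2 = (h1 * b + S[i + ell] - S[i] * (b^ell mod m)) mod m
        h = (h * b + ord(S[i + ell]) - ord(S[i]) * b_ell) % m
        hashes.append(h)
    return hashes


def common_substring_of_length(S, T, ell, b=256, m=1_000_003):
    """Return a common substring of length ell, or None if there is none."""
    table = {}                          # hash value -> start positions in S
    for i, h in enumerate(rolling_hashes(S, ell, b, m)):   # Step 1
        table.setdefault(h, []).append(i)
    for j, h in enumerate(rolling_hashes(T, ell, b, m)):   # Step 2
        for i in table.get(h, []):
            if S[i:i + ell] == T[j:j + ell]:   # rule out spurious hash matches
                return S[i:i + ell]
    return None
```

The explicit string comparison in Step 2 is what keeps the result correct when two different substrings happen to share a hash value; the hash only filters candidates. Python's `%` returns a non-negative result for a positive modulus, so the subtraction in the update needs no extra correction here; in languages where `%` can be negative, adding m before reducing would be needed.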

