IE172 – Algorithms in Systems Engineering
Lecture 13
Pietro Belotti
Department of Industrial & Systems Engineering
Lehigh University
February 9, 2009
(Midterm on February 23)

Sorting and searching algorithms: what for?
◮ Sorting and searching are usually done on a key
◮ The key is a unique identifier of an object that contains a value, i.e., more information
   e.g. A medical record (value) is associated with a social security number (key)
   e.g. An airport and all its flight information (value) is uniquely identified by a three-letter key
◮ When sorting a list of keys, we always assume that each key is only part of a pair (k, v); k is the key and v the value
◮ When searching in an array, RB tree, etc., we assume that once we’ve found the key we can also access the value
◮ There would be little use for searching/sorting algorithms otherwise...

Fast search: is log n fast enough?
You want to find a key in a sorted array A. You can
◮ Use binary search, which takes O(log n) time. Or
◮ Use an array, B, where B[i] is the position in A of key i.
   e.g. If A is 2 5 6 9 10 12 then
        B is - - 0 - - 1 2 - - 3 4 - 5
◮ Indeed B[2]==0 and A[B[2]]==2
This is direct addressing, as fast as O(1).
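The direct-addressing idea above can be sketched in a few lines of Python. This is only an illustration, assuming the keys are distinct non-negative integers; the function name `build_direct_address_table` is hypothetical and does not come from the lecture.

```python
def build_direct_address_table(A):
    """Return B such that B[k] is the index of key k in A, or None if k is absent.

    Assumption (not from the slides): keys are distinct non-negative integers.
    """
    B = [None] * (max(A) + 1)          # one slot for every possible key 0..max(A)
    for position, key in enumerate(A):
        B[key] = position              # O(1) write; a later lookup is also O(1)
    return B


A = [2, 5, 6, 9, 10, 12]
B = build_direct_address_table(A)

print(B[2])       # position of key 2 in A
print(A[B[2]])    # the key found at that position
print(len(B))     # B needs max(A) + 1 slots, however few keys A holds
```

Note that the length of B depends on the largest key, not on the number of keys stored.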
Problems:
◮ If A is 42,691,023 410,415,623 1,241,090,221 then B needs a couple of GB of memory for an array of 3 elements...
◮ If A contains strings, real numbers, or other types (classes), what index do we use?
◮ If A has duplicate keys, which one do we choose?

Fast search in an array
◮ What we actually need is a function f that tells us where an element is in an array
◮ f takes an argument (410,415,623) and returns an index (1).
◮ We saw one such function f, with complexity O(log n): it is the binary search procedure in binary search trees
◮ Can we find a function with better performance (possibly not requiring 1GB of memory)?
◮ Yes, with Hash Tables
◮ A hash table is a data structure in which we can “look up” an element efficiently.
◮ The expected time to search for an element in a hash table is in O(1) (worst-case time in Θ(n)).

More on Hash
◮ A hash table is a table, i.e., an array, with m elements; usually m ≪ n, as otherwise the array would be too large
◮ Given a key k, we don’t use k itself as the index into the array¹; rather, we have a hash function h, and we use h(k) as an index into the array.
◮ Given a “universe” K of keys with n = |K| (e.g. words in a dictionary, social security numbers, Lehigh usernames...)
◮ h : K → {0, 1, ..., m − 1}, so that h(k) gets mapped to an integer between 0 and m − 1 for every k ∈ K
◮ We say that k hashes to h(k)
◮ h converts a key (string, large number, real number) into an integer from 0 to m − 1
   e.g. h(42,691,023) = 0; h(410,415,623) = 1; h(1,241,090,221) = 2
¹ We saw why two slides ago: k could be large, fractional, or a string.

The thing that should not be (done)
◮ Suppose m = |K|. Find “coffee” in a dictionary, |K| = 100,000
◮ ‘c’ is the third letter of our 26-letter alphabet.
   Words starting with ‘c’ are between positions $\frac{2}{26} \cdot 100{,}000$ and $\frac{3}{26} \cdot 100{,}000$
   ⇒ k = “coffee” is at $h(\text{“coffee”}) = \frac{2.5}{26} \cdot 100{,}000 \approx 9{,}615$
◮ But all words starting with ‘c’ hash to 9,615!
◮ Use the second letter, ‘o’ (15th letter of the alphabet):
   $h(\text{“coffee”}) = \frac{2.5}{26} \cdot 100{,}000 + \frac{14.5}{26} \cdot \frac{100{,}000}{26} \approx 11{,}760$
   If we refined h further, we could probably associate each word with its right position. Problem #1: how?
◮ Problem #2: |K| is too large a value for m. We can only use a table with m ≪ |K|
   ⇒ Oops... many keys will hash to the same number!
◮ No big deal, the problem remains that of finding a proper h

Duplicates
OK, so suppose we have m = 26 · 26 = 676: our hash table has one entry for every pair of letters.
◮ All words that begin with “co” hash to the same result
◮ If there are q words that begin with “co” in the dictionary, all will hash to a single position
◮ In general, what happens if $h(k_1) = h(k_2)$ for $k_1 \neq k_2$?
◮ Two keys hash to the same value: the elements collide
◮ Put all such words in a list: chaining
   ⇒ look for “coffee” in that list (with size q): O(q)
Instead of storing a key k at every position in the array, we store a linked list of keys.

Balancedness, again
◮ If all keys of K hash to the same value, looking up a key takes Θ(|K|): it is a one-row hash table with a long list
◮ We need a hash function to be “random”: a key k is equally likely to hash into any of the m slots in the table
   There are many more words beginning with “co” than with “hx”: the 676 lists would be very “unbalanced”
   ⇒ The first+second letter rule is bad
◮ If we have such a function, then we can show that the average time required to search for a key is $\Theta(1 + \frac{n}{m})$
◮ When hashing keys that are not numbers, you must convert them to numbers, as we did for “coffee”
◮ Example: $3 \cdot 26^5 + 15 \cdot 26^4 + 6 \cdot 26^3 + 6 \cdot 26^2 + 5 \cdot 26^1 + 5 \cdot 26^0$
◮ Example: $-2145 + 3^6 + 15^5 + 6^4 + 6^3 + 5^2 + 5^1$

Average Hash Search Time
◮ The number of elements to be searched is 1 more than the number of elements that appear before x in x's list.
◮ Assuming we insert items into the list at the beginning, this is the number of elements that were inserted after x.
◮ By definition: $P(h(k_i) = h(k_j)) = \frac{1}{m}$ (for $i \neq j$)
◮ Let $X_{ij}$ be the indicator random variable that is equal to one if and only if $h(k_i) = h(k_j)$
◮ Then just compute (see proof at page 228):
   $$E\left[\frac{1}{n}\sum_{i=1}^{n}\left(1 + \sum_{j=i+1}^{n} X_{ij}\right)\right] = 1 + \frac{n-1}{2m}$$
   ⇒ The larger m, the quicker the search (obviously)

Hash Functions
Modular Hash Function (division method)
◮ Let m be (roughly) the size of your hash table:
◮ h(k) = k mod m (i.e., $k - m\lfloor k/m \rfloor$, the remainder of $k/m$)
◮ Good choice of m: a prime number not too close to an exact power of 2
Multiplicative Hash Function
◮ h(k) = ⌊m(kA mod 1)⌋
◮ Multiply key k by A, take the fractional part, and multiply by m
◮ If $m = 2^p$ this can be done very fast with bit shifting
◮ A ≈ φ = (√5 − 1)/2 seems a good value

Beating statistics: universal hashing
◮ You choose m = 52,315, so h(k) = k mod 52,315
◮ Coincidentally, all the keys k that you want to store in the hash table are of the form k = 13 + q · 52,315.
   e.g. {104643, 156958, 209273, 261588, 313903, 366218}
   ⇒ h(k) = 13 for all the inserted keys
◮ Or you choose m = 200, so h(k) = k mod 200
◮ Coincidentally, all the keys k that you want to store in the hash table are of the form k = 7 + q · 200.
   e.g. {407, 807, 1407, 2007}
   ⇒ h(k) = 7 for all the inserted keys
◮ Looks unlikely, but if we use 10,000 instead of 52,315...
◮ (and this is why m should not be a power of ten, or
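The division and multiplication methods from the slides, combined with chaining, can be sketched in Python as below. This is a minimal illustration and not the course's reference implementation: the names `division_hash`, `multiplicative_hash` and `ChainedHashTable` are hypothetical, Python lists stand in for linked lists, and the prime table size 211 is an arbitrary choice made here for comparison with m = 200.

```python
import math


def division_hash(k, m):
    """Division method: h(k) = k mod m."""
    return k % m


def multiplicative_hash(k, m, A=(math.sqrt(5) - 1) / 2):
    """Multiplication method: h(k) = floor(m * (k*A mod 1))."""
    return math.floor(m * ((k * A) % 1.0))


class ChainedHashTable:
    """Slot i holds a list (the chain) of (key, value) pairs whose key hashes to i."""

    def __init__(self, m, hash_function=division_hash):
        self.m = m
        self.hash_function = hash_function
        self.slots = [[] for _ in range(m)]

    def insert(self, key, value):
        chain = self.slots[self.hash_function(key, self.m)]
        for i, (existing_key, _) in enumerate(chain):
            if existing_key == key:      # key already stored: overwrite its value
                chain[i] = (key, value)
                return
        chain.insert(0, (key, value))    # new items go at the beginning of the chain

    def search(self, key):
        # Scan only the chain of the slot that the key hashes to.
        for existing_key, value in self.slots[self.hash_function(key, self.m)]:
            if existing_key == key:
                return value
        return None

    def longest_chain(self):
        return max(len(chain) for chain in self.slots)


keys = [407, 807, 1407, 2007]            # all of the form 7 + q * 200

bad = ChainedHashTable(200)              # every key lands in slot 7
good = ChainedHashTable(211)             # 211 is prime (illustrative choice)
for k in keys:
    bad.insert(k, f"record {k}")
    good.insert(k, f"record {k}")

print(bad.longest_chain(), good.longest_chain())
print(bad.search(1407))
```

With m = 200 all four keys share one chain, so a search degenerates into a list scan; with the prime table size the same keys fall into four different slots.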