IE172 – Algorithms in Systems Engineering
Lecture 13
Pietro Belotti
Department of Industrial & Systems Engineering
Lehigh University
February 9, 2009
(Midterm on February 23)

Sorting and searching algorithms: what for?
◮ Sorting and searching are usually done on a key
◮ The key is a unique identifier of an object that contains a value, i.e., more information
  e.g. A medical record (value) is associated with a social security number (key)
  e.g. An airport and all its flight information (value) is uniquely identified by a three-letter key
◮ When sorting a list of keys, we always assume that each key is only part of a pair (k, v); k is the key and v the value
◮ When searching in an array, RB tree, etc., we assume that once we’ve found the key we can also access the value
◮ There would be little use for searching/sorting algorithms otherwise...

Fast search: is log n fast enough?
You want to find a key in a sorted array A. You can
◮ Use binary search, which takes O(log n) time. Or
◮ Use an array, B, where B[i] is the position in A of key i.
  e.g. If A is   2 5 6 9 10 12
       then B is - - 0 - - 1 2 - - 3 4 - 5
◮ Indeed B[2]==0 and A[B[2]]==2
This is direct addressing, as fast as O(1).
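The direct-addressing idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: keys are non-negative integers with no duplicates, and the helper name build_direct_address_table is hypothetical (it does not appear in the lecture). Empty slots, shown as "-" on the slide, are stored as None.

```python
def build_direct_address_table(sorted_keys):
    """Return B such that B[k] is the position of key k in sorted_keys.

    Assumes non-negative integer keys without duplicates.
    Slots for keys that are not present hold None.
    """
    if not sorted_keys:
        return []
    # One slot per possible key value, from 0 up to the largest key.
    table = [None] * (max(sorted_keys) + 1)
    for position, key in enumerate(sorted_keys):
        table[key] = position
    return table


A = [2, 5, 6, 9, 10, 12]
B = build_direct_address_table(A)

print(B)                 # 13 slots, one for each key value 0..12
print(B[2], A[B[2]])     # position of key 2 in A, then the key itself
```

Each lookup B[k] is a single array access, but the table needs one slot for every possible key value, whether or not that key is present.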
Problems:
◮ If A is 42,691,023 410,415,623 1,241,090,221, B needs a couple of GB of memory for an array of 3 elements...
◮ If A contains strings, real numbers, or other types (classes), what index do we use?
◮ If A has duplicate keys, which one do we choose?

Fast search in an array
◮ What we actually need is a function f that tells us where an element is in an array
◮ f takes an argument (410,415,623) and returns an index (1).
◮ We saw one such function f, with complexity O(log n): it is the binary search procedure in binary search trees
◮ Can we find a function with better performance (possibly not requiring 1GB of memory)?
◮ Yes, with Hash Tables
◮ A hash table is a data structure in which we can “look up” an element efficiently.
◮ The expected time to search for an element in a hash table is O(1) (the worst-case time is Θ(n)).

More on Hash
◮ A hash table is a table, i.e., an array, with m elements; usually m ≪ n, as otherwise the array would be too large
◮ Given a key k, we don’t use k itself as the index into the array¹; rather, we have a hash function h, and we use h(k) as an index into the array.
◮ Given a “universe” K of keys with n = |K| (e.g. words in a dictionary, social security numbers, Lehigh usernames...)
◮ h : K → {0, 1, ..., m − 1}, so that every k ∈ K is mapped to an integer h(k) between 0 and m − 1
◮ We say that k hashes to h(k)
◮ h converts a key (string, large number, real number) into an integer from 0 to m − 1
  e.g. h(42,691,023) = 0; h(410,415,623) = 1; h(1,241,090,221) = 2
¹ We saw why two slides ago: k could be large, fractional, or a string.

The thing that should not be (done)
◮ Suppose m = |K|. Find “coffee” in a dictionary, |K| = 100,000
◮ ‘c’ is the third letter of our 26-letter alphabet.
  Words starting with ‘c’ are between positions $\frac{2}{26} \cdot 100{,}000$ and $\frac{3}{26} \cdot 100{,}000$
  ⇒ k = “coffee” is at h(“coffee”) = $\frac{2.5}{26} \cdot 100{,}000 = 9{,}615$
◮ But all words starting with ‘c’ hash to 9,615!
◮ Use the second letter, ‘o’ (15th letter of the alphabet):
  h(“coffee”) = $\frac{2.5}{26} \cdot 100{,}000 + \frac{14.5}{26} \cdot \frac{100{,}000}{26} = 11{,}759$
  If we refined h further, we could probably associate each word with its right position. Problem #1: how?
◮ Problem #2: |K| is too large a value for m. We can only use a table with m ≪ |K|
  ⇒ Oops... many keys will hash to the same number!
◮ No big deal, the problem remains that of finding a proper h

Duplicates
OK, so suppose we have m = 26 · 26 = 676: our hash table has one entry for every pair of letters.
◮ All words that begin with “co” hash to the same result
◮ If there are q words that begin with “co” in the dictionary, all will hash to a single position
◮ In general, what happens if $h(k_1) = h(k_2)$ for $k_1 \neq k_2$?
◮ Two keys hash to the same value: the elements collide
◮ Put all such words in a list: chaining
  ⇒ look for “coffee” in that list (with size q): O(q)
Instead of storing a key k at every position in the array, we store a linked list of keys.

Balancedness, again
◮ If all keys of K hash to the same value, looking up a key takes Θ(|K|): it is a one-row hash table with a long list
◮ We need a hash function to be “random”: a key k is equally likely to hash into any of the m slots in the table
  e.g.
  There are many more words beginning with “co” than with “hx”: the 676 lists would be very “unbalanced”
  ⇒ The first+second letter rule is bad
◮ If we have such a function, then we can show that the average time required to search for a key is $\Theta(1 + n/m)$
◮ When hashing keys that are not numbers, you must convert them to numbers, as we did for “coffee”
◮ Example: $3 \cdot 26^5 + 15 \cdot 26^4 + 6 \cdot 26^3 + 6 \cdot 26^2 + 5 \cdot 26^1 + 5 \cdot 26^0$
◮ Example: $-2145 + 3^6 + 15^5 + 6^4 + 6^3 + 5^2 + 5^1$

Average Hash Search Time
◮ The number of elements to be searched is 1 more than the number of elements that appear before x in x's list.
◮ Assuming we insert items into the list at the beginning, this is the number of elements that were inserted after x.
◮ By definition: $P(h(k_i) = h(k_j)) = \frac{1}{m}$
◮ Let $X_{ij}$ be the indicator random variable that is equal to one if and only if $h(k_i) = h(k_j)$
◮ Then just compute (see the proof on page 228):
  $E\left[\frac{1}{n} \sum_{i=1}^{n} \left(1 + \sum_{j=i+1}^{n} X_{ij}\right)\right] = 1 + \frac{n-1}{2m}$
  ⇒ The larger m, the quicker the search (obviously)

Hash Functions
Modular Hash Function (division method)
◮ Let m be (roughly) the size of your hash table:
◮ h(k) = k mod m (i.e., $k - m \lfloor k/m \rfloor$, the remainder of $k/m$)
◮ Good choice of m: a prime number not too close to an exact power of 2
Multiplicative Hash Function
◮ h(k) = ⌊m(kA mod 1)⌋
◮ Multiply key k by A, take the fractional part, and multiply by m
◮ If m = $2^p$ this can be done very fast with bit shifting
◮ A ≈ φ = (√5 − 1)/2 seems a good value

Beating statistics: universal hashing
◮ You choose m = 52,315, so h(k) = k mod 52,315
◮ Coincidentally, all the keys k that you want to store in the hash table are of the form k = 13 + q · 52,315.
  e.g. {104643, 156958, 209273, 261588, 313903, 366218}
  ⇒ h(k) = 13 for all the inserted keys
◮ Or you choose m = 200, so h(k) = k mod 200
◮ Coincidentally, all the keys k that you want to store in the hash table are of the form k = 7 + q · 200.
  e.g. {407, 807, 1407, 2007}
  ⇒ h(k) = 7 for all the inserted keys
◮ Looks unlikely, but if we use 10,000 instead of 52,315...
◮ (and this is why m should not be a power of ten, or
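The hash functions from the last few slides can be tried out with a short Python sketch. The function names (string_to_number, division_hash, multiplication_hash) are hypothetical and chosen for illustration. The letter encoding a=1, ..., z=26 is an assumption read off the “coffee” example, and the multiplicative version uses floating-point arithmetic rather than the bit-shifting trick mentioned for m a power of 2.

```python
import math


def string_to_number(word):
    """Read a lowercase word as digits in base 26, with a=1, ..., z=26.

    Assumed encoding, matching 3*26^5 + 15*26^4 + ... for "coffee".
    """
    value = 0
    for letter in word:
        value = value * 26 + (ord(letter) - ord("a") + 1)
    return value


def division_hash(k, m):
    """Division method: h(k) = k mod m."""
    return k % m


# A = (sqrt(5) - 1) / 2, the value suggested on the slide.
A = (math.sqrt(5) - 1) / 2


def multiplication_hash(k, m):
    """Multiplication method: h(k) = floor(m * (k*A mod 1)), in floating point."""
    return math.floor(m * ((k * A) % 1))


# Keys of the form 13 + q * 52315 and 7 + q * 200, as on the slide.
keys_a = [13 + q * 52315 for q in range(2, 8)]
keys_b = [407, 807, 1407, 2007]

print(string_to_number("coffee"))
print([division_hash(k, 52315) for k in keys_a])   # every key lands in slot 13
print([division_hash(k, 200) for k in keys_b])     # every key lands in slot 7
print([multiplication_hash(k, 52315) for k in keys_a])
```

With the division method, both key sets collapse into a single slot, which is exactly the adversarial pattern the slide describes. The last line prints the slots the multiplicative function assigns to the same keys, for comparison.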


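Chaining, as described on the "Duplicates" slide, can be sketched as follows. This is a minimal illustration, not the lecture's implementation: the class name ChainedHashTable is hypothetical, Python lists stand in for the linked lists, the hash function is the division method, keys are assumed to be non-negative integers, and the stored values ("record A", ...) are made-up placeholders.

```python
class ChainedHashTable:
    """Hash table with chaining: each of the m slots holds a list of (key, value) pairs."""

    def __init__(self, m):
        self.m = m
        self.slots = [[] for _ in range(m)]   # m independent, initially empty chains

    def _hash(self, key):
        # Division method, h(k) = k mod m; assumes integer keys.
        return key % self.m

    def insert(self, key, value):
        chain = self.slots[self._hash(key)]
        for i, (existing_key, _) in enumerate(chain):
            if existing_key == key:
                chain[i] = (key, value)       # key already present: overwrite its value
                return
        # New keys go at the beginning of the chain, as the slides assume.
        # (With a Python list this costs O(chain length); a linked list would be O(1).)
        chain.insert(0, (key, value))

    def search(self, key):
        # Only the chain of slot h(key) is scanned, so the cost is O(chain length).
        for existing_key, value in self.slots[self._hash(key)]:
            if existing_key == key:
                return value
        return None


table = ChainedHashTable(7)
table.insert(42691023, "record A")
table.insert(410415623, "record B")
table.insert(1241090221, "record C")

print(table.search(410415623))   # record B
print(table.search(5))           # None: key 5 was never inserted
```

If every inserted key hashed to the same slot, search would scan one chain holding all the keys, which is the Θ(n) worst case mentioned earlier.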