1CMU SCS15-826: Multimedia Databasesand Data MiningPrimary key indexing – hashingC. Faloutsos15-826 Copyright: C. Faloutsos (2005) 2CMU SCSOutlineGoal: ‘Find similar / interesting things’• Intro to DB• Indexing - similarity search• Data Mining15-826 Copyright: C. Faloutsos (2005) 3CMU SCSIndexing - Detailed outline• primary key indexing– B-trees and variants– (static) hashing– extendible hashing• secondary key indexing• spatial access methods• text• ...15-826 Copyright: C. Faloutsos (2005) 4CMU SCS(Static) HashingProblem: “find EMP record with ssn=123”What if disk space was free, and time was atpremium?15-826 Copyright: C. Faloutsos (2005) 5CMU SCSHashingA: Brilliant idea: key-to-address transformation:#0 page#123 page#999,999,999123; Smith; Main str15-826 Copyright: C. Faloutsos (2005) 6CMU SCSHashingSince space is NOT free:• use M, instead of 999,999,999 slots• hash function: h(key) = slot-id#0 page#123 page#999,999,999123; Smith; Main str215-826 Copyright: C. Faloutsos (2005) 7CMU SCSHashingTypically: each hash bucket is a page, holdingmany records:#0 page#h(123)M123; Smith; Main str15-826 Copyright: C. Faloutsos (2005) 8CMU SCSHashingNotice: could have clustering, or non-clusteringversions:#0 page#h(123)M123; Smith; Main str.15-826 Copyright: C. Faloutsos (2005) 9CMU SCS123...HashingNotice: could have clustering, or non-clusteringversions:#0 page#h(123)M...EMP file123; Smith; Main str....234; Johnson; Forbes ave345; Tompson; Fifth ave...15-826 Copyright: C. Faloutsos (2005) 10CMU SCSHashing - design decisions?• eg., IRS, 200M tax returns, by SSN15-826 Copyright: C. Faloutsos (2005) 11CMU SCSIndexing- overview• B-trees• hashing– hashing functions– size of hash table– collision resolution• Hashing vs B-trees• Indices in SQL15-826 Copyright: C. Faloutsos (2005) 12CMU SCSDesign decisions1) formula h() for hashing function2) size of hash table M3) collision resolution method315-826 Copyright: C. Faloutsos (2005) 13CMU SCSDesign decisions - functions• Goal: uniform spread of keys over hashbuckets• Popular choices:– Division hashing– Multiplication hashing15-826 Copyright: C. Faloutsos (2005) 14CMU SCSDivision hashingh(x) = (a*x+b) mod M• eg., h(ssn) = (ssn) mod 1,000– gives the last three digits of ssn• M: size of hash table - choose a primenumber, defensively (why?)15-826 Copyright: C. Faloutsos (2005) 15CMU SCS• eg., M=2; hash on driver-license number(dln), where last digit is ‘gender’ (0/1 = M/F)• in an army unit with predominantly malesoldiers• Thus: avoid cases where M and keys havecommon divisors - prime M guards againstthat!Division hashing15-826 Copyright: C. Faloutsos (2005) 16CMU SCSMultiplication hashingh(x) = [ fractional-part-of ( x * ) ] * M• : golden ratio ( 0.618... = ( sqrt(5)-1)/2 )• in general, we need an irrational number• advantage: M need not be a prime number• but must be irrational15-826 Copyright: C. Faloutsos (2005) 17CMU SCSOther hashing functions• quadratic hashing (bad)• ...• conclusion: use division hashing15-826 Copyright: C. Faloutsos (2005) 18CMU SCSDesign decisions1) formula h() for hashing function2) size of hash table M3) collision resolution method415-826 Copyright: C. Faloutsos (2005) 19CMU SCSSize of hash table• eg., 50,000 employees, 10 employee-records / page• Q: M=?? pages/buckets/slots15-826 Copyright: C. Faloutsos (2005) 20CMU SCSSize of hash table• eg., 50,000 employees, 10 employees/page• Q: M=?? pages/buckets/slots• A: utilization ~ 90% and– M: prime numberEg., in our case: M= closest prime to50,000/10 / 0.9 = 5,55515-826 Copyright: C. Faloutsos (2005) 21CMU SCSDesign decisions1) formula h() for hashing function2) size of hash table M3) collision resolution method15-826 Copyright: C. Faloutsos (2005) 22CMU SCSCollision resolution• Q: what is a ‘collision’?• A: ??15-826 Copyright: C. Faloutsos (2005) 23CMU SCSCollision resolution#0 page#h(123)M123; Smith; Main str.15-826 Copyright: C. Faloutsos (2005) 24CMU SCSCollision resolution• Q: what is a ‘collision’?• A: ??• Q: why worry about collisions/overflows?(recall that buckets are ~90% full)• A: ‘birthday paradox’515-826 Copyright: C. Faloutsos (2005) 25CMU SCSCollision resolution• open addressing– linear probing (ie., put to next slot/bucket)– re-hashing• separate chaining (ie., put links to overflowpages)15-826 Copyright: C. Faloutsos (2005) 26CMU SCSCollision resolution#0 page#h(123)M123; Smith; Main str.linear probing:15-826 Copyright: C. Faloutsos (2005) 27CMU SCSCollision resolution#0 page#h(123)M123; Smith; Main str.re-hashingh1()h2()15-826 Copyright: C. Faloutsos (2005) 28CMU SCSCollision resolution123; Smith; Main str.separate chaining15-826 Copyright: C. Faloutsos (2005) 29CMU SCSDesign decisions - conclusions• function: division hashing– h(x) = ( a*x+b ) mod M• size M: ~90% util.; prime number.• collision resolution: separate chaining– easier to implement (deletions!);– no danger of becoming full15-826 Copyright: C. Faloutsos (2005) 30CMU SCSIndexing- overview• B-trees• hashing– Hashing vs B-trees• Indices in SQL• extendible hashing615-826 Copyright: C. Faloutsos (2005) 31CMU SCSHashing vs B-trees:Hashing offers• speed ! ( O(1) avg. search time)..but:15-826 Copyright: C. Faloutsos (2005) 32CMU SCSHashing vs B-trees:..but B-trees give:• key ordering:– range queries– proximity queries– sequential scan• O(log(N)) guarantees for search, ins./del.• graceful growing/shrinking15-826 Copyright: C. Faloutsos (2005) 33CMU SCSHashing vs B-trees:thus:• B-trees are implemented in most systemsfootnotes:• hashing is rarely implemented (why not?)• ‘dbm’ and ‘ndbm’ of UNIX: offer one or both15-826 Copyright: C. Faloutsos (2005) 34CMU SCSIndexing- overview• B-trees• hashing– Hashing vs B-trees• Indices in SQL• extendible hashing15-826 Copyright: C. Faloutsos (2005) 35CMU SCSIndexing in SQL• create index <index-name> on <relation-name> (<attribute-list>)• create unique index <index-name> on<relation-name> (<attribute-list>)• drop index <index-name>15-826 Copyright: C. Faloutsos (2005) 36CMU SCSIndexing in SQL• eg.,create index ssn-indexon STUDENT (ssn)• or (eg., on TAKES(ssn,cid, grade) ):create index sc-indexon TAKES (ssn, c-id)715-826 Copyright: C. Faloutsos (2005) 37CMU SCSIndexing- overview• B-trees• hashing• Indices in SQL• extensible hashing– ‘extendible’ hashing
View Full Document