Lecture 6 Indexing GiST Sept 13 2006 ChengXiang Zhai Most slides are adapted from Kevin Chang s lecture slides CS511 Advanced Database Management Systems 1 Search Trees Previous Approaches Specialized search trees yet another tree redundant code most trees are very similar concurrency control logging recovery tricky Trees for extensible data types B tree for any data with linear ordering e g index titles alph ordering with B tree problem does not support natural queries e g WHERE book title has database CS511 Advanced Database Management Systems 2 GiST Generalized Search Tree General cover B tree R tree etc Extensible domain specific data types queries definable Easy to extend six key methods for a new tree Efficient can match specialized trees Reusable concurrency recovery for indexes CS511 Advanced Database Management Systems 3 Example Indexing Book Titles Titles of 4 books T1 database optimization T2 web database T3 complexity of optimization algorithms T4 algorithms and complexity Indexable with extensible B tree linear ordering T4 T3 T1 T2 Note Just an example for demonstrating GiST What we will do to index titles is not the best and typical way to index textual data No notion of fuzzy relevance stay tuned for text and web search CS511 Advanced Database Management Systems 4 Queries on Titles Indexing is to accommodate query processing What predicates to ask about titles CS511 Advanced Database Management Systems 5 Queries on Titles Equality predicates WHERE book title web databases Containment predicates WHERE book title has web Prefix predicates WHERE book title start with web RegEx predicates generalize all the others WHERE book title like web database CS511 Advanced Database Management Systems 6 Extensible B Tree for Titles d c T4 alg T3 complexity w T1 database T2 web Observations indexed values have linear ordering T4 T3 T1 T2 keys are simply separators T4 c T3 d T1 w T2 CS511 Advanced Database Management Systems 7 Using B Tree What s Wrong What predicates can B tree support well equality containing prefix regex d c T4 alg T3 complexity CS511 Advanced Database Management Systems w T1 database T2 web 8 GiST Generalizing Balanced Search Trees GiST is not universal just reasonable generalization balanced tree of key ptr pairs keys can overlap GRE test B Tree R Tree R Tree What is the key generalization key1 key2 internal nodes directory leaf nodes linked list CS511 Advanced Database Management Systems 9 The Key Generalization The Key Key evolution 1 D separator 2 D MBR predicates R Tree B Tree generalizing key from 1 D line to 2 D area bounding range to minimal bounding region GiST R Tree generalizing key from 2 D MBR to predicates a predicate that all values v in subtree will satisfy B tree keys k1 k2 contains k1 k2 v R tree keys x1 y1 x2 y2 contains x1 y1 x2 y2 v RD tree keys x1 xk subset x1 xk v CS511 Advanced Database Management Systems 10 Gist for Title Indexing Predicates Must first determine predicates What query predicates to support equality equal v web db containing has v web What key predicate to use Criteria for choosing key predicates What do you suggest CS511 Advanced Database Management Systems 11 GiST for Title Indexing Predicates Key predicates Contains S v SL SR alg comp opt db opt web SLL SLR alg comp comp opt T4 alg T3 complexity CS511 Advanced Database Management Systems SRL SRR db opt db web T1 database T2 web 12 GiST Built in Tree Operations Search root R predicate q Insert root R entry E level l Delete root R entry E CS511 Advanced Database Management Systems 13 GiST Application Specific Methods Search Consistent E q search subtree E for predicate q Labeling Union E1 En how to label the union of E1 En Categorization Penalty E1 E2 penalty for inserting E2 in subtree E1 PickSplit E1 En how to split into two groups of entries Compression storage time tradeoff Compress E E Ec Decompress Ec E such that E p implies E p CS511 Advanced Database Management Systems 14 Search Operation Consistent Method Search root R predicate q traverse subtrees where Consistent true return leaf entries that are consistent CS511 Advanced Database Management Systems 15 Consistent Method Consistent E q Can E p and q both hold Does E p imply not q Title GiST key predicate p Contains S v or simply S e g SL alg comp opt e g SR db opt web Consistent SL has v web how to implement Consistent SR equals v web database how to implement CS511 Advanced Database Management Systems 16 Insert Operation Insert root R entry E level l descend tree minimizing potential increase in Penalty stop at level specified if there is room at node insert there else split according to PickSplit propagate changes using Union to adjust keys Why do we need a level parameter CS511 Advanced Database Management Systems 17 Title GiST Insert Where to insert T5 complexity of web algorithms SL SR alg comp opt db opt web SLL SLR alg comp comp opt T4 alg T3 complexity CS511 Advanced Database Management Systems SRL SRR db opt db web T1 database T2 web 18 Penalty Method Penality E1 E2 penalty for inserting E2 in subtree E1 Title GiST E2 with S comp web alg i e T5 complexity of web algorithms Where to insert root SL alg comp opt vs SR db opt web Penalty how to implement CS511 Advanced Database Management Systems 19 PickSplit Method PickSplit E1 En how to split into two groups of entries Title GiST suppose we have 3 entries after an Insert S1 alg comp S2 comp opt S3 comp web alg new how to split S1 S2 S3 into two something similar to R tree algorithm will do CS511 Advanced Database Management Systems 20 Union Method Union E1 En Generates a label for the subtree with E1 En Title GiST key predicate p Contains S v or simply S S1 alg comp S2 comp opt Combined key Union E1 SL ptr1 E2 SR ptr2 how to implement CS511 Advanced Database Management Systems 21 Compress Decompress Method Key storage vs search time tradeoff Compress E E Ec Decompress Ec E p can be looser than E p less pruning power Lossy compression may need more time for search Title GiST Any suggestions CS511 Advanced Database Management Systems 22 Title GiST Compress Decompress Example 1 no compression Compress E Ec E Decompress Ec E Ec Example 2 compress by taking word initials Compress algorithm complexity optimization al co op Decompress al co op al co op CS511 Advanced Database Management Systems 23 GiST No Magic It offers only what its model is based on It does not represent all possible index structures e g duplicate objects by multiple inserts R tree e g
View Full Document