Unformatted text preview:

Home PageTitle PageContentsJJ IIJ IPage 1 of 28Go BackFull ScreenCloseQuitTree Based Indexes• Want to guarantee worst case time O(log n) for dictionary opera-tions on ordered data– Primarily want to work with large amounts of data– Also Want to support range queries efficiently• The data structure is called an index because– Data is stored with various ordered attributes (such as age,salary, employee identification number)– The range query may come on any of the attributes– We do not want to duplicate the dataHome PageTitle PageContentsJJ IIJ IPage 2 of 28Go BackFull ScreenCloseQuitStoring Large Amounts of Data• Store choice for large amounts of data is disk, not RAM or tape• Why not store in main memory?– Costs too much. Rs. 4000 will buy you either 256MB of RAMor 40GB of data– Volatile.• Reading or writing data on disk is slow (about 8ms).• Data is stored and retrieved in units called disk blocks or pages• Note that in fact, we have a memory hierarchy for even slower dataaccess (tapes for archival purposes)Home PageTitle PageContentsJJ IIJ IPage 3 of 28Go BackFull ScreenCloseQuitBinary Trees May Not Suffice• Assume we have 10 million items, each key is a name occupying 32bytes, and a record is 256 bytes– Average successful search takes 1.38 lg 107≈ 1.38 lg 221≈ 32disk accesses– In a time shared environment with 20 users, this might takeabout 5 seconds– Worse, a few nodes might be three times deeper and we need100 disk accesses or about 16 seconds• So searching 4 records would take about a minute• We might be willing to write complicate code to get better perfor-mance (e.g., 3 disk accesses instead of 32 or 100)Home PageTitle PageContentsJJ IIJ IPage 4 of 28Go BackFull ScreenCloseQuitMulti-way Branches Are Sometimes Useful• We get logmn performance for n nodes using m-way branches• A node now has m − 1 pieces of data and m pointers– If page size is 8192 = 213bytes a block contains 32(m − 1) keys.– We store m pointers to other blocks, and each pointer is 4 bytes.Thus total storage is 36m − 32 = 8192 or m = 228– An m-ary tree of height 3 has 2283≥ 10 million nodes– The search time is about 0.47 second.– And we get other pieces of data in our buffer!• 32 has reduced to 3, but we probably don’t want to write compli-cated code to make 3 main memory accesses instead of 100 (memoryaccess time is in nanoseconds)• But maybe we cannot afford enough RAMHome PageTitle PageContentsJJ IIJ IPage 5 of 28Go BackFull ScreenCloseQuitThe DBMS Provides Many Abstractions• For several reasons, a programmer doesn’t deal directly with disksbut is provided various abstractions by the database managementsystem (DBMS)– The time to access a page depends on its location on disk. Care-ful placement of pages on the disk to exploit the geometry canminimize access time– The illusion of several disks (in an array) posing as one disk canbe provided• The buffer manager brings pages into RAM• Database pages are organized into files (not to be confused withoperating system files), and higher level DBMS code view the dataas a collection of records with record id (rid) pointing to physicaldisk pagesHome PageTitle PageContentsJJ IIJ IPage 6 of 28Go BackFull ScreenCloseQuitFiles of Records• So instead of asking for pages, the programmers asks to– Insert/delete/modify records– Read a particular record (using rid)– Scan a subset of records (with some conditions on the search)• The records are in one to one correspondence with data, and so donot fit in main memory either• File structures include– Unordered, linked list like or sorted files– Or hashed files (a directory like structure)• An index is an auxiliary structure to the basic file structure that isvalue-based (rather than rid based)Home PageTitle PageContentsJJ IIJ IPage 7 of 28Go BackFull ScreenCloseQuitIndexes• An index contains a collection of data entries and supports efficientretrieval of all data entries with a given key value k• A data entry consists of the key k∗ and enough information toretrieve the physical record that contains data matching the key k• So for an employee database– We might have an index on salary. The index (stored on disk)will now contain pointers to relevant pages that will supportefficient access of salary information– We might have an index on age. The index file now containspointers to support efficient access of age related information• As mentioned earlier, if data entries are stored in sorted order,binary search can be performed, but this is costly• Stores data entry pages and, in addition, index pages in a tree-likefashion is more efficient.Home PageTitle PageContentsJJ IIJ IPage 8 of 28Go BackFull ScreenCloseQuitThe Basic Abstraction of B+ trees• So just like non-linear main memory data structures such as AVLtrees, we have B+ trees• The basic abstraction we care about for an object x– If the object is in main memory, we can refer to the fields of theobject– Otherwise, we need to perform a DiskRead(x) to fetch the data– We write the data back to disk using DiskWrite(x)• We do not care that underneath the hood, the file manager and thebuffer manager are called to manage physical pages on disk• We want to minimize these disk accesses.• We prefer a one-pass algorithm instead of the usual top-down bot-tom up passes (such as in the rebalancing method of AVL trees)Home PageTitle PageContentsJJ IIJ IPage 9 of 28Go BackFull ScreenCloseQuitB+-Tree Node Structure• Rooted tree where every node x has n[x] the number of keys cur-rently stored in x, the n[x] keys themselves stored in nondecreasingorder so that keys k1[x] ≤ k2[x] ≤ . . . kn[x][x], and a boolean valueleaf[x] to indicate whether x is a leaf• Each internal node also contains n[x] + 1 child pointers ci[x]• Leaf nodes are data entry nodes and point to the actual data.• The keys ki[x] separate the range of keys stored in each subtree: Ifpiis any key stored in the subtree with root ci[x], then p1< k1[x] ≤p2< k2[x] ≤ . . . < kn[x][x] ≤ pn[x]+1Home PageTitle PageContentsJJ IIJ IPage 10 of 28Go BackFull ScreenCloseQuitB+-Tree Constraints• All leaf nodes are at the same depth and chained together in adoubly linked list• For t ≥ 2– Each node other than the root must have at least t − 1 keys.– Every node can contain at most 2t − 1 keys.– The root node has 1 to 2t − 1 keys (unless the tree is empty)• Some things to note–


View Full Document

UMD CMSC 420 - Tree Based Indexes

Download Tree Based Indexes
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Tree Based Indexes and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Tree Based Indexes 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?