Berkeley COMPSCI 186 - Tree-Structured Indexes - D2045354

Home> Schools> University of California, Berkeley> Computer Science (COMPSCI) > COMPSCI 186> Tree-Structured Indexes

DOC PREVIEW

Berkeley COMPSCI 186 - Tree-Structured Indexes

School name University of California, Berkeley

Course Compsci 186- Introduction to Database Systems

Pages 8

This preview shows page 1-2-3 out of 8 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 8 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 8 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 8 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 8 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

1Tree-Structured IndexesCS 186, Spring 2006, Lectures 5 &6R & G Chapters 9 & 10“If I had eight hours to chop down atree, I'd spend six sharpening my ax.” Abraham LincolnReview: Files, Pages, Records• Abstraction of stored data is “files” of “records”.– Records live on pages– Physical Record ID (RID) = <page#, slot#>•Variable length data requires more sophisticatedstructures for records and pages. (why?)– Records: offset array in header– Pages: Slotted pages w/internal offsets & free space area• Often best to be “lazy” about issues such as free spacemanagement, exact ordering, etc. (why?)• Files can be unordered (heap), sorted, or kinda sorted(i.e., “clustered”) on a search key.– Tradeoffs are update/maintenance cost vs. speed ofaccesses via the search key.– Files can be clustered (sorted) at most one way.• Indexes can be used to speed up many kinds ofaccesses. (i.e., “access paths”)Indexes: Introduction• Sometimes, we want to retrieve records by specifyingthe values in one or more fields, e.g.,– Find all students in the “CS” department– Find all students with a gpa > 3• An index on a file is a disk-based data structure thatspeeds up selections on the search key fields for theindex.– Any subset of the fields of a relation can be the search keyfor an index on the relation.–Search key is not the same as key (e.g. doesn’t have to beunique ID).• An index contains a collection of data entries, andsupports efficient retrieval of all records with a givensearch key value k.– Typically, index also contains auxiliary information thatdirects searches to the desired data entriesIndexes: Overview• Many indexing techniques exist:– B+ trees, hash-based structures, R trees, …• Can have multiple (different) indexes per file.– E.g. file sorted by age, with a hash index on salaryand a B+tree index on name.• Index Classification– What selections does it support– Representation of data entries in index• i.e., what kind of info is the index actually storing?• 3 alternatives here– Clustered vs. Unclustered Indexes– Single Key vs. Composite Indexes– Tree-based, hash-based, otherIndexes: What Selections do they support?• Selections of form field <op> constant• Equality selections (op is =)–Either “tree” or “hash” indexes help here.• Range selections (op is one of <, >, <=, >=, BETWEEN)–“Hash” indexes don’t work for these.• More exotic selections:– 2-dimensional ranges (“east of Berkeley and west of Truckeeand North of Fresno and South of Eureka”)• Or n-dimensional– 2-dimensional distances (“within 2 miles of Soda Hall”)• Or n-dimensional– Ranking queries (“10 restaurants closest to Berkeley”)– Regular expression matches, genome string matches, etc.– Keyword/Web search - includes “importance” of words indocuments, link structure, …Example Tree Index•Index entries:<search key value, page id>they direct search for data entries in leaves.• Example where each node can hold 2 entries;10* 15* 20* 27* 33* 37* 40*46*51*55*63*97*20 33 51 6340Root2Alternatives for Data Entry k* in Index• Question: What is actually stored in the leaves ofthe index for key value “k”? (i.e., what are the“data entries”?)• Three alternatives:1. Actual data record(s) with key value k2. {<k, rid of matching data record>}3. <k, list of rids of matching data records>• Choice is orthogonal to the indexing technique.– e.g., B+ trees, hash-based structures, R trees, …Alternatives for Data Entries (Contd.)• Alternative 1: Actual data record (with key value k)– If this is used, index structure is a file organizationfor data records (like Heap files or sorted files).– At most one index on a given collection of datarecords can use Alternative 1.– This alternative saves pointer lookups but can beexpensive to maintain with insertions anddeletions.Alternatives for Data Entries (Contd.)Alternative 2{<k, rid of matching data record>}and Alternative 3<k, list of rids of matching data records>• Easier to maintain than Alt 1.• If more than one index is required on a given file, atmost one index can use Alternative 1; rest must useAlternatives 2 or 3.• Alternative 3 more compact than Alternative 2, butleads to variable sized data entries even if search keysare of fixed length.• Even worse, for large rid lists the data entry wouldhave to span multiple blocks!Index Classification (continued)•Clustered vs. unclustered: If order of datarecords is the same as, or `close to’, order ofindex data entries, then called clustered index.– A file can be clustered on at most one search key.– Cost of retrieving data records through index variesgreatly based on whether index is clustered or not!– Alternative 1 implies clustered, but not vice-versa.Clustered vs. Unclustered Index• Suppose that Alternative (2) is used for data entries,and that the data records are stored in a Heap file.– To build clustered index, first sort the Heap file (withsome free space on each block for future inserts).– Overflow blocks may be needed for inserts. (Thus, order ofdata recs is `close to’, but not identical to, the sort order.)Index entriesData entriesdirect search for (Index File)(Data file)Data Recordsdata entriesData entriesData RecordsCLUSTEREDUNCLUSTEREDUnclustered vs. Clustered Indexes• What are the tradeoffs????• Clustered Pros– Efficient for range searches– May be able to do some types of compression– Possible locality benefits (related data?)– ???• Clustered Cons– Expensive to maintain (on the fly or sloppy withreorganization)3Cost ofOperationsB: The number of data pagesR: Number of records per pageD: (Average) time to read or write disk page((log2B)+B)D(because rd,wrt 0.5 file)(0.5B+1) DDelete((log2B)+B)D2DInsert[(log2 B) + #match pg]*DBDRangeSearch(log2 B) * D0.5 BDEqualitySearchBDBDScan allrecordsClustered File(67% Occupancy)Sorted FileHeap File1.5 BD(logF 1.5B) * D[(logF 1.5B) + #match pg]*D((logF 1.5B)+1)D((logF 1.5B)+1)DComposite Search Keys• Search on a combination offields.– Equality query: Every fieldvalue is equal to a constantvalue. E.g. wrt <age,sal>index:• age=20 and sal =75– Range query: Some fieldvalue is not a constant. E.g.:• age > 20; or age=20 andsal > 10• Data entries in index sortedby search key to supportrange queries.– Lexicographic order– Like the dictionary, but onfields, not letters!sue 13 75bobcaljoe

View Full Document