DOC PREVIEW
MIT 6 830 - Lecture Notes

This preview shows page 1-2 out of 6 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 6 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 6 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 6 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

6.830/6.814 — Notes∗ for Lecture 6: Access Methods Carlo A. Curino September 29, 2010 1 Announcements • Problem Set 2 is out today... due in two weeks... be aware that is going to overlap to LAB 2. Projects ideas and rules are posted online.• 2 Readings For this class the suggested readings are: • In Database Management Systems, read: – Pages 273-289. If you are using another book, this is the introduc-tion to the Section on Storage and Indexing which discusses different access methods and their relative performance. – Pages 344-358. This is an in-depth discussion of the B+Tree data structure and its implementation. Most database books, as well as any algorithms text (such as CLR or Knuth) will provide an equiva-lent discussion of B+Trees. – (was not assigned, but is an important reading) Pages 370-378 on Static Hashing and Extensible Hashing ”The R*-Tree: An Efficient and Robust Access Method for Points and • Rectangles.” Beckmann et al, in The Red Book. ∗These notes are only meant to be a guide of the topics we touch in class. Future notes are likely to be more terse and schematic, and you are required to read/study the papers and book chapters we mention in class, do homeworks and Labs, etc.. etc.. 13 Recap We presented a broad overview of the DBMS internals. We point out how important coming up with a good plan is. I claimed disk I/O can be very important. Remember I mentioned this last time: CPU cost (# of instructions) 1Ghz == 1 billions instrs/sec 1 nsec / instr RAM access 50ns I/Ocost(#ofpagesread,#ofseeks) 100MB/sec 10nsec/byte (RandomI/O=pageread+seek) 10msec/seek 100seeks/sec 1 seek = 10M instructions!!! Moreover there is a big difference from sequential and random IO. Example: • Read 10KB from a random location on disk: 10ms + 100µs = 10.1ms; • Read 10KB+10KB from a random location on disk (two blocks are next to each other): 10ms + 200µs = 10.2ms; • Read 10KB+10KB from two random locations on disk: 10ms + 100µs + 10ms +100µs= 20.2ms; WOW! So saving disk I/O, and in particular random ones is VERY impor-tant!! DB are usually well designed to: 1) avoid excessive disk I/O and try to be sequential, and 2) have a LOT of drives available (TPC-H competing machines: 144 cores AND 1296 disks!!) Today we study how to do a good job at reducing Disk I/O by organize stuff intelligently on disk, next lecture we study what we should keep in RAM. 4 Access Methods Today we get into more details on Access Methods, i.e., on the portion of the DBMS in charge of managing the data on disk. We will show a bunch of organization of data, and their performance in supporting typical accesses we need to support queries. Next lecture we will study the Buffer Manager, which tries to reduce the access to disk. What are the functionalities we need to support: scan• • search (equality) 2• search (range) insert • delete• Various access methods: • heap file: unordered, typically implemented as a linked list of pages • sorted file: ordered records, expensive to maintain • index file: data + extra structures around to quickly access data – might contain data (primary index) – or point at the data, often stored in a heapfile or other index (sec-ondary index) – if the data are sorted in the same order of the field is index, we say is a clustered index (we will see this is good for scans since disk accesses are sequential) Type of indexes: hash• B+trees• R*trees• 4.1 Data organization within file file organization: • pages • records (record ids: page id, slot id) page layout: • fixed length records page of slots, • free bit map ”slotted page” structure for var length records or slot direc-tory (slot offset, len) What about big records? Hard to place, and might overflow on another page. tuple layout (similar story): • fixed length (structure know by the system catalog, can predict where fields start) 35 • variable length, field slots, two options: delimiters or directory with point-ers/offsets What happen when the field size changes? Need to move stuff... so if we have a mix of fixed/variable, the fixed fields are best to be stored first. Null values? Good trick is using the pointers/offset.. if two have same value, it means the field in between is null... this makes storing nulls 0 space overhead. Cost model (enhanced one, from the book) • Heap Files: Equality selection on key; exactly one match. • Sorted Files: Files compacted after deletions. Indexes: • – Alt (2), (3): data entry size = 10% size of record – Hash: No overflow buckets. 80%pageoccupancy Filesize=1.25datasize→– Tree: 67% occupancy (this is typical). File size = 1.5 data size → We use: • B: number of data pages • R: number of record per page • D: average time to read/write from disk • C: average time to process a record (e.g., equality check) Extensible hashing Good read on the book: pages... 370-378 4 6 Image by MIT OpenCourseWare.BDBD1.5BDBD (R+0.15)BD (R+0.125)0.5BDDlog2BDlogF1.5BD (1 + logF0.15B)2DBDDlog2B +# matchesDlogF1.5B +# matchesDlogF0.15B +# matchesBDSearch + DSearch + BDSearch + DSearch + 2DSerach + 2D2DSearch + BDSearch + DD (3 + logF0.15B)4DHeapSortedClusteredUnclusteredtree indexUnclusteredhash indexScan Equality Range Insert Delete7 B+trees Hierarchical indices are the most common type used—e.g., B+-Trees indices typically point from key values to records in the heap file. Shall we always have indexes? What are the pros and cons? (keep the index up-to-date, extra space required) Figure 1: B+Tree graphical representation. Special case, is a “clustered” index, i.e., when the order of the tuples on disk correspond to the order in which they are stored in the index. What is it good for? (range selections, scans). 5 Courtesy of Grundprinzip on Wikipedia.32* 16*1* 5* 21* 13*10*15* 7* 19*4* 12* 20*332223000001010011100101110111Bucket ABucket BBucket CBucket DBucket A2('split image'of Bucket A)DirectoryLocal DepthGlobal DepthLocal DepthGlobal Depth32* 16*1* 5* 21* 13*10*15* 7* 19*4* 12* 20*22222200011011Bucket ABucket BBucket CBucket DBucket A2('split image'of Bucket A)DirectoryImage by MIT OpenCourseWare.MIT OpenCourseWare http://ocw.mit.edu 6.830 / 6.814 Database Systems Fall 2010 For information about citing these materials or


View Full Document
Download Lecture Notes
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Lecture Notes and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Lecture Notes 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?