MIT 6 830 - Lecture Notes - D584651

Home> Schools> Massachusetts Institute of Technology> Electrical Engineering and Computer Science (6) > 6 830> Lecture Notes

DOC PREVIEW

MIT 6 830 - Lecture Notes

School name Massachusetts Institute of Technology

Course 6 830- Database Systems

Pages 6

This preview shows page 1-2 out of 6 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 6 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 6 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 6 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

6.830/6.814 — Notes∗ for Lecture 6: Access Methods Carlo A. Curino September 29, 2010 1 Announcements • Problem Set 2 is out today... due in two weeks... be aware that is going to overlap to LAB 2. Projects ideas and rules are posted online.• 2 Readings For this class the suggested readings are: • In Database Management Systems, read: – Pages 273-289. If you are using another book, this is the introduc-tion to the Section on Storage and Indexing which discusses diﬀerent access methods and their relative performance. – Pages 344-358. This is an in-depth discussion of the B+Tree data structure and its implementation. Most database books, as well as any algorithms text (such as CLR or Knuth) will provide an equiva-lent discussion of B+Trees. – (was not assigned, but is an important reading) Pages 370-378 on Static Hashing and Extensible Hashing ”The R*-Tree: An Eﬃcient and Robust Access Method for Points and • Rectangles.” Beckmann et al, in The Red Book. ∗These notes are only meant to be a guide of the topics we touch in class. Future notes are likely to be more terse and schematic, and you are required to read/study the papers and book chapters we mention in class, do homeworks and Labs, etc.. etc.. 13 Recap We presented a broad overview of the DBMS internals. We point out how important coming up with a good plan is. I claimed disk I/O can be very important. Remember I mentioned this last time: CPU cost (# of instructions) 1Ghz == 1 billions instrs/sec 1 nsec / instr RAM access 50ns I/Ocost(#ofpagesread,#ofseeks) 100MB/sec 10nsec/byte (RandomI/O=pageread+seek) 10msec/seek 100seeks/sec 1 seek = 10M instructions!!! Moreover there is a big diﬀerence from sequential and random IO. Example: • Read 10KB from a random location on disk: 10ms + 100µs = 10.1ms; • Read 10KB+10KB from a random location on disk (two blocks are next to each other): 10ms + 200µs = 10.2ms; • Read 10KB+10KB from two random locations on disk: 10ms + 100µs + 10ms +100µs= 20.2ms; WOW! So saving disk I/O, and in particular random ones is VERY impor-tant!! DB are usually well designed to: 1) avoid excessive disk I/O and try to be sequential, and 2) have a LOT of drives available (TPC-H competing machines: 144 cores AND 1296 disks!!) Today we study how to do a good job at reducing Disk I/O by organize stuﬀ intelligently on disk, next lecture we study what we should keep in RAM. 4 Access Methods Today we get into more details on Access Methods, i.e., on the portion of the DBMS in charge of managing the data on disk. We will show a bunch of organization of data, and their performance in supporting typical accesses we need to support queries. Next lecture we will study the Buﬀer Manager, which tries to reduce the access to disk. What are the functionalities we need to support: scan• • search (equality) 2• search (range) insert • delete• Various access methods: • heap ﬁle: unordered, typically implemented as a linked list of pages • sorted ﬁle: ordered records, expensive to maintain • index ﬁle: data + extra structures around to quickly access data – might contain data (primary index) – or point at the data, often stored in a heapﬁle or other index (sec-ondary index) – if the data are sorted in the same order of the ﬁeld is index, we say is a clustered index (we will see this is good for scans since disk accesses are sequential) Type of indexes: hash• B+trees• R*trees• 4.1 Data organization within ﬁle ﬁle organization: • pages • records (record ids: page id, slot id) page layout: • ﬁxed length records page of slots, • free bit map ”slotted page” structure for var length records or slot direc-tory (slot oﬀset, len) What about big records? Hard to place, and might overﬂow on another page. tuple layout (similar story): • ﬁxed length (structure know by the system catalog, can predict where ﬁelds start) 35 • variable length, ﬁeld slots, two options: delimiters or directory with point-ers/oﬀsets What happen when the ﬁeld size changes? Need to move stuﬀ... so if we have a mix of ﬁxed/variable, the ﬁxed ﬁelds are best to be stored ﬁrst. Null values? Good trick is using the pointers/oﬀset.. if two have same value, it means the ﬁeld in between is null... this makes storing nulls 0 space overhead. Cost model (enhanced one, from the book) • Heap Files: Equality selection on key; exactly one match. • Sorted Files: Files compacted after deletions. Indexes: • – Alt (2), (3): data entry size = 10% size of record – Hash: No overﬂow buckets. 80%pageoccupancy Filesize=1.25datasize→– Tree: 67% occupancy (this is typical). File size = 1.5 data size → We use: • B: number of data pages • R: number of record per page • D: average time to read/write from disk • C: average time to process a record (e.g., equality check) Extensible hashing Good read on the book: pages... 370-378 4 6 Image by MIT OpenCourseWare.BDBD1.5BDBD (R+0.15)BD (R+0.125)0.5BDDlog2BDlogF1.5BD (1 + logF0.15B)2DBDDlog2B +# matchesDlogF1.5B +# matchesDlogF0.15B +# matchesBDSearch + DSearch + BDSearch + DSearch + 2DSerach + 2D2DSearch + BDSearch + DD (3 + logF0.15B)4DHeapSortedClusteredUnclusteredtree indexUnclusteredhash indexScan Equality Range Insert Delete7 B+trees Hierarchical indices are the most common type used—e.g., B+-Trees indices typically point from key values to records in the heap ﬁle. Shall we always have indexes? What are the pros and cons? (keep the index up-to-date, extra space required) Figure 1: B+Tree graphical representation. Special case, is a “clustered” index, i.e., when the order of the tuples on disk correspond to the order in which they are stored in the index. What is it good for? (range selections, scans). 5 Courtesy of Grundprinzip on Wikipedia.32* 16*1* 5* 21* 13*10*15* 7* 19*4* 12* 20*332223000001010011100101110111Bucket ABucket BBucket CBucket DBucket A2('split image'of Bucket A)DirectoryLocal DepthGlobal DepthLocal DepthGlobal Depth32* 16*1* 5* 21* 13*10*15* 7* 19*4* 12* 20*22222200011011Bucket ABucket BBucket CBucket DBucket A2('split image'of Bucket A)DirectoryImage by MIT OpenCourseWare.MIT OpenCourseWare http://ocw.mit.edu 6.830 / 6.814 Database Systems Fall 2010 For information about citing these materials or

View Full Document


School:
Email:
New Password:
Confirm Password:

This preview shows page 1-2 out of 6 pages.

MIT 6 830 - Lecture Notes

Sign up for free to view:

Please select your school