Berkeley COMPSCI 186 - File Organizations and Indexing - D1913166

Home> Schools> University of California, Berkeley> Computer Science (COMPSCI) > COMPSCI 186> File Organizations and Indexing

Berkeley COMPSCI 186 - File Organizations and Indexing

School name University of California, Berkeley

Course Compsci 186- Introduction to Database Systems

Pages 4

Download Save

Unformatted text preview:

1File Organizations and IndexingLecture 4R&G Chapter 8"If you don't find it in the index, look very carefully through the entire catalogue." -- Sears, Roebuck, and Co., Consumer's Guide, 1897Review: Memory, Disks, & Buffer Mgt• Everything won’t fit in RAM (usually)• Hierarchy of storage, RAM, disk, tape• “Block” - unit of storage on disk• “Frame” – a block-sized chunk of memory• Allocate space on disk for fast access• Buffer pool management– Frames in RAM to hold blocks– Policy to move blocks between RAM & diskContextQuery Optimizationand ExecutionRelational OperatorsFiles and Access MethodsBuffer ManagementDisk Space ManagementDBFiles of Records• Blocks interface for I/O, but…• Higher levels of DBMS operate on records, and files of records.• FILE: A collection of pages, each containing a collection of records. Must support:– insert/delete/modify record– fetch a particular record (specified using record id)– scan all records (possibly with some conditions on the records to be retrieved)• Note: typically page size = block size = frame size.Record Formats: Fixed Length• Information about field types same for all records in a file; stored in system catalogs.• Finding i’th field done via arithmetic.Base address (B)L1 L2L3 L4F1 F2 F3 F4Address = B+L1+L2Record Formats: Variable Length• Two alternative formats (# fields is fixed): Second offers direct access to i’th field, efficient storage of nulls(special don’t know value); small directory overhead. $$$$Fields Delimited by Special SymbolsF1 F2 F3 F4F1 F2 F3 F4Array of Field Offsets2Page Formats: Fixed Length RecordsRecord id = <page id, slot #>. In first alternative, moving records for free space management changes rid; may not be acceptable.Slot 1Slot 2Slot N. . .. . .NM10. . .M ... 3 2 1PACKEDUNPACKED, BITMAPSlot 1Slot 2Slot NFreeSpaceSlot M11number of recordsnumberof slots“Slotted Page” for Variable Length Records•Record id = <page id, slot #>• Can move records on page without changing rid; so, attractive for fixed-length records too.• Page is full when data space and slot array meet.Page iRid = (i,N)Rid = (i,2)Rid = (i,1)Pointerto startof freespaceSLOT DIRECTORYN . . . 2 12016 24N# slotsSlot ArrayData• Q: What if a record grows too big to fit in the page???System Catalogs• For each relation:– name, file location, file structure (e.g., Heap file)– attribute name and type, for each attribute– index name, for each index– integrity constraints• For each index:– structure (e.g., B+ tree) and search key fields• For each view:– view name and definition• Plus statistics, authorization, buffer pool size, etc. Catalogs are themselves stored as relations!Attr_Cat(attr_name, rel_name, type, position) attr_name rel_name type position attr_name Attribute_Cat string 1 rel_name Attribute_Cat string 2 type Attribute_Cat string 3 position Attribute_Cat integer 4 sid Students string 1 name Students string 2 login Students string 3 age Students integer 4 gpa Students real 5 fid Faculty string 1 fname Faculty string 2 sal Faculty real 3 Alternative File OrganizationsMany alternatives exist, each good for some situations, and not so good in others:–Heap files: Suitable when typical access is a file scan retrieving all records.– Sorted Files:Best for retrieval in search keyorder, or only a `range’ of records is needed.– Clustered Files (with Indexes):A compromise between the above two extremes.Unordered (Heap) Files• Simplest file structure contains records in no particular order.• As file grows and shrinks, disk pages are allocated and de-allocated.• To support record level operations, we must:– keep track of the pagesin a file– keep track of free spaceon pages– keep track of the recordson a page• There are many alternatives for keeping track of this.– We’ll consider 23Heap File Implemented as a List • The header page id and Heap file name must be stored someplace.– Database “catalog”• Each page contains 2 `pointers’ plus data.HeaderPageDataPageDataPageDataPageDataPageDataPageDataPagePages withFree SpaceFull PagesHeap File Using a Page Directory• The entry for a page can include the number of free bytes on the page.• The directory is a collection of pages; linked list implementation is just one alternative.–Much smaller than linked list of all HF pages!DataPage 1DataPage 2DataPage NHeaderPageDIRECTORY• Q: How to find a particular record in a Heap file???Cost Model for AnalysisWe ignore CPU costs, for simplicity:– B: The number of data blocks– R: Number of records per block– D: (Average) time to read or write disk block• Measuring number of block I/O’s ignores gains of pre-fetching and sequential access; thus, even I/O cost is only loosely approximated. • Average-case analysis; based on several simplistic assumptions.– Often called a “back of the envelope”calculation. Good enough to show the overall trends!Some Assumptions in the Analysis• Single record insert and delete.• Equality selection - exactly one match (what if more or less???).• Heap Files:– Insert always appends to end of file.– Delete just leaves free space in the page.• Sorted Files:– Files compacted after deletions.– Selections on search key.Cost of OperationsB: The number of data pagesR: Number of records per pageD: (Average) time to read or write disk pageDeleteInsertRange SearchEquality Search(unique key)Scan all recordsClustered FileSorted FileHeap FileBD0.5 BDBD2D(0.5B+1)DSorted Files• Q: When do Heap files perform well? When don’t they?• Heap files are lazy on update - you end up paying on searches.• Sorted fileseagerly maintain the file on update.– The opposite choice in the trade-off• Let’s consider an extreme version– No gaps allowed, pages fully packed always– Q: How might you relax these assumptions?4Cost of OperationsB: The number of data pagesR: Number of records per pageD: (Average) time to read or write disk pageDeleteInsertRange SearchEquality Search(unique key)Scan all recordsClustered FileSorted FileHeap FileBD(log2B) * D[(log2B) +#match pg]*D((log2B)+B)D(because rd,w0.5 File)Same cost as InsertBD0.5 BDBD2D(0.5B+1)DIndexes: avoiding the extremes• Hash files are great for lots of updates and scans.• Sorted files are great for

View Full Document


School:
Email:
New Password:
Confirm Password:

Berkeley COMPSCI 186 - File Organizations and Indexing

Sign up for free to view:

Please select your school