UW CSE 444 - File Storage and Indexing - D403088

Home> Schools> University of Washington> Computer Science and Engineering (CSE) > CSE 444> File Storage and Indexing

DOC PREVIEW

UW CSE 444 - File Storage and Indexing

School name University of Washington

Course Cse 444- Database Internals

Pages 25

This preview shows page 1-2-24-25 out of 25 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 25 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 25 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 25 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 25 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 25 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

File Storage and IndexingDBs Reside on DisksReview: Disk OrganizationDisks and OSProgram ViewpointCommon Modes of Record ProcessingNaïve Relational ViewWhat's the Big Deal? (actual numbers will vary)Fact of LifeMemory vs Disk Data StructuresAvoiding SeeksA Smart File Organization...Review: HashingPitfalls of HashingIndexSome Indexing TerminologyIndexing PitfallsRequirements on Multilevel IndexesB-TreeB-Tree OverviewB-Tree ConceptsB-Tree Growth and ChangeB+ TreeVariationsOther Forms of Indexing12/1/97 Q-1© 1997 UW CSEFile Storage and IndexingChapters 4 and 512/1/97 Q-2DBs Reside on Disks•Main reasons: (main) memory is too small and too volatile•Understanding disk operation and performance is important• Data structures and algorithms for disk are very different from those appropriate for memory•The single most important fact about disks, relative to memory, is… (we shall see)12/1/97 Q-3Review: Disk Organization•Spinning magnetic surfaces, stacked•R/W head per surface•Track: circular path–Cylinder: vertical stack of tracks•Sectors: fixed length physical blocks along track–separated by gaps–overhead info (ID, error-correction, etc.)–each sector has a hardware address12/1/97 Q-4Disks and OS•OS manages disks•Files are allocated in blocks (groups of sectors)–units may be scattered on disk–OS makes file look contiguous–no simple map between logical records (program/file view) and physical placement12/1/97 Q-5Program Viewpoint•Files are collections of logical "records"–records are commonly (but not always) fixed in size and format–fields in records are commonly (but not always) fixed in size and format•Sequential access: program gets data one record at a time•Random access: program can ask for a record by its relative block # (not disk address!)12/1/97 Q-6Common Modes of Record Processing•Single record (random)…WHERE SSN=345667899•Multiple records (no special order)SELECT * FROM DEPT...•Multiple records (in order)…ORDER BY SSNAn efficient storage scheme should support all of these modes12/1/97 Q-7Naïve Relational View•Each table is a separate file–All rows in a table are of fixed size and format–Rows are stored in order by primary key12/2/97 Q-8What's the Big Deal?(actual numbers will vary)•CPU speed: 100-500MHz•Memory access speed: (100ns)how many CPU cycles is this?•Disk speed has three components–Seek time: to position the W/R head (10ms)how many memory cycles is this?–Latency: rotational speed (3600rpm)–Transfer time: movement of data to/from memory (5Mb/sec)12/1/97 Q-9Fact of LifeSeek time outweighs all other time factorsImplication: Martin's three rules for efficient file access:–1. Avoid seeks–2. Avoid seeks–3. Avoid seeks•Corollary: Any technique that takes more than one seek to locate a record is unsatisfactory.12/1/97 Q-10Memory vs Disk Data Structures•Example: Binary Search of a sorted array–Needs O(Log2N) operations•How many passes if 1,000,000 integers in memory?–The answer (20) is acceptable•How many seeks if 1,000,000 records on disk? –The answer (20) in unacceptable –And how did they get sorted (NlogN at best)?12/1/97 Q-11Avoiding Seeks•Overlapping disk operations–double buffering•Disk cache–locality principle: recently used pages are likely to be needed again–maintained by OS or DBMS•Sequential access–Seek needed only for first record•Smart file organizations12/1/97 Q-12A Smart File Organization...•One which needs very few seeks to locate any record•Hashing: go directly to the desired record, based on a simple calculation•Indexing: go directly to the desired record, based on a simple look-up12/1/97 Q-13Review: Hashing•Main idea: address of record is calculated from some value in the record–usually based on primary key•Zillions of possible hash functions–Hard to find a good hash function12/1/97 Q-14Pitfalls of Hashing•Conflicts: more than one record hashing to the same address.•schemes to overcome conflicts: overflow areas; chained records, etc.•all involve more disk I/O•Wasted space•No efficient access for other than primary key•No efficient way for sequential, especially sorted traversal12/1/97 Q-15Index•Main idea: A separate data structure used to locate records•Frequently, index is a list of key/address pairs•If index is small, a copy can be maintained in memory!–Permanent disk copy is still needed•Many, many flavors of index organization12/1/97 Q-16Some Indexing Terminology•Index field (key)•Primary index•Secondary index•Dense index•Non-dense (sparse) index•Clustering index12/1/97 Q-17Indexing Pitfalls•Each index takes space•Index file itself must permit efficient reorganization•Large indices won't fit in memory–May require multiple seeks to locate record entry•Solution to some problems: Multilevel indexing–each level is an index to the next level down12/1/97 Q-18Requirements on Multilevel Indexes•Must have low height•Must be efficiently updatable•Must be storage-efficient•Top level(s) should fit in memory•Should support efficient sequential access, if possible12/1/97 Q-19B-Tree•B-Tree is a type of multilevel index–from another standpoint: it's a type of balanced tree•Invented in 1972 by Boeing engineers R. Bayer and E. McCreight•By 1979: "the standard organization for indexes in a database system" (Comer)12/1/97 Q-20B-Tree Overview•Assume for now that keys are fixed-length and uniqueA B-tree can be thought of as a generalized binary search tree–multiple branches rather than just L or R•Some wasted space in the nodes is tolerated•Trees are always perfectly balanced12/1/97 Q-21B-Tree Concepts•Each node contains –tree (node) pointers, and–key values (and record pointers)•Order p means (up to) p tree pointers, (up to) p-1 keys•Given a key K and the two node pointers L and R around it–All key values pointed to by L are < K–All key values pointed to by R are > K12/1/97 Q-22B-Tree Growth and ChangeWhen a node is full, it splits.–middle value is propagated upward–two new nodes are at same level as original node•Height of tree increases only when the root splits•Recommended: split only on the way down•On deletion: two adjacent nodes recombine if both are < half full12/1/97 Q-23B+ Tree•Like B-Tree, except–record pointers are only in the leaves•left-side pointer has keys <= rather than <–leaf nodes

View Full Document


School:
Email:
New Password:
Confirm Password:

This preview shows page 1-2-24-25 out of 25 pages.

UW CSE 444 - File Storage and Indexing

Sign up for free to view:

Please select your school