File system funThe medium is the messageMemory vs. DiskSome useful factsThe equation that ruled the world.Files: named bytes on diskWhat’s so hard about grouping blocks???FS vs VMProblem: how to track file’s data?Simple mechanism: contiguous allocationSlide 12Linked filesSlide 14Example: DOS FS (simplified)FAT discussionSlide 17Indexed filesSlide 19Slide 20Multi-level indexed files: ~4.3 BSDUnix discussionMore about inodesExample: (oversimplified) Unix file systemDirectoriesA short history of timeHierarchical UnixNaming magicUnix example: /a/b/c.cDefault context: working directoryCreating synonyms: Hard and soft linksMicro-case study: speeding up a FSA plethora of performance costsProblem 1: Too small block sizeHandling internal fragmentationProb’ 2: Where to allocate data?Clustering related objects in FFSClustering in FFSWhat’s a cylinder group look like?Prob’ 3: Finding space for related objectsUsing a bitmapSo what did we gain?Other hacks?File system funFile systems = the hardest part of OS–More papers on FSes than any other single topicMain tasks of file system:–don’t go away (ever)–associate bytes with name (files)–associate names with each other (directories)–Can implement file systems on disk, over network, in memory, in non-volatile ram (NVRAM), on tape, w/ paper.–We’ll focus on disk and generalize laterToday: files and directories + a bit of speed.Processes, vm, synchronization, fs all around since 60’s or earlier. Clear winProcesses, vm, synchronization, fs all around since 60’s or earlier. Clear winThe medium is the messageDisk = First thing we’ve seen that doesn’t go away–So: Where everything important lives. Failure. Slow (ms access vs ns for memory)Huge (100x bigger than memory)–How to organize large collection of ad hoc information? Taxonomies! (Basically FS = general way to make these)memorycrashProcessor speed: ~2x/yrDisk access time: 7%/yrOptimization = usability. Cache everything: files, directories, names, non-existant namesOptimization = usability. Cache everything: files, directories, names, non-existant namesMemory vs. DiskSmallest write: sectorAtomic write = sector ~10ms–not on a good curve20MB/sNUMACrash?–Contents not gone (“non-volatile”)–Lose? Corrupt? No ok.(usually) bytesbyte, word Random access: nanosecs–faster all the timeSeq access 200-1000MB/sUMACrash?–Contents gone (“volatile”)–Lose + start over = okDisk is just memory. We already know memory. But there are some differences.The big difference: minimum transfer unit, that it doesn’t go away (multiple writes, crash = ?). Note: 100,000 in terms of latency, only 10x in terms of bandwidth.Disk is just memory. We already know memory. But there are some differences.The big difference: minimum transfer unit, that it doesn’t go away (multiple writes, crash = ?). Note: 100,000 in terms of latency, only 10x in terms of bandwidth.Some useful factsDisk reads/writes in terms of sectors, not bytes–read/write single sector or adjacent groupsHow to write a single byte? “Read-modify-write”–read in sector containing the byte–modify that byte–write entire sector back to disk–key: if cached, don’t need to read inSector = unit of atomicity. –sector write done completely, even if crash in middle »(disk saves up enough momentum to complete)–larger atomic units have to be synthesized by OSJust like we built large atomic units from small atomic instructions, we’ll build up large atomic ops based on sector writes.Just like we built large atomic units from small atomic instructions, we’ll build up large atomic ops based on sector writes.Can happen all the time on alphas --- only do word ops, so to write a byte, had to do a read, modify it, and write it out. Means you can now have a cache miss, which happens here too. RMW for assigning to a bit in memory. Means non-atomic.Can happen all the time on alphas --- only do word ops, so to write a byte, had to do a read, modify it, and write it out. Means you can now have a cache miss, which happens here too. RMW for assigning to a bit in memory. Means non-atomic.The equation that ruled the world.Approximate time to get data:So?–Each time touch disk = 10s ms. –Touch 50-100 times = 1 *second*–Can do *billions* of ALU ops in same time.This fact = Huge social impact on OS research–Most pre-2000 research based on speed.–Publishable speedup = ~30%–Easy to get > 30% by removing just a few accesses.–Result: more papers on FSes than any other single topicseek time(ms) + rotational delay(ms) + bytes / disk bandwidthFiles: named bytes on diskFile abstraction:–user’s view: named sequence of bytes –FS’s view: collection of disk blocks–file system’s job: translate name & offset to disk blocks File operations:–create a file, delete a file–read from file, write to fileWant: operations to have as few disk accesses as possible & have minimal space overheadoffset:int disk addr:intint main() { … foo.cThe operations you do on a nounThe operations you do on a nounIn some sense, the problems we will look at are no different than those in virtual memory–like page tables, file system meta data are simply data structures used to construct mappings.–Page table: map virtual page # to physical page #–file meta data: map byte offset to disk block address–directory: map name to disk address or file #What’s so hard about grouping blocks???Page table28 33Unix inode 418 8003121directoryfoo.c44Like usual, we’re going to call the same thing by different names. We’ll be using lists and trees of arrays to track integers, but instead of calling them that or page tables, now meta data. Purpose the same: construct a mapping.Like usual, we’re going to call the same thing by different names. We’ll be using lists and trees of arrays to track integers, but instead of calling them that or page tables, now meta data. Purpose the same: construct a mapping.In some ways problem similar: –want location transparency, oblivious to size, & protectionIn some ways the problem is easier: –CPU time to do FS mappings not a big deal (= no TLB)–Page tables deal with sparse address spaces and random access, files are dense (0 .. filesize-1) & ~sequentialIn some way’s problem is harder:–Each layer of translation = potential disk access–Space a huge
View Full Document