Transparent Format Migration of Preserved Web Content
D. S. H. Rosenthal, T. Lipkis, T. S. Robertson, S. Morabito
D-Lib Magazine, 11(1), 2005
http://www.dlib.org/dlib/january05/rosenthal/01rosenthal.html
Slides by Frank McCown
Old Dominion University
March 17, 2005

Format Migration
- What is it?
  - Conversion of older DO format to current format
- What other major digital preservation strategy could be used?
  - Emulation
    - Original DO format is preserved and presented to the user
- When should a DO be migrated to a new format?
  - Format change does not imply obsolescence

Format Obsolescence of Web Content
- Web format is obsolete when widely used browsers can no longer present the content
- Backwards compatibility of browsers a must
  - HTML 4 vs. XHTML
- Old Web formats die slowly
  - How many can you think of?
- Emulation is difficult to implement
  - Find older browser, original plug-in, etc.

Migration of Obsolete Formats
- Three migration points
  - Migration on ingest
    - Convert all incoming objects into selected format before preserving
  - Batch migration
    - Convert all preserved objects into new format when preserved format is perceived to be obsolete
  - Migration on access
    - Convert preserved object into new format on-the-fly when requested by a user

Migration Issues
- Keep original format in case conversion tool is later found to have a bug or lost vital info when converting
- Conversion tool should be preserved to document original format and in case bug is found in tool
- Choose migration format wisely – it can significantly reduce the need and cost for future migrations

The LOCKSS System
- LOCKSS [1] - Lots Of Copies Keep Stuff Safe™
- Developed at Stanford University
- Open source, P2P software
- Used by libraries to ensure web accessible content (e-journals and open access material) remains available at all times
- Each peer collects material to preserve by crawling publisher’s web site
- Peers continually perform content consistency checks and repair content when needed
- Preserved material is transparently presented to user if publisher’s copy is not available (using web proxy)
- Currently used by 80 libraries worldwide
[1] http://lockss.stanford.edu

LOCKSS Format Migration
- Plug-in format converter registers input/output MIME types
  - IANA MIME types - http://www.iana.org/assignments/media-types/
- LOCKSS web proxy uses plug-in converters to perform on-the-fly conversion of obsolete formats (migration on access)
- Converters are preserved along with web content among peers

Proof of Concept
- Convert “obsolete” GIF images to PNG
- Proxy Web server prevents MIME type image/gif from matching any Accept: header
- Mismatch prompts conversion so content is delivered using the original URL but with Mime-Type=image/png.
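
The proxy behaviour described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the LOCKSS implementation: `ConverterRegistry`, `parse_accept`, `quality`, `serve`, and the placeholder `fake_gif_to_png` are hypothetical names, and the sketch treats q=0 in an Accept: header as "do not send this type". It shows converters registering input/output MIME types and the proxy converting only when the stored type fails to match the request.

```python
from typing import Callable, Dict, List, Optional, Tuple

Converter = Callable[[bytes], bytes]


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Parse an Accept: header into (media_range, q) pairs; q defaults to 1.0."""
    ranges = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        if not fields[0]:
            continue
        q = 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                q = float(value)
        ranges.append((fields[0].lower(), q))
    return ranges


def quality(mime: str, accept: List[Tuple[str, float]]) -> float:
    """Return the q value of the most specific media range that matches mime."""
    main_type = mime.split("/")[0]
    best_specificity, best_q = -1, 0.0
    for media_range, q in accept:
        if media_range == mime:
            specificity = 2          # exact type, e.g. image/gif
        elif media_range == main_type + "/*":
            specificity = 1          # whole family, e.g. image/*
        elif media_range == "*/*":
            specificity = 0          # anything
        else:
            continue
        if specificity > best_specificity:
            best_specificity, best_q = specificity, q
    return best_q


class ConverterRegistry:
    """Hypothetical registry: converters declare input and output MIME types."""

    def __init__(self) -> None:
        self._by_input: Dict[str, List[Tuple[str, Converter]]] = {}

    def register(self, input_mime: str, output_mime: str, convert: Converter) -> None:
        self._by_input.setdefault(input_mime, []).append((output_mime, convert))

    def converters_for(self, input_mime: str) -> List[Tuple[str, Converter]]:
        return self._by_input.get(input_mime, [])


def serve(stored_mime: str, body: bytes, accept_header: str,
          registry: ConverterRegistry) -> Optional[Tuple[str, bytes]]:
    """Return (content_type, body), converting on access only on a mismatch."""
    accept = parse_accept(accept_header)
    if quality(stored_mime, accept) > 0:
        return stored_mime, body     # stored format is still acceptable
    candidates = [(quality(out, accept), out, conv)
                  for out, conv in registry.converters_for(stored_mime)]
    candidates = [c for c in candidates if c[0] > 0]
    if not candidates:
        return None                  # nothing acceptable to send
    _, out_mime, convert = max(candidates, key=lambda c: c[0])
    return out_mime, convert(body)


def fake_gif_to_png(data: bytes) -> bytes:
    # Placeholder only: tags the bytes instead of re-encoding the image.
    return b"PNG:" + data


registry = ConverterRegistry()
registry.register("image/gif", "image/png", fake_gif_to_png)
result = serve("image/gif", b"GIF89a", "*/*;q=0.1,image/gif;q=0", registry)
```

A real converter would re-encode the image; the placeholder only marks the bytes so the control flow is easy to follow. Keeping the registry keyed by input MIME type mirrors the idea that each plug-in declares what it consumes and what it produces.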
Proof of Concept images: Fig 1 and 2 at http://www.dlib.org/dlib/january05/rosenthal/01rosenthal.html

HTTP Format Negotiation
- Browser can tell a web server a format is obsolete by telling it not to send that format
- HTTP/1.1 [1] defines how web servers and client browsers negotiate the format, language, and encoding of web content
- Browser sends request using Accept: header listing acceptable MIME types of content format
[1] http://www.w3.org/Protocols/rfc2616/rfc2616.html

Format Negotiation Examples
- Accept: text/plain;q=0.5, text/xml;q=0.8, text/html
  - “I prefer text/html first, text/xml second, and finally text/plain.”
- */*;q=0.1
  - “If you can’t give me what I want, give me what you have.”
- image/*, image/gif;q=0
  - “Send me any kind of image except GIFs.”
- NOTE: q=0 semantics are not actually defined in HTTP/1.1

Format Negotiation Illustration
[Diagram: Browser, LOCKSS Proxy, Web Server]
- HTTP Request: Accept: */*;q=0.1,image/gif;q=0
  - “I’ll take whatever you have except obsolete GIF images.”
- GIF -> GIF to PNG Converter -> PNG
  - “All I have are GIFs. I’ll convert them to a format the browser can handle.”
- HTTP Response: Content-Type: image/png

Future Work for LOCKSS
- Replace proof-of-concept implementation with complete implementation with API for plug-in converters
- Use a format migration service like TOM
- Use JHOVE format metadata extraction and validation technology to improve the quality of format metadata

TOM (Typed Object Model)
- Came from John Ockerbloom’s Ph.D. thesis at Carnegie Mellon [1]
- Currently managed by developers at Univ of Pennsylvania Library led by Ockerbloom
- Addresses the problem of increasingly new and obsolete data formats that makes using digital information problematic
- TOM makes it possible to
  - Explain a data format
  - Interpret the format for proper data extraction
  - Convert the format into other formats
[1] http://tom.library.upenn.edu/pubs/thesis/

TOM
- Two components
  - Data Model that describes data formats and operations that can be performed on them
  - Networked software that supports the description and operations of the data formats
Figure from http://tom.library.upenn.edu/intro.html

TOM Applications
- TOM example broker
  - http://tom.library.upenn.edu/cgi-bin/typebrowse/showtype?broker=tom%2elibrary%2eupenn%2eedu&
- TOM Conversion Service
  - http://tom.library.upenn.edu/convert/
  - Could be used by LOCKSS for format migration
- Fred (Format Registry Demonstration)
  - http://tom.library.upenn.edu/fred/

JHOVE
- JSTOR/Harvard Object Validation Environment [1]
- Provides functions to perform format-specific identification, validation, and characterization of digital objects
  - Identification
    - What is the format of my digital object?
  - Validation
    - Is my digital object really of type X?
  - Characterization
    - What are the significant properties of my digital object of type X?
- GIF example
  - http://hul.harvard.edu/jhove/gif-hul.html
[1] http://hul.harvard.edu/jhove/

JHOVE Use in Repository
Figure from http://hul.harvard.edu/jhove/
Submission Information Package (SIP) - OAIS

JHOVE and LOCKSS
- JHOVE generates reliable format metadata
- LOCKSS can use JHOVE to extract quality metadata about the contents of its repository
- What if object to store is not valid?
- It may be easier to write a conversion tool using JHOVE to supply format metadata

Conclusion
- Goal is to ensure obsolete formats will not make current LOCKSS content