Keeping Digital Documents Usable

This preview shows page 1-2-19-20 out of 20 pages.


Keeping Digital Documents Usable: Managing Data Formats
What I’ll be talking about
The problem of electronic document preservation
Data format mismatch
Standards are a partial solution
Technologies for usefully preserving unfamiliar formats
An archived document
Its raw electronic form
TOM Basic Data Concepts
Example: A simple mail message as a TOM object
Part of the TOM type hierarchy
Substitutable types?
What can you do with an object in an unfamiliar format?
The architecture supporting TOM (simplified)
TOM in action
Brokers enable smart conversions
Respectful conversions?
Choices in converting our mail message
What’s good about TOM’s design?
Sharing the work: Key for successful digital libraries, archives

Keeping Digital Documents Usable: Managing Data Formats
John Mark Ockerbloom
Carnegie Mellon University
April 27, 1999

What I’ll be talking about
• A data model and architecture that supports definition, use, and conversion of an arbitrarily large number of data formats
• How this model helps digital libraries and archives keep electronic documents accessible and usable over the long term

The problem of electronic document preservation
• We can digitize lots of information, but it can become inaccessible very quickly
  – (150 year old book vs. 5 year old 5 1/4” floppy)
• Electronic preservation problems differ sharply from print preservation problems
• Preserving the bits is easy: just replicate them
  – (Remove the hardware dependency first if you can)
  – Internet allows very wide replication, avoiding single-archive failures
• The problem is understanding the bits, so that you can continue to use them...

Data format mismatch
• In a large, diverse, digital archive, information comes in a variety of formats
• Most clients only understand a few formats
• They therefore cannot effectively use many materials
  – data may be in incomprehensible form
  – data may be in form not easily worked with
• Particularly problematic:
  – formats that have complex (but useful) structure
  – legacy data and programs (obsolete format assumptions)
• In a long-lived library, most information IS “legacy”

Standards are a partial solution
• Standards allow common understandings…
  – Data: SGML/XML, Word processor formats, HTML, PDF, Quark, specialized scientific formats, page image formats….
  – Metadata: USMARC, Dublin Core, RDF...
• …But no one standard fits all
  – different uses may require different data choices
  – “lowest common denominator” often not good enough
• And standards change over time
  – needs and applications change (sometimes quickly)
  – standardization process lags
  – even established standards become obsolete
    » Who supports EBCDIC now?
    » Who will support 1999 standards in 2049?

Technologies for usefully preserving unfamiliar formats
• Emulation
  – don’t change the data; maintain programs to deal with it
  – Essentially data abstraction, since the “emulation” just needs to provide same functionality, and may be implemented very differently from original
  – But: May be costly to maintain infrastructure; may unnecessarily lock user into old interaction styles
• Migration
  – Periodically convert data to more “up-to-date” formats; then use your everyday programs on it
  – But: How do you control information loss?

An archived document

Its raw electronic form

    From: Sherry T Haddock <[email protected]>
    To: [email protected]
    Subject: CAETI Community Meeting Info
    Date: Thu, 15 Feb 1996 17:12:52 -0600 (CST)
    Mime-Version: 1.0
    Content-Type: MULTIPART/MIXED; BOUNDARY="608184028-521714262-824425972=:20798"
    Cc: Sherry T Haddock <[email protected]>

      This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. Send mail to [email protected] for more info.

    --608184028-521714262-824425972=:20798
    Content-Type: TEXT/PLAIN; charset=US-ASCII

    Here are maps detailing the March CAETI Community Meeting Location. ...

    Thanks again, Sherry <[email protected]>

    --608184028-521714262-824425972=:20798
    Content-Type: TEXT/PLAIN; charset=US-ASCII; name="CaetiMap.hqx"
    Content-Transfer-Encoding: BASE64
    Content-ID: <[email protected]>
    Content-Description:

    KFRoaXMgZmlsZSBtdXN0IGJlIGNvbnZlcnRlZCB3aXRoIEJpbkhleCA0LjApDQoNCjojODBLQ0E0VCklZUtGISI2NiUzYzgmIjgtYCMzIiQpIU4hLSlEMyVtZC1tNGkrJ2EnWiUhTiIhbCEhLSFyW20NCg0KKiEhQiFOIVgiISohJCEzIzMjIiEhISFtIU4hLSIhKiEkcltxMyFgIzMjMnEzcnJxM1hJaHJOIS1BISohJCFgIw0KDQozIWAzIU4hLSYhKiEkIkojMyFgRiFOIS0pISohJCMzIzMhYFMhTiEtLCEqISQkISMzIWBkIU4hLTEhKiEkcltxDQoNCjMhcmxyTiEtNCEqISQlSiMzIWEtIU4hLTghKiEkJjMjMyFhQiFOITJxcmohJHJbcTNycnEzVCEiNSEqIXE

(Emailed, MIME-attached, base64, binhexed, Powerpoint 3)

TOM Basic Data Concepts
• An object is a (non-mutable) typed value
  – objects can come from anywhere, be passed about (cf. MIME)
• A type specifies abstractly what can be done with an object
  – includes attributes, methods, with slots for specs...
• An encoding is a mapping from one type to another type that represents the original type
  – usually maps to a simpler type
  – “lowest-level” type: byte sequence
• A format is a representation of a type as a byte sequence.
  – As a type plus a sequence of encodings
  – Conversions can be defined between formats

Example: A simple mail message as a TOM object
• Value is the content of the message
• Type is simple “mail message” type
  – attributes: sender, recipients, header, body...
  – methods: get_attachment (num)...
• This “mail message” type has encodings:
  – “standard” encoding is RFC822 encoding (as ASCII byte sequence)
• Format is just a type with encodings:
  – “mail message” type in “standard” encoding
  – could also have further encodings (e.g. mail header)
• TOM ships objects around with format tags

Part of the TOM type hierarchy
[Diagram; node labels: Object, Package, Reference, URL, Powerpoint, Mail message, Binhex, Communication]
Subtypes are substitutable for supertypes (cf. Liskov & Wing)

Substitutable types?
• What are they?
  – In a “substitutable” subtyping model, objects of type T behave exactly like objects of T’s supertypes when used through supertype interfaces
  – Conceptually, there’s an “abstraction mapping” where each object in a subtype has a corresponding object in the supertype that can “substitute” for it
• Why are they important?
  – They allow unfamiliar types to be used through familiar supertype interfaces, with information and behavior guaranteed to be consistent with the supertype.
  – (Most other OO systems
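
As a rough illustration of the TOM concepts in the slides above (types with supertypes, encodings, and a format as a type plus a sequence of encodings), here is a minimal Python sketch. The class names TomType, Encoding, and Format, their methods, the tag syntax, and the parent link from “mail message” directly to “object” are all assumptions made for illustration; the slides do not show TOM's actual interfaces, and the real hierarchy diagram did not survive extraction.

```python
import base64
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple


@dataclass(frozen=True)
class TomType:
    """Hypothetical stand-in for a TOM type; `parent` is its supertype."""
    name: str
    parent: Optional["TomType"] = None

    def is_substitutable_for(self, other: "TomType") -> bool:
        # Walk up the supertype chain; a subtype can stand in for any ancestor.
        t: Optional[TomType] = self
        while t is not None:
            if t == other:
                return True
            t = t.parent
        return False


@dataclass(frozen=True)
class Encoding:
    """Hypothetical encoding: maps a value to a simpler representation and back."""
    name: str
    encode: Callable[[Any], Any]
    decode: Callable[[Any], Any]


@dataclass(frozen=True)
class Format:
    """A type plus a sequence of encodings that ends in a byte sequence."""
    tom_type: TomType
    encodings: Tuple[Encoding, ...]

    def tag(self) -> str:
        # Assumed tag syntax: type name followed by encoding names.
        return "/".join([self.tom_type.name] + [e.name for e in self.encodings])

    def to_bytes(self, value: Any) -> bytes:
        # Apply the encodings in order, from the typed value down to bytes.
        for enc in self.encodings:
            value = enc.encode(value)
        return value

    def from_bytes(self, data: bytes) -> Any:
        # Undo the encodings in reverse order to recover the typed value.
        for enc in reversed(self.encodings):
            data = enc.decode(data)
        return data


# Assumed two-level hierarchy, for illustration only.
OBJECT = TomType("object")
MAIL_MESSAGE = TomType("mail message", parent=OBJECT)

ASCII = Encoding("ascii", lambda s: s.encode("ascii"), lambda b: b.decode("ascii"))
BASE64 = Encoding("base64", base64.b64encode, base64.b64decode)

fmt = Format(MAIL_MESSAGE, (ASCII, BASE64))
message = "Subject: hello\n\nbody"
raw = fmt.to_bytes(message)

print(fmt.tag())                                  # mail message/ascii/base64
print(fmt.from_bytes(raw) == message)             # True
print(MAIL_MESSAGE.is_substitutable_for(OBJECT))  # True
```

A client that only knows the “object” supertype could still accept this value, which is the point the slides make about substitutable subtypes; stacking further Encoding entries would model a chain like the attached file's (MIME-attached, base64, binhexed).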

