Metadata for the Web: Issues and Simple Answers
CS 502, 20020221
Carl Lagoze, Cornell University
Cornell CS 502

Metadata is data about data

Some untested hypotheses
- Metadata is useful for
  - People
  - Machines
- More metadata is better
- Semi-automated digital libraries and simple metadata

Some known facts
- Number and variety of metadata vocabularies will continue to increase
- The Tower of Babel is a franchise
- There is not one common view of reality
- The one thing I know about metadata is that it is expensive

Are metadata and data distinguishable?
- Objectivity
- Intellectual property
- Structure
- Aboutness

The fiction of classification
"...there is no classification of the universe that is not fictional and conjectural" (Jorge Luis Borges)

Lenses and Views
- All classification does (and should) provide a biased lens or view of reality
- Each view emphasizes certain characteristics and hides others
[Figure labels: Geospatial, Rights, Museum]

Reality is Complex
[Figure labels: Created by Leonardo da Vinci, Created on 1506; Relationship; Created by George Castaldo, Created on 1994]

Objects are Related: IFLA Entity Model

Entities, Events and Agents
[Figure labels: Photographer, Camera type, Software, Computer artist]

Haven't we done metadata already?

What's wrong with this model?
- Expensive
- Complex, even for its original goal
- Professional intervention assumes single community of expertise
- Monolithic
- One-size-fits-all approach
- Reflects its centralized system origins
- Bias towards physical artifacts
- Fixed resources
- Incomplete handling of resource evolution and other resource relationships
- Anglo-centric

Web Challenge to Traditional Cataloging
- Scale
- Permanence
- Authenticity
- Organizational Context
- Custodial Control
- Variety

Internet Commons includes Multiple Communities
[Figure labels: Home Pages, Scientific Data, Commerce, Geo, Library, Internet Commons, Museums, Whatever]

Realities of Web search and discovery
- Search systems are motivated by advertising
- Index coverage is unpredictable and limited
- Too much recall, too little precision
- Index spam abounds
- Resources and their names are volatile

Metadata: Part of a Solution
- Structured data about data
  - helps to impose order on chaos
  - enables automated discovery and manipulation
- Variety across various dimensions
  - specialization
  - decentralization
  - democratization

Web Metadata Issues: Description vs. Discovery
- Library cataloging motivated by describing resources
- Fuzzy search buckets
  - Separate books about Sigmund Freud versus books by Sigmund Freud into different buckets
  - But different types of data appropriate for different buckets: URLs, date strings, word strings, names
- But general fuzzy categories may not be sufficient for describing resources

Web Metadata Models: Drill-Down Searching Paradigm
- Moving along a specificity spectrum
- Inter-domain vs. intra-domain
  - terms
  - models
  - query mechanisms
- One size doesn't fit all
- Cognitive models of searching and browsing

Metadata Takes Many Forms
- resource discovery
- document administration
- rights management
- content rating
- security and authentication
- archival status
- products and services
- database schemas
- process control or description

Metadata: Part of the problem
[Figure: chart with axes "cost" and "functionality"; labeled points AACR2/MARC, Google, Dublin Core]

Metadata Challenges
- Accommodate multiple varieties of metadata
  - community-specific functionality
  - creation
  - administration
  - access
- Tensions
  - functionality and simplicity
  - extensibility and interoperability
  - human and machine creation and use

Interoperability has many facets
- Semantics: meaning (classification, ontology)
- Models: structure (entities and relationships)
- Syntax: grammars to convey semantics and structure

Warwick Framework: Containing Chaos
- Conceptual architecture for metadata from the Warwick Metadata Workshop (DC-2)
- Conceptual architecture to support the specification, collection, encoding, and exchange of modular metadata
- Provide context for metadata efforts (including Dublin Core)
  - avoids the black hole of comprehensive element sets
  - focuses interoperability issues at package level

Metadata Container
[Figure: a Container holding Package (Dublin Core), Package (MARC record), and Package (Indirect Reference); a URI; Package (Terms and Conditions)]

Modularization Allows Distributed Management
Communities of expertise (not software vendors) are responsible for:
- Semantics
- Registration
- Administration
- Access management
- Authority of data
- Sharing and Distribution

Modularization: Implementation Issues
- Data encoding
- Semantic interaction of overlapping sets
  - between semantically related packages
  - between semantically distinct packages
- Type registry

Dublin Core Metadata Initiative
- A simple set of properties to support resource discovery on the web (fuzzy search buckets)
- A cross-domain switchboard for interoperable metadata
- An extensible ontology for resource description
http://dublincore.org

The fifteen Dublin Core Elements
Creator, Title, Subject, Contributor, Date, Description, Publisher, Type, Format, Coverage, Rights, Relation, Source, Language, Identifier
http://www.dublincore.org/documents/1999/07/02/dces/

A Pidgin for Digital Tourists
- Metadata is language
- Dublin Core is a small and simple language, a pidgin, for finding resources across domains
- Speakers of different languages naturally pidginize to communicate
  - E.g., tourists using simple phrases to order beer: "zwei Bier bitte", "dva pivo", "biru o san bai"
- We are all tourists on the global Internet

A Grammar of Dublin Core
http://www.dlib.org/dlib/october00/baker/10baker.html
- By design not as subtle as mother tongues, but easy to learn and extremely useful in practice
- Pidgins: small vocabularies
  - Dublin Core: fifteen special nouns and lots of optional adjectives
- Simple grammars: sentences (statements) follow a simple fixed pattern

Example Dublin Core statements
- Resource has Title "Grammar of Dublin Core"
- Resource has Creator "Tom Baker"
- Resource has Subject "Metadata"
- Resource has Relation "http://foo.org/file.htm"

Pattern: Resource has property X
- Resource: implied subject
- has: implied verb
- property: one of 15 properties (DC:Creator, DC:Title, DC:Subject, DC:Date)
- X: property value (an appropriate literal)
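
The "Resource has property X" pattern can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not an official Dublin Core encoding: the names Statement, DC_ELEMENTS, and describe are hypothetical, and the property values are the ones from the example statements in the slides.

```python
from dataclasses import dataclass

# The fifteen Dublin Core elements as listed in the slides.
DC_ELEMENTS = frozenset({
    "Creator", "Title", "Subject", "Contributor", "Date",
    "Description", "Publisher", "Type", "Format", "Coverage",
    "Rights", "Relation", "Source", "Language", "Identifier",
})


@dataclass(frozen=True)
class Statement:
    """One 'Resource has <property> <value>' statement (hypothetical helper)."""
    prop: str    # must be one of the fifteen elements
    value: str   # the property value, kept as a plain literal

    def __post_init__(self):
        # Reject anything outside the fifteen-element vocabulary.
        if self.prop not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {self.prop!r}")

    def sentence(self) -> str:
        # Spell out the implied subject ("Resource") and implied verb ("has").
        return f'Resource has {self.prop} "{self.value}"'


def describe(pairs):
    """Build a list of statements from (property, value) pairs."""
    return [Statement(prop, value) for prop, value in pairs]


# The four example statements from the slides.
record = describe([
    ("Title", "Grammar of Dublin Core"),
    ("Creator", "Tom Baker"),
    ("Subject", "Metadata"),
    ("Relation", "http://foo.org/file.htm"),
])

for statement in record:
    print(statement.sentence())
```

Because the subject and verb are fixed, each statement reduces to a (property, value) pair, which is what keeps the pidgin small. The sketch leaves out the "optional adjectives" mentioned in the grammar slide.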