Describing and Manipulating XML Data

Sudarshan S. Chawathe
Department of Computer Science
University of Maryland
College Park, MD 20904
chaw@cs.umd.edu

Abstract

This paper presents a brief overview of data management using the Extensible Markup Language (XML). It presents the basics of XML and the DTDs used to constrain XML data, and describes metadata management using RDF. It also discusses how XML data is queried, referenced, and transformed using the stylesheet language XSLT and the referencing mechanisms XPath and XPointer.

1 Describing XML Data

The Extensible Markup Language, XML [BPSM98], models data as a tree of elements that contain character data and have attributes composed of name-value pairs. For example, here is an XML representation of catalog information for a book:

    <book>
      <title>The spy who came in from the cold</title>
      <author>John <lastname>Le Carre</lastname></author>
      <price currency="USD">5.59</price>
      <review><author>Ben</author>Perhaps one of the finest...</review>
      <review><author>Jerry</author>An intriguing tale of...</review>
      <bestseller authority="NY Times"/>
    </book>

Text delimited by angle brackets (<...>) is markup, while the rest is character data. Here and in the rest of this paper, we introduce concepts informally, as needed for our discussion; for formal specifications, see [W3C99].

Elements may contain a mix of character data and other elements; e.g., the book element contains the text "Here are some" in addition to elements such as title and price. The element named title contains character data denoting the book title and is contained in the book element. Similarly, the element price contains character data denoting the book's price. This element also has an attribute named currency with value USD, represented using the syntax attribute-name="attribute-value" within the element's start tag.

In general, element names are not unique; e.g., the book element in our example contains two review elements. However, attribute names are unique within an element; e.g., the price element cannot have another attribute named currency. The syntax
permits an empty element <bestseller></bestseller> to be represented more concisely as <bestseller/>. XML documents are called well-formed if they satisfy simple syntactic constraints, such as proper delimiting of element names and attributes and proper nesting of start and end tags.

Copyright 1999 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE.
Bulletin of the IEEE Computer Society Technical Committee on Data Engineering

1.1 DTD

As described above, XML provides a simple and general markup facility which is useful for data interchange. The simple tag-delimited structure of well-formed XML makes parsing extremely simple. However, applications that operate on XML data often need additional guarantees on the structure and content of such data. For example, a program that calculates the tax on the sale of a book may need to assume that each book element in its XML input includes a price subelement with a currency attribute and numeric content. Such constraints on document structure can be expressed using a Document Type Definition (DTD).

A DTD defines a class of XML documents using a language that is essentially a context-free grammar with several restrictions. For example, one may use the following DTD declaration to constrain XML documents such as those in our book example:

    <!ELEMENT book       (title, author+, price, review*, bestseller?)>
    <!ELEMENT title      (#PCDATA)>
    <!ELEMENT author     (#PCDATA | lastname | firstname | fullname)*>
    <!ELEMENT price      (#PCDATA)>
    <!ATTLIST price      currency  CDATA                   "USD"
                         source    (list | regular | sale) "list"
                         taxed     CDATA                   #FIXED "yes">
    <!ELEMENT bestseller EMPTY>
    <!ATTLIST bestseller authority CDATA                   #REQUIRED>

The first line of this declaration is an element type declaration that constrains the contents of the book element. Following common convention, the
declaration syntax uses commas for sequencing, parentheses for grouping, and the operators ?, *, and + to denote, respectively, zero or one, zero or more, and one or more occurrences of the preceding construct. Note that the declaration requires every book element to have a price subelement. The second line declares the type for the title element to be parsed character data, implying that an XML processor will parse the contents looking for markup. Note that the use of some element names (e.g., review, lastname) without a corresponding declaration is not an error; such elements are simply not constrained by this DTD.

The last two lines declare bestseller to be an element that must be empty and that must have an authority attribute of type character data. The declaration also indicates that the price element may have three attributes: currency, of type character data and default value USD; source, with one of the three values shown (an enumerated type) and default value list; and taxed, with the fixed value yes. The fixed attribute type is a special case of the default attribute type; it mandates that the specified default value not be changed by an XML document conforming to the DTD. Fixed-value attributes are convenient for ensuring that data critical to processing an element type is available with the desired value without requiring it to be explicitly specified for each element of that type. Our example DTD specifies that the book in our XML example must be taxed.

An XML document that satisfies the constraints of a DTD is said to be valid with respect to that DTD. The DTD associated with an XML document may be specified using several methods, one of which is the inclusion of a document type declaration, <!DOCTYPE BOOKCATALOG SYSTEM "http://tt.com/bookcatalog.dtd">, in a special section near the beginning of a document called its prolog. This declaration indicates that the XML document claims validity with respect to the BOOKCATALOG DTD, which may be found at the indicated location.

The data modeling facilities provided by
DTDs are insufficient for many applications. For example, we cannot use DTDs to require that the value of the element price be a fixed-precision real number in the range zero through 10000, with two digits after the point. Thus, our tax calculation application cannot rely on XML validity with respect to its DTD for such simple error checking. The XML Schema proposal [BLM+99, BM99] defines facilities that address these needs.

1.2 RDF

The Resource Description Framework, RDF [LS99, BG99], provides a general method to describe