Distributed Software Development
XML
Chris Brooks
Department of Computer Science
University of San Francisco

Outline
• About XML
• Structuring XML documents
• Using CSS to display XML
• Parsing with DOM
• Parsing with SAX

XML
• XML is a language for describing data.
  • Really more of a meta-language
• XML itself provides metadata.
  • Data types, relations between data objects, etc.
• Designed to be read, created, and consumed by programs.

Advantages of XML
• Well-defined, easy-to-manipulate structure
• Human-readable
• Extensible
• Metadata can be included directly with data
• Widely used

Things to note
• An XML document has two components:
  • tags (metadata)
  • content (data)
• Metadata serves to help an application make sense of the data.

Example
<?xml version="1.0"?>
<book>
  <author> J.R.R. Tolkien </author>
  <title> The Lord of the Rings </title>
  <volumes>
    <volume> Fellowship of the Ring </volume>
    <volume> The Two Towers </volume>
    <volume> Return of the King </volume>
  </volumes>
  <price> 14.95 </price>
  <publisher> Ballantine </publisher>
  <isbn> 0345340426 </isbn>
</book>

XML documents as trees
• An XML document can also be represented as a tree.
• This makes XML very easy to parse.
• The outermost element is the root element, and elements contained within it are children of that element.
• Content is stored at the leaves.
• What would the tree for our Tolkien example look like?

Outline
• About XML
• Structuring XML documents
• Using CSS to display XML
• Parsing with DOM
• Parsing with SAX

Elements
• XML requires that every starting tag have a corresponding closing tag.
• A starting tag, its closing tag, and everything between them is called an element.
• For example, <volume> Return of the King </volume> is an element.
• So is everything between <volumes> and </volumes>.
• As is everything between <book> and </book>.
• This means that elements must be properly nested: an element opened inside another element must be closed before its parent is closed.

Tags and elements
• Tags form the boundaries of elements, and give processing instructions to parsers.
• Empty elements: <coAuthor /> All information is contained in the tag.
• Container elements: <price> 14.95 </price>
• Comments: <!-- here’s a comment -->
• Declaration: <!ENTITY jrrt "J.R.R. Tolkien"> This provides a way to define variables or constants in a single location.
• Entity reference: <author> &jrrt; </author>
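The question on the tree slide can be answered mechanically. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module (an assumption: this part of the deck does not name a library) and a hypothetical helper called outline. It parses the Tolkien example and prints one line per element, indented by depth, so the root, its children, and the content at the leaves are visible.

```python
import xml.etree.ElementTree as ET

# The Tolkien example from the slides, reproduced as a string.
BOOK_XML = """<?xml version="1.0"?>
<book>
  <author> J.R.R. Tolkien </author>
  <title> The Lord of the Rings </title>
  <volumes>
    <volume> Fellowship of the Ring </volume>
    <volume> The Two Towers </volume>
    <volume> Return of the King </volume>
  </volumes>
  <price> 14.95 </price>
  <publisher> Ballantine </publisher>
  <isbn> 0345340426 </isbn>
</book>"""


def outline(element, depth=0):
    """Return one line per element, indented by depth (hypothetical helper).

    Elements that hold text (the leaves here) show that text after the tag.
    """
    text = (element.text or "").strip()
    label = element.tag + (": " + text if text else "")
    lines = ["  " * depth + label]
    for child in element:  # children come back in document order
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(BOOK_XML)  # root is the outermost element, <book>
print("\n".join(outline(root)))
```

In this sketch only the leaf elements carry text; book and volumes exist purely to group their children, which matches the "content is stored at the leaves" point.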
Attributes and Values
• You can also specify that an element has attributes.
• These attributes can take on values.
• This is helpful when you want to specify that an object belongs to one of a few types.
<book genre="fantasy" size="large"> ... </book>

Attributes vs. Sub-elements
• We could rewrite the example above using subelements instead of attributes.
• When to use one over the other is largely stylistic.
  • Can always transform one into the other
• If a feature can only take on one of a few values, an attribute might make more sense.
• If we expect to extend the number of genres, a subelement is preferable.
• Also, order is preserved for subelements.
  • Semantically, attribute/value pairs are treated as a dictionary.
  • So, a list of authors should be done as subelements.

ID attributes
• A particularly helpful attribute is ID: this lets you assign a reference to an element and refer to it later in the document.
<volume id="book1"> Fellowship of the Ring </volume>
<volume id="book2"> The Two Towers. Read this book after you’ve finished <volumeref idref="book1" />.</volume>
• The volumeref tag refers to a previous volume.
• This provides the XML parser with the information that this is a reference to a previous volume with id “book1”.

Document Prolog
• If you’ve looked at XML that’s used by other applications, you’ve probably noticed a lot of messy-looking stuff at the top.
• This is called the document prolog.
• This tells a client that the document is in XML and refers it to other documents that indicate which tags are valid.
<?xml version="1.0" encoding="US-ASCII" standalone="no"?>
<!DOCTYPE book
  PUBLIC "-//USF //DTD Book 1.8//EN"
  "http://www.foobar.com/DTDs/lotr.dtd"
[
  <!ENTITY jrrt "J.R.R. Tolkien">
  <!ENTITY elvish-key "elvish.xml">
]>
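The last few slides (attributes, ID references, and the entity declarations in the prolog) can be combined into one small sketch. It assumes Python's standard xml.etree.ElementTree module, which is not named in this part of the deck. The document below is a hypothetical merge of the slide fragments; it keeps only the internal entity declaration and leaves out the external DTD so that nothing has to be fetched. ElementTree does not resolve idref attributes on its own, so the lookup table by_id is built by hand here.

```python
import xml.etree.ElementTree as ET

# Hypothetical document assembled from the slide fragments.
# Only the internal subset is used; the external DTD is deliberately omitted.
DOC = """<?xml version="1.0"?>
<!DOCTYPE book [
  <!ENTITY jrrt "J.R.R. Tolkien">
]>
<book genre="fantasy" size="large">
  <author>&jrrt;</author>
  <volumes>
    <volume id="book1">Fellowship of the Ring</volume>
    <volume id="book2">The Two Towers. Read this book after you've
      finished <volumeref idref="book1" />.</volume>
  </volumes>
</book>"""

root = ET.fromstring(DOC)

# Attribute/value pairs behave like a dictionary attached to the element.
genre = root.get("genre")
attributes = dict(root.attrib)

# The entity declared in the prolog is expanded by the parser.
author = root.findtext("author")

# Sub-elements are returned in document order.
volume_ids = [volume.get("id") for volume in root.iter("volume")]

# Resolve the idref by hand: index every element that carries an id.
by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}
ref = root.find(".//volumeref")
target = by_id[ref.get("idref")]

print(genre, attributes)
print(author)
print(volume_ids)
print(ref.get("idref"), "->", target.text)
```

The dictionary view is why a list of authors fits poorly into attributes: a key can appear only once per element, while repeated sub-elements such as volume keep both their multiplicity and their order.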
Document Prolog
<?xml version="1.0" encoding="US-ASCII" standalone="no"?>
• This is the XML declaration.
• It indicates that the document is XML, the character encoding it uses, and whether or not the client will need to fetch supporting documents.

Document Prolog
<!DOCTYPE book
• This is the document type declaration: it indicates that the root element in the XML document is a book.

Document Prolog
PUBLIC "-//USF //DTD Book 1.8//EN"
"http://www.foobar.com/DTDs/lotr.dtd"
• These lines designate a document type definition.
• Basically, this points to a separate document (called a DTD) that describes what elements books are allowed to have.
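The pieces of the prolog described above can be read back from a parsed document. This is a minimal sketch, assuming Python's standard xml.dom.minidom module (an assumption, since the deck's own DOM section is not part of this excerpt). The one-element book body is a placeholder added only so the document is complete. The parser records the public and system identifiers but does not download the DTD, so the example runs offline and performs no validation.

```python
from xml.dom import minidom

# Prolog taken from the slides, followed by a placeholder root element.
# Bytes are used so the declared US-ASCII encoding is honored by the parser.
DOC = b"""<?xml version="1.0" encoding="US-ASCII" standalone="no"?>
<!DOCTYPE book PUBLIC "-//USF //DTD Book 1.8//EN"
  "http://www.foobar.com/DTDs/lotr.dtd">
<book><title>The Lord of the Rings</title></book>"""

doc = minidom.parseString(DOC)

# Values from the XML declaration.
print("encoding:  ", doc.encoding)
print("standalone:", doc.standalone)

# Values from the document type declaration.
doctype = doc.doctype
print("root name: ", doctype.name)      # element the DOCTYPE names as root
print("public id: ", doctype.publicId)  # formal public identifier
print("system id: ", doctype.systemId)  # location of the DTD

# The actual root element should match the name given in the DOCTYPE.
print("actual root:", doc.documentElement.tagName)
```

Comparing doctype.name with documentElement.tagName is a cheap sanity check; full validation against the DTD would need a validating parser, which minidom is not.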