Slides 07: XML

CS122B: Projects in Databases and Web Applications
Winter 2017
Notes 05: XML
Professor Chen Li
Department of Computer Science
UC Irvine
CS122B

Outline
  XML basics
  DTD
  Parsing XML (SAX/DOM)

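An HTML document

HTML code
[Markup lost in extraction. Surviving text: Hull, California, 1995; Su, Purdue]

[Figure: XML data drawn as a tree. Surviving labels: BOOKS, book, article, loc=library, ref, author, year, title. Surviving values: 123, 555, Hull, 1995, Su, Purdue, California]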

Special Characters
  Some characters need to be escaped because they have special significance:

    Character   Escaped as
    <           &lt;
    >           &gt;
    &           &amp;
    '           &apos;
    "           &quot;

  If they were not escaped, they would be processed as markup by the XML engine.

Prolog in XML Files
  An XML file always starts with a prolog.

  The minimal prolog contains a declaration that identifies the document as an XML document: <?xml version="1.0"?>
  The declaration may also contain additional information:
    version: the version of XML used in the data
    encoding: the character set used
    standalone: whether the document references an external entity or data type specification

An Example

[XML example; markup lost in extraction. Surviving text: "Introduction to CML", "Overview", "Why is XML great?"]

Next: Document Type Definition (DTD)

Document Type Definition (DTD)
  A DTD specifies the types of tags that can be included in the XML document. It defines:
    which tags are valid, and in what arrangements
    where text is expected, letting the parser determine whether the whitespace it sees is significant or ignorable
  It is an optional part of the document prolog.
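
A minimal Java sketch of what "defines which tags are valid, and in what arrangements" means in practice. It assumes a small slideshow DTD written for this illustration (not the one from the lecture) and uses the standard javax.xml.parsers API with validation switched on. The class name DtdValidationSketch and both sample documents are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical class; the DTD and documents below are illustrative assumptions.
class DtdValidationSketch {

    // Prolog with an internal DTD: a slideshow holds one or more slides,
    // and each slide has a title followed by zero or more items.
    static final String PROLOG =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE slideshow [\n"
        + "  <!ELEMENT slideshow (slide+)>\n"
        + "  <!ELEMENT slide (title, item*)>\n"
        + "  <!ELEMENT title (#PCDATA)>\n"
        + "  <!ELEMENT item (#PCDATA)>\n"
        + "]>\n";

    static final String VALID = PROLOG
        + "<slideshow><slide><title>Overview</title>"
        + "<item>Why is XML great?</item></slide></slideshow>";

    // Invalid with respect to the DTD: at least one slide is required.
    static final String INVALID = PROLOG + "<slideshow></slideshow>";

    // Parses with a validating parser and collects the errors it reports.
    static List<String> validate(String xml) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // check the document against its DTD
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                @Override public void error(SAXParseException e) {
                    errors.add(e.getMessage()); // validity errors arrive here
                }
                @Override public void fatalError(SAXParseException e)
                        throws SAXParseException {
                    throw e; // not well-formed: stop parsing
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            errors.add(e.getMessage());
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println("valid document:   " + validate(VALID));
        System.out.println("invalid document: " + validate(INVALID));
    }
}
```

The first document should come back with an empty error list; the second should report that the content of slideshow is incomplete. The exact wording of the message depends on the parser implementation.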

XML DTD
XML document and DTD
[Figure: a slideshow XML document shown next to its DTD. Surviving labels: Slideshow, Slide, slide, item, title, the qualifiers + and *, XML Document. Surviving values: item1, item2, item3, DB, AI]

Qualifiers
  Qualifier   Meaning
  ?           Optional (zero or one)
  *           Zero or more
  +           One or more

Defining Text
  PCDATA: Parsed Character DATA. The "#" that precedes PCDATA indicates that what follows is a special word, rather than an element name.
  CDATA: Unparsed character data. Normally used for embedding scripts (such as JavaScript scripts).

  Comparison: asp

Attribute Types
  Attribute Type   Specifies...
  CDATA            "Unparsed character data" (a text string).
  ID               A name that no other ID attribute shares.
  IDREF            A reference to an ID defined elsewhere in the document.
  IDREFS           A space-separated list containing one or more ID references.
  ENTITY           The name of an entity defined in the DTD.
  ENTITIES         A space-separated list of entities.
  NMTOKEN          A valid XML name composed of letters, numbers, hyphens, underscores, and colons.
  NMTOKENS         A space-separated list of names.
  NOTATION         The name of a DTD-specified notation, which describes a non-XML data format, such as those used for image files.

Next: Parsing XML (SAX/DOM)

What is an XML Parsing API?
  Programming model for accessing an XML document
  Sits on top of an XML parsing engine
  Language/platform independent

Java XML Parsing Specification
  The Java XML Parsing Specification is a request to include a standardised way of parsing XML into the Java standard library.
  The specification defines the following packages:

    javax.xml.parsers
    org.xml.sax
    org.xml.sax.helpers
    org.w3c.dom
  The first is an all-new pluggability layer; the others come from existing packages.

Two ways of using XML parsers: SAX and DOM
  The Java XML Parsing Specification specifies two interfaces for XML parsers:
    Simple API for XML (SAX) is a flat, event-driven parser
    Document Object Model (DOM) is an object-oriented parser which translates the XML document into a Java Object hierarchy

SAX
  Simple API for XML
  Event-based XML parsing API
  Not governed by any standards body
    Guy named David Megginson basically owns it
  SAX is simply a programming model that the developers of individual XML parsers implement
    A SAX parser written in Java would expose the equivalent events
  "Serial access" protocol for XML

SAX (cont)
  A SAX parser reads the XML document as a stream of XML tags: starting elements, ending elements, text sections, etc.
  Every time the parser encounters an XML tag, it calls a method in its HandlerBase object to deal with the tag.
    The HandlerBase object is usually written by the application programmer. (HandlerBase is the SAX1 class; SAX2 deprecates it in favor of DefaultHandler.)
    The HandlerBase object is given as a parameter to the parse() method of the SAX parser. It includes all the code that defines what the XML tags actually do.

How Does SAX work?
[Figure: an XML document with entries for John Doe and Jane Doe, each with an e-mail address shown as [email protected], next to the sequence of parser events it produces:]
  startDocument
  startElement
  startElement & characters
  startElement & characters
  endElement
  startElement
  startElement & characters
  startElement & characters
  endElement
  endElement & endDocument
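
The event sequence above can be reproduced with a small handler. This is a sketch under stated assumptions: it extends DefaultHandler (the SAX2 replacement for HandlerBase), the class name SaxEventsSketch is hypothetical, and the address book document is reconstructed for illustration, with placeholder e-mail addresses because the originals are not recoverable from the slide.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class; the document and e-mail addresses are placeholders.
class SaxEventsSketch {

    static final String XML =
          "<addressbook>"
        + "<person><name>John Doe</name><email>john@example.com</email></person>"
        + "<person><name>Jane Doe</name><email>jane@example.com</email></person>"
        + "</addressbook>";

    // Records every callback the parser makes, in order.
    static class EventLogger extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override public void startDocument() { events.add("startDocument"); }

        @Override public void endDocument() { events.add("endDocument"); }

        @Override public void startElement(String uri, String localName,
                                           String qName, Attributes atts) {
            events.add("startElement " + qName);
        }

        @Override public void endElement(String uri, String localName, String qName) {
            events.add("endElement " + qName);
        }

        // A parser may deliver one run of text in several calls.
        @Override public void characters(char[] ch, int start, int length) {
            String text = new String(ch, start, length).trim();
            if (!text.isEmpty()) {
                events.add("characters " + text);
            }
        }
    }

    // The handler is passed to parse(); the parser calls back into it.
    static List<String> parse(String xml) {
        EventLogger handler = new EventLogger();
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
        return handler.events;
    }

    public static void main(String[] args) {
        parse(XML).forEach(System.out::println);
    }
}
```

Nothing is kept in memory except what the handler chooses to store, which is why SAX suits very large documents.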

SAX structure

SAX tutorial
  x/parsing.html

Document Object Model (DOM)

  Most common XML parser API
  Tree-based API
  W3C Standard
  All DOM compliant parsers use the same object model

DOM (cont)
  A DOM parser is usually referred to as a document builder. It is not really a parser, more like a translator that uses a parser. In fact, most DOM implementations include a SAX parser within the document builder.
  A document builder reads in the XML document and outputs a hierarchy of Node objects, which corresponds to the structure of the XML document.

How Does DOM work?
[Figure: an XML parser turns an XML document (addressbook, with entries for John Doe and Jane Doe and e-mail addresses shown as [email protected]) into DOM objects: an addressbook Node with two person Nodes below it, each with a Name Node and an e-mail Node]
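
A short sketch of the Node hierarchy a document builder produces. The address book document is reconstructed for illustration with placeholder e-mail addresses, and the class name DomTreeSketch is hypothetical. Only standard org.w3c.dom and javax.xml.parsers calls are used.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical class; the document and e-mail addresses are placeholders.
class DomTreeSketch {

    static final String XML =
          "<addressbook>"
        + "<person><name>John Doe</name><email>john@example.com</email></person>"
        + "<person><name>Jane Doe</name><email>jane@example.com</email></person>"
        + "</addressbook>";

    // The document builder reads the whole document and returns the tree.
    static Document build(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not build the DOM tree", e);
        }
    }

    // Depth-first walk: one output line per node, indented by depth.
    static void dump(Node node, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(node.getNodeName());
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(" = ").append(node.getNodeValue());
        }
        out.append('\n');
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            dump(child, depth + 1, out);
        }
    }

    static String describe(String xml) {
        StringBuilder out = new StringBuilder();
        dump(build(xml).getDocumentElement(), 0, out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe(XML));
    }
}
```

Element text shows up as a separate #text child node rather than as a value of the element itself.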

DOM Structure Model and API
  Hierarchy of Node objects: document, element, attribute, text, comment, ...
  Language-independent programming DOM API:
    get... first/last child, prev/next sibling, childNodes
    insertBefore, replace
    getElementsByTagName
    ...
  Alternative: event-based SAX API (Simple API for XML)
    does not build a parse tree (reports events when encountering begin/end tags)
    for (partially) parsing very large documents

DOM references
  Online tutorial: m/readingXML.html

A few functions
  Create a DOM document (tree):
    Document doc = builder.parse( new File(argv[0]) );
  Normalize the text nodes (merge adjacent text nodes and remove empty ones):
    doc.normalize();
  Generate a list of nodes for the link elements:
    NodeList links = doc.getElementsByTagName("link");
  W3C DOM treats the text as a node:
    getFirstChild().getNodeValue()
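
The calls above can be put together roughly as follows. This is a sketch, not the lecture's code: it parses an in-memory string instead of new File(argv[0]) so that it runs on its own, and the class name DomLinksSketch, the page document, and the example.com URLs are all assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class; the document and URLs are placeholders.
class DomLinksSketch {

    static final String XML =
          "<page>\n"
        + "  <link>http://example.com/a</link>\n"
        + "  <link>http://example.com/b</link>\n"
        + "</page>";

    // Returns the text content of every link element, in document order.
    static List<String> linkTexts(String xml) {
        List<String> result = new ArrayList<>();
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            doc.normalize(); // merge adjacent text nodes, drop empty ones
            NodeList links = doc.getElementsByTagName("link");
            for (int i = 0; i < links.getLength(); i++) {
                // The text is a child node of the element, not its value.
                Node text = links.item(i).getFirstChild();
                if (text != null) {
                    result.add(text.getNodeValue());
                }
            }
        } catch (Exception e) {
            throw new IllegalStateException("could not read the document", e);
        }
        return result;
    }

    public static void main(String[] args) {
        linkTexts(XML).forEach(System.out::println);
    }
}
```

The null check covers an empty element such as <link/>, which has no first child.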

