CSCI.4220 Network Programming
Class 14
Parsing XML with DOM, XSLT

Required Reading

The Sun Microsystems DOM and JAXP tutorial. Read Part 1, Reading XML Data into a DOM, only.

w3schools has an excellent DOM tutorial.

The Document Object Model (DOM)

The DOM is a language-neutral and platform-neutral way to parse XML (and HTML) documents.

The XML DOM defines a standard set of objects for XML with standard properties and methods, and a standard way to access and manipulate XML documents.

The basic model is a tree structure. All elements, attributes, text, and other components can be accessed.

Each of these is a node.

There are DOM parsing implementations in Java, JavaScript, C++, C#, Perl, etc.

For our demo, we will use the Java JAXP DOM parser. This defines a Node. Node is an interface; the more specific node interfaces (Element, Text, Comment, and so on) extend it and inherit its methods.

There are 12 types of node, each represented by an interface derived from Node. They are listed with their numeric constants in the getNodeType() table below.

Here are some typical member properties and functions of a node that can be used to traverse through the tree.

NamedNodeMap getAttributes() - returns a NamedNodeMap, a collection of all of the attributes of a node (null unless the node is an Element)

NodeList getChildNodes() - returns a list of the child nodes

Node getFirstChild()

Node getLastChild()

Node getNextSibling()

Node getParentNode()

There are three functions that you need to get the actual contents of a node.

int getNodeType() - returns the type of the node as an int. The possible values are listed here.

NodeType 	Named Constant
1	ELEMENT_NODE
2	ATTRIBUTE_NODE
3	TEXT_NODE
4	CDATA_SECTION_NODE
5	ENTITY_REFERENCE_NODE
6	ENTITY_NODE
7 	PROCESSING_INSTRUCTION_NODE
8 	COMMENT_NODE
9 	DOCUMENT_NODE
10 	DOCUMENT_TYPE_NODE
11 	DOCUMENT_FRAGMENT_NODE
12 	NOTATION_NODE

String getNodeName()

String getNodeValue()

What these return depends on the type of node.

Interface               nodeName                    nodeValue                             attributes
Attr                    name of attribute           value of attribute                    null
CDATASection            "#cdata-section"            content of the CDATA section          null
Comment                 "#comment"                  content of the comment                null
Document                "#document"                 null                                  null
DocumentFragment        "#document-fragment"        null                                  null
DocumentType            document type name          null                                  null
Element                 tag name                    null                                  NamedNodeMap
Entity                  entity name                 null                                  null
EntityReference         name of entity referenced   null                                  null
Notation                notation name               null                                  null
ProcessingInstruction   target                      entire content excluding the target   null
Text                    "#text"                     content of the text node              null
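The table can be checked against a small document. The following is a minimal sketch, not part of the course programs: the class name NodeInfoDemo and its helper methods are made up for illustration. It parses a short XML string and reports the type, name and value of the root element, its attributes and its children.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.SAXException;

class NodeInfoDemo {
    // Hypothetical helper: formats one node as "type name value".
    static String describe(Node n) {
        return n.getNodeType() + " " + n.getNodeName() + " " + n.getNodeValue();
    }

    // Describes the root element, then its attributes, then its children.
    static List<String> describeRoot(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Element root = doc.getDocumentElement();
        List<String> lines = new ArrayList<>();
        lines.add(describe(root));
        NamedNodeMap attrs = root.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            lines.add(describe(attrs.item(i)));
        }
        for (Node c = root.getFirstChild(); c != null; c = c.getNextSibling()) {
            lines.add(describe(c));
        }
        return lines;
    }

    public static void main(String[] args) throws Exception {
        String xml =
            "<book isbn=\"123456\"><!--note--><title>Unix</title>text</book>";
        for (String line : describeRoot(xml)) {
            System.out.println(line);
        }
    }
}
```

For the sample string, the element lines show a null value, while the attribute, comment and text lines carry their content in nodeValue, as the table says.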

This should give you enough information to be able to traverse an xml document to find a particular piece of information.

There are also member functions which allow you to create a new xml document or modify an existing document. We will see these below.

Here is some skeleton code to create a Document from an xml file.

DocumentBuilderFactory factory =
       DocumentBuilderFactory.newInstance();
try {
      DocumentBuilder builder = factory.newDocumentBuilder();
      Document document = builder.parse( new File(argv[0]) );
       // this could also read from a URL or any other stream
      ...

A DocumentBuilderFactory defines a factory API that enables applications to obtain a parser that produces DOM object trees from XML documents. This has a member function newDocumentBuilder that returns a DocumentBuilder.

The DocumentBuilder class has a member parse, which can take as an argument a File, an InputStream, an InputSource, or a String that holds a URI. This does the parsing (i.e. builds the tree) and returns a Document, which the skeleton code above stores in the variable document.

To validate the XML file against a DTD, call setValidating(true) on the DocumentBuilderFactory before you create the builder (the default is false). The DocumentBuilder member isValidating() takes no argument; it only reports whether the parser validates.
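Parsing does not have to start from a file. As a rough sketch (the class name ParseSources and the method parseString are invented for this example), the same builder can read XML held in a String by wrapping it in an InputSource:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class ParseSources {
    // Hypothetical helper: parses XML held in a String rather than a file.
    static Document parseString(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Validation is configured on the factory; false is the default.
        factory.setValidating(false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // An InputSource can wrap any Reader, here a StringReader.
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parseString("<library><book isbn=\"123456\"/></library>");
        System.out.println("root element: "
            + doc.getDocumentElement().getNodeName());
    }
}
```

An InputStream obtained from a socket or a URL connection can be passed to parse in the same way.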

Here is a complete program which displays all of the nodes of an xml file. The children of a node are indented.

/* a dom parser that displays all data of an xml file */
import javax.xml.parsers.*; 
import org.xml.sax.*;
import java.io.*;
import org.w3c.dom.*;
import java.lang.String;

public class DomParser{
    public static void main(String argv[])
    {
        if (argv.length != 1) {
            System.err.println("Usage: java DomParser filename");
            System.exit(1);
        }
        DocumentBuilderFactory factory =
            DocumentBuilderFactory.newInstance();
	factory.setValidating(true);
        try {
           DocumentBuilder builder = factory.newDocumentBuilder();
           Document document = builder.parse( new File(argv[0]) );
	   // this could also read from a URL or any other stream
           Element root = document.getDocumentElement();
           if (root != null) PrintTree(root,0);
        } catch (SAXException sxe) {
           // Error generated during parsing
             System.err.println(sxe);
        } catch (ParserConfigurationException pce) {
            // Parser with specified options can't be built
             System.err.println(pce);
        } catch (IOException ioe) {
             System.err.println(ioe);
        }
    } // end of main

    // PrintTree - prints the DOM tree rooted at e, indenting each level
    //    an additional 4 spaces
    private static void PrintTree(Element e, int indent)
    {
        int i;
        for (i=0;i<indent;i++) System.out.print(" ");
        System.out.println("Element Tag " + e.getNodeName());
        Node child = e.getFirstChild();
        while (child != null) {
            int type = child.getNodeType();
            if (type == Node.ELEMENT_NODE) {
                // recurse into child elements, one level deeper
                PrintTree((Element)child,indent+4);
            }
            else if (type == Node.TEXT_NODE) {
                for (i=0;i<indent+4;i++) System.out.print(" ");
                System.out.println("Text: " + child.getNodeValue());
            }
            else if (type == Node.COMMENT_NODE) {
                for (i=0;i<indent+4;i++) System.out.print(" ");
                System.out.println("Comment" + child.getNodeValue());
            }
            child = child.getNextSibling();
        }
    }
}
The main function is boilerplate. The function PrintTree goes through the entire tree starting at the root. If a child is an element, the function calls itself recursively. Otherwise it displays comments and text.

The program was run with this file as input.

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE library SYSTEM "library.dtd">
<library>
  <!--- this is a comment --> 
   <book isbn="123456">
        <title>Unix Network Programming</title>
        <author>W. Richard Stevens</author>
        <publisher>Addison Wesley</publisher>
   </book>
   <book isbn="987654">
        <title>Modern Operating Systems</title>
        <author>Andrew S. Tanenbaum</author>
        <publisher>Prentice Hall</publisher>
   </book>
</library>
Here was the output.
Element Tag library
    Text:

    Comment- this is a comment
    Text:

    Element Tag book
        Text:

        Element Tag title
            Text: Unix Network Programming
        Text:

        Element Tag author
            Text: W. Richard Stevens
        Text:

        Element Tag publisher
            Text: Addison Wesley
        Text:

    Text:

    Element Tag book
        Text:

        Element Tag title
            Text: Modern Operating Systems
        Text:

        Element Tag author
            Text: Andrew S. Tanenbaum
        Text:

        Element Tag publisher
            Text: Prentice Hall
        Text:

    Text:
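Notice that there are a lot of blank text children. This is because the parser treats the whitespace (newlines and indentation) between elements as text nodes.

You can prevent this by adding this line
factory.setIgnoringElementContentWhitespace(true);
after you create the factory. This setting relies on the content model in the DTD, so the parser must be validating (as it is here).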

The output now looks like this:

Element Tag library
    Comment- this is a comment
    Element Tag book
        Element Tag title
            Text: Unix Network Programming
        Element Tag author
            Text: W. Richard Stevens
        Element Tag publisher
            Text: Addison Wesley
    Element Tag book
        Element Tag title
            Text: Modern Operating Systems
        Element Tag author
            Text: Andrew S. Tanenbaum
        Element Tag publisher
            Text: Prentice Hall

You can download this program here

The Element class has a number of other member functions.

NodeList getElementsByTagName(String name) returns a NodeList of all descendants of the element with a given tag name. This allows you to find elements with a particular tag name without traversing the entire tree.

A NodeList is a list of Nodes. It has only two member functions of interest, int getLength() and Node item(int i).

String getAttribute(String attrName) returns the value of an attribute.

NamedNodeMap getAttributes() returns all of the attributes of an element. This has member functions getLength() and item().
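A short sketch of these members in use follows. The class name FindByTag and the method isbnAndTitles are invented for this example, and getTextContent() is a Node method not covered above that returns the text inside an element. It collects the isbn attribute and title of every book without walking the tree by hand.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class FindByTag {
    // Hypothetical helper: returns one "isbn: title" string per book element.
    static List<String> isbnAndTitles(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        List<String> result = new ArrayList<>();
        // All book descendants of the root, however deeply nested.
        NodeList books = doc.getDocumentElement().getElementsByTagName("book");
        for (int i = 0; i < books.getLength(); i++) {
            Element book = (Element) books.item(i);
            String isbn = book.getAttribute("isbn");
            NodeList titles = book.getElementsByTagName("title");
            String title = titles.getLength() > 0
                ? titles.item(0).getTextContent() : "";
            result.add(isbn + ": " + title);
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<library>"
            + "<book isbn=\"123456\"><title>Unix Network Programming</title></book>"
            + "<book isbn=\"987654\"><title>Modern Operating Systems</title></book>"
            + "</library>";
        for (String line : isbnAndTitles(xml)) {
            System.out.println(line);
        }
    }
}
```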

The above program traversed the tree by getting the first child of each node, then getting the next sibling, etc. An alternative way of doing the same thing is to use the function
NodeList getChildNodes()
which returns a node list of all of the children. You can then use the getLength and item members to access these.

Here is another sample program which traverses the tree using this method. It displays the attributes of each element and the number of children that each element has, so you can see the exact structure.

The main program is identical to the first example. The only difference is in the PrintTree function.

/* another dom parser */
import javax.xml.parsers.*; 
import org.xml.sax.*;
import java.io.*;
import org.w3c.dom.*;
import java.lang.String;

public class DomParser3{
    public static void main(String argv[])
    {
        if (argv.length != 1) {
            System.err.println("Usage: java DomParser3 filename");
            System.exit(1);
        }
        DocumentBuilderFactory factory =
            DocumentBuilderFactory.newInstance();
	factory.setValidating(true);
        try {
           DocumentBuilder builder = factory.newDocumentBuilder();
           Document document = builder.parse( new File(argv[0]) );
	   // this could also read from a URL or any other stream
           Element root = document.getDocumentElement();
           if (root != null) PrintTree(root);
        } catch (SAXException sxe) {
           // Error generated during parsing
             System.err.println(sxe);
        } catch (ParserConfigurationException pce) {
            // Parser with specified options can't be built
             System.err.println(pce);
        } catch (IOException ioe) {
            ioe.printStackTrace();
        }
    } // end of main


    private static void PrintTree(Element e) 
    {
	int i;
        Node n,child;
        NodeList children;
        NamedNodeMap attributes;
       
        attributes = e.getAttributes();
        System.out.println("Element " + e.getNodeName() + " has " +
			   attributes.getLength() + " attributes");
        for(i=0;i<attributes.getLength();i++) {
            n = attributes.item(i);
            System.out.println(n.getNodeName() + "=" + n.getNodeValue());
        }
        children = e.getChildNodes();
        System.out.println("Element " + e.getNodeName() + " has " +
			   children.getLength() + " children");
        for (i=0;i<children.getLength();i++) {
	    child = children.item(i);
            if (child.getNodeType()==Node.ELEMENT_NODE) PrintTree((Element)child);
	}
    }
     
}

Here is the output when this was run with library.xml

 
Element library has 0 attributes
Element library has 7 children
Element book has 1 attributes
isbn=123456
Element book has 7 children
Element title has 0 attributes
Element title has 1 children
Element author has 0 attributes
Element author has 1 children
Element publisher has 0 attributes
Element publisher has 1 children
Element book has 1 attributes
isbn=987654
Element book has 7 children
Element title has 0 attributes
Element title has 1 children
Element author has 0 attributes
Element author has 1 children
Element publisher has 0 attributes
Element publisher has 1 children

You can download this program here

Notice that the element book has seven children. There are the three obvious children, title, author and publisher, but also four text children, made up of the whitespace between these. If the XML file were all scrunched together without whitespace, these would go away, and it would still be syntactically correct, but hard for humans to read.

You can also solve this problem (if you see it as a problem) with this line, after you create the factory as discussed above
factory.setIgnoringElementContentWhitespace(true);
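When there is no DTD to validate against, the factory setting has nothing to work from. One possible workaround, sketched here with invented names (SkipWhitespace, isBlankText, countNonBlankChildren), is to skip whitespace-only text nodes yourself while traversing:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class SkipWhitespace {
    // True for a text node that holds nothing but whitespace.
    static boolean isBlankText(Node n) {
        return n.getNodeType() == Node.TEXT_NODE
            && n.getNodeValue().trim().isEmpty();
    }

    // Counts the children of e, ignoring whitespace-only text nodes.
    static int countNonBlankChildren(Element e) {
        int count = 0;
        for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (!isBlankText(c)) count++;
        }
        return count;
    }

    // Hypothetical helper: parses a string (no DTD) and returns the root.
    static Element parseRoot(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)))
                      .getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        Element book = parseRoot(
            "<book>\n  <title>T</title>\n  <author>A</author>\n</book>");
        System.out.println("all children: "
            + book.getChildNodes().getLength());
        System.out.println("non-blank children: "
            + countNonBlankChildren(book));
    }
}
```

This keeps text nodes that have real content, so it is only a reasonable choice for documents where whitespace between elements carries no meaning.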

Using the Dom Parser to write an xml file.

There are also member functions which can be used to write an xml file or to modify a file. This is useful for converting data from a database into xml.

You can create a new document with the DocumentBuilder newDocument() member function. Once this has been done, Document has member functions such as the following.

Element createElement(String tagName)

Comment createComment(String commentString)

Text createTextNode(String data)

The Node class has a member

Node appendChild(Node newChild) which adds a new child to a node

The Element class has a member

void setAttribute(String name, String value)

Using these, you can create a new xml document from scratch. Once this has been created, you can write it to a file like this.

           
      Transformer t = 
            TransformerFactory.newInstance().newTransformer();
      t.transform(new DOMSource(doc), new StreamResult(filename));
where doc is the Document and filename is the name of the file.

Here is a short program which demonstrates this.

/* a program to write an xml file */
import javax.xml.parsers.*;
import javax.xml.transform.*; 
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.*;
import java.io.*;
import org.w3c.dom.*;
import java.lang.String;

public class DomWriter{
  public static void main(String argv[])
  {
    if (argv.length != 1) {
        System.err.println("Usage: java DomWriter filename");
        System.exit(1);
    }
    DocumentBuilderFactory factory =
        DocumentBuilderFactory.newInstance();
    try {
      DocumentBuilder builder = factory.newDocumentBuilder();
      Document doc = builder.newDocument();
      Element bookElement = doc.createElement("book");
      Attr versionAttr = doc.createAttribute("version");
      versionAttr.setValue("draft");
      bookElement.setAttributeNode(versionAttr);
      Element titleElement = doc.createElement("title");
      Element authorElement = doc.createElement("author");
      Element firstnameElement = doc.createElement("firstname");
      Element lastnameElement = doc.createElement("lastname");
      Text firstnameText = doc.createTextNode("Suzy");
      Text lastnameText = doc.createTextNode("Creamcheese");
      Text titleText = doc.createTextNode
          ("Network Programming for Dummies");
      doc.appendChild(bookElement);
      titleElement.appendChild(titleText);
      bookElement.appendChild(titleElement);
      bookElement.appendChild(authorElement);
      authorElement.appendChild(firstnameElement);
      authorElement.appendChild(lastnameElement);
      firstnameElement.appendChild(firstnameText);
      lastnameElement.appendChild(lastnameText);
      Comment theComment = doc.createComment("This is a comment");
      doc.appendChild(theComment);
           
      Transformer t = 
            TransformerFactory.newInstance().newTransformer();
      t.transform(new DOMSource(doc), new StreamResult(argv[0]));
    } catch (TransformerConfigurationException tce) {
      tce.printStackTrace();
    } catch (TransformerException tf) {
      tf.printStackTrace();
    } catch (ParserConfigurationException pce) {
      pce.printStackTrace();
    }
  } // end of main
}

This program takes the name of the output file as an argument. The contents of the file look like this

<?xml version="1.0" encoding="UTF-8"?>
<book version="draft"><title>Network Programming for Dummies</title><author><firstname>Suzy</firstname><lastname>Creamcheese</lastname></author></book><!--This is a comment--> 
Here is the file in a more human readable form
<?xml version="1.0" encoding="UTF-8"?>
<book version="draft">
     <title>Network Programming for Dummies</title>
     <author>
           <firstname>Suzy</firstname>
           <lastname>Creamcheese</lastname>
     </author>
</book>
<!--This is a comment-->
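The Transformer can be asked to add line breaks itself by setting the standard OutputKeys.INDENT output property. The sketch below uses invented names (IndentedWriter, sampleDocument, toIndentedXml) and writes to a StringWriter instead of a file so the result can be printed. How many spaces are used for each level depends on the Java version and transformer implementation, so no exact output is shown.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class IndentedWriter {
    // Hypothetical helper: builds a small book document from scratch.
    static Document sampleDocument() throws ParserConfigurationException {
        Document doc = DocumentBuilderFactory.newInstance()
                                             .newDocumentBuilder()
                                             .newDocument();
        Element book = doc.createElement("book");
        book.setAttribute("version", "draft");
        Element title = doc.createElement("title");
        title.appendChild(
            doc.createTextNode("Network Programming for Dummies"));
        book.appendChild(title);
        doc.appendChild(book);
        return doc;
    }

    // Serializes the document with line breaks between elements.
    static String toIndentedXml(Document doc) throws TransformerException {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.INDENT, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toIndentedXml(sampleDocument()));
    }
}
```

To write to a file instead, pass new StreamResult(new File(name)) as in the program above.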

There are also functions to modify an existing document. Here are some examples.

void removeAttribute(String name)
Node removeChild(Node oldchild)
Node replaceChild(Node newchild, Node oldchild)
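Here is a minimal sketch of the three calls applied to a parsed document. The class name ModifyDemo, the method names and the sample XML are made up for illustration. Note that removeChild and replaceChild must be called on the direct parent of the node being removed or replaced.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class ModifyDemo {
    // Hypothetical helper: parses XML held in a String.
    static Document parse(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        return DocumentBuilderFactory.newInstance()
                                     .newDocumentBuilder()
                                     .parse(new InputSource(new StringReader(xml)));
    }

    // Drops the version attribute, replaces the title, removes the author.
    static void revise(Document doc) {
        Element book = doc.getDocumentElement();
        book.removeAttribute("version");

        Node oldTitle = book.getElementsByTagName("title").item(0);
        Element newTitle = doc.createElement("title");
        newTitle.appendChild(doc.createTextNode("Network Programming, Second Draft"));
        book.replaceChild(newTitle, oldTitle);   // book is the parent of title

        Node author = book.getElementsByTagName("author").item(0);
        book.removeChild(author);                // book is the parent of author
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<book version=\"draft\">"
            + "<title>Network Programming for Dummies</title>"
            + "<author>Suzy Creamcheese</author></book>");
        revise(doc);
        Element book = doc.getDocumentElement();
        System.out.println("has version: " + book.hasAttribute("version"));
        System.out.println("title: " + book.getFirstChild().getTextContent());
        System.out.println("children: " + book.getChildNodes().getLength());
    }
}
```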

SAX Parsing

A second way to parse an XML document is with SAX (Simple API for XML). SAX is an event-driven, serial-access mechanism for reading XML documents. It is faster and less memory intensive than DOM, but harder to program, and does not allow you to back up.
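To give a feel for the event-driven style, here is a small sketch using the JAXP SAX parser. The class name SaxDemo and the method events are invented for this example. Instead of building a tree, the parser calls back into a handler as it reads each start tag, end tag and run of text, and the handler here simply records those calls in order.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SaxDemo {
    // Hypothetical helper: returns the sequence of SAX events as strings.
    static List<String> events(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        final List<String> events = new ArrayList<>();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                events.add("start " + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("end " + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // A parser may deliver one run of text in several calls;
                // this sketch records each non-blank chunk separately.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) events.add("text " + text);
            }
        });
        return events;
    }

    public static void main(String[] args) throws Exception {
        for (String e : events("<book><title>Unix</title></book>")) {
            System.out.println(e);
        }
    }
}
```

Because nothing is kept once an event has been delivered, the handler must save whatever it will need later, which is why SAX cannot back up.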
