Required Reading
The Sun Microsystems DOM and JAXP tutorial. Part 1, Reading XML Data into a DOM, only.
W3Schools has an excellent DOM tutorial.
The Document Object Model (DOM)
The DOM is a language-neutral and platform-neutral way to parse XML (and HTML) documents.
The XML DOM defines a standard set of objects for XML with standard properties and methods, and a standard way to access and manipulate XML documents.
The basic model is a tree structure. All elements, attributes, text, and other components can be accessed.
Each of these is a node.
There are DOM parsing implementations in Java, JavaScript, C++, C#, Perl, etc.
For our demo, we will use the Java JAXP DOM parser. This defines a Node. Node is an interface, and the interfaces for the specific kinds of node (Element, Attr, Text and so on) extend it.
There are 12 node types, each represented by an interface derived from Node. Here are some of the most commonly used Node methods for moving around the tree.
NamedNodeMap getAttributes() - returns a NamedNodeMap, which is a collection of all of the attributes of a node (null if the node is not an Element)
NodeList getChildNodes() - returns a list of the child nodes
Node getFirstChild()
Node getLastChild()
Node getNextSibling()
Node getParentNode()
There are three functions that you need to get the actual contents of a node.
short getNodeType() - returns the type of the node as a number. Here are the values.

| NodeType | Named Constant |
|---|---|
| 1 | ELEMENT_NODE |
| 2 | ATTRIBUTE_NODE |
| 3 | TEXT_NODE |
| 4 | CDATA_SECTION_NODE |
| 5 | ENTITY_REFERENCE_NODE |
| 6 | ENTITY_NODE |
| 7 | PROCESSING_INSTRUCTION_NODE |
| 8 | COMMENT_NODE |
| 9 | DOCUMENT_NODE |
| 10 | DOCUMENT_TYPE_NODE |
| 11 | DOCUMENT_FRAGMENT_NODE |
| 12 | NOTATION_NODE |

String getNodeName()
String getNodeValue()
What these return depends on the type of node.
| Interface | nodeName | nodeValue | attributes |
|---|---|---|---|
| Attr | name of attribute | value of attribute | null |
| CDATASection | "#cdata-section" | content of the CDATA Section | null |
| Comment | "#comment" | content of the comment | null |
| Document | "#document" | null | null |
| DocumentFragment | "#document-fragment" | null | null |
| DocumentType | document type name | null | null |
| Element | tag name | null | NamedNodeMap |
| Entity | entity name | null | null |
| EntityReference | name of entity referenced | null | null |
| Notation | notation name | null | null |
| ProcessingInstruction | target | entire content excluding the target | null |
| Text | "#text" | content of the text node | null |
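
To make the table concrete, here is a minimal sketch that prints the type, name and value of a few nodes. The class name NodeInfoDemo, the helpers describe and parse, and the one-line sample document are made up for illustration; the document is parsed from a string so that no input file is needed.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class NodeInfoDemo {
    // Formats the three basic properties of a node on one line.
    static String describe(Node n) {
        return n.getNodeType() + " " + n.getNodeName() + " " + n.getNodeValue();
    }

    // Parses XML held in a string; checked exceptions are wrapped to keep the sketch short.
    static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample: one element with an attribute, a comment and some text.
        Document doc = parse("<book isbn=\"123456\"><!--note-->Stevens</book>");
        Element book = doc.getDocumentElement();
        System.out.println(describe(doc));                          // 9 #document null
        System.out.println(describe(book));                         // 1 book null
        System.out.println(describe(book.getAttributeNode("isbn"))); // 2 isbn 123456
        System.out.println(describe(book.getFirstChild()));         // 8 #comment note
        System.out.println(describe(book.getLastChild()));          // 3 #text Stevens
    }
}
```

Note that the element itself has a null value. Its text lives in a separate Text child node.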
This should give you enough information to be able to traverse an xml document to find a particular piece of information.
There are also member functions which allow you to create a new xml document or modify an existing document. We will see these below.
Here is some skeleton code to create a Document from an xml file.
DocumentBuilderFactory factory =
    DocumentBuilderFactory.newInstance();
try {
    DocumentBuilder builder = factory.newDocumentBuilder();
    Document document = builder.parse( new File(argv[0]) );
    // this could also read from a URL or any other stream
    ...
A DocumentBuilderFactory defines a factory API that enables applications to obtain a parser that produces DOM object trees from XML documents. This has a member function newDocumentBuilder that returns a DocumentBuilder.
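The DocumentBuilder class has a member parse, which can take as an argument a File, an InputStream, an InputSource, or a String that holds a URI. This does the parsing (i.e. builds the tree). Validation is switched on through the factory: DocumentBuilderFactory has a member setValidating(boolean). If the argument is true, then the parser validates the XML file against a DTD (the default is false). The DocumentBuilder member isValidating() takes no argument and only reports whether validation is on.
The parse call returns a Document, called document in the code above.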
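
The skeleton above reads from a File. As a minimal sketch of one of the other parse overloads, the following builds a Document from XML text held in memory by wrapping a StringReader in an InputSource. The class name ParseStringDemo and the helper parseString are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ParseStringDemo {
    // Builds a DOM tree from XML text held in memory.
    static Document parseString(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            // An InputSource wraps a character stream. Passing the text directly
            // to parse(String) would not work, because that overload expects a URI.
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Wrapped to keep the sketch short; real code would handle
            // SAXException, IOException and ParserConfigurationException separately.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document document = parseString("<library><book/></library>");
        System.out.println(document.getDocumentElement().getNodeName()); // prints "library"
    }
}
```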
Here is a complete program which displays all of the nodes of an xml file. The children of a node are indented.
/* a dom parser that displays all data of an xml file */
import javax.xml.parsers.*;
import org.xml.sax.*;
import java.io.*;
import org.w3c.dom.*;
public class DomParser {
    public static void main(String argv[])
    {
        if (argv.length != 1) {
            System.err.println("Usage: java DomParser filename");
            System.exit(1);
        }
        DocumentBuilderFactory factory =
            DocumentBuilderFactory.newInstance();
        factory.setValidating(true);
        try {
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document document = builder.parse( new File(argv[0]) );
            // this could also read from a URL or any other stream
            Element root = document.getDocumentElement();
            if (root != null) PrintTree(root,0);
        } catch (SAXException sxe) {
            // Error generated during parsing
            System.err.println(sxe);
        } catch (ParserConfigurationException pce) {
            // Parser with specified options can't be built
            System.err.println(pce);
        } catch (IOException ioe) {
            System.err.println(ioe);
        }
    } // end of main
    // PrintTree - prints the dom tree, indenting each level
    // an additional 4 spaces
    private static void PrintTree(Element e, int indent)
    {
        int i;
        for (i=0;i<indent;i++) System.out.print(" ");
        System.out.println("Element Tag " + e.getNodeName());
        Node child = e.getFirstChild();
        while (child != null) {
            // use the named constants rather than the raw numbers 1, 3 and 8
            int type = child.getNodeType();
            if (type == Node.ELEMENT_NODE) {
                PrintTree((Element)child,indent+4);
            }
            else if (type == Node.TEXT_NODE) {
                for (i=0;i<indent+4;i++) System.out.print(" ");
                System.out.println("Text: " + child.getNodeValue());
            }
            else if (type == Node.COMMENT_NODE) {
                for (i=0;i<indent+4;i++) System.out.print(" ");
                System.out.println("Comment" + child.getNodeValue());
            }
            child = child.getNextSibling();
        }
    }
}
The main function is boilerplate. The function PrintTree goes through
the entire tree starting at the root. If a child is an element, the
function calls itself recursively. Otherwise it displays comments
and text.
The program was run with this file as input.
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE library SYSTEM "library.dtd">
<library>
    <!--- this is a comment -->
    <book isbn="123456">
        <title>Unix Network Programming</title>
        <author>W. Richard Stevens</author>
        <publisher>Addison Wesley</publisher>
    </book>
    <book isbn="987654">
        <title>Modern Operating Systems</title>
        <author>Andrew S. Tanenbaum</author>
        <publisher>Prentice Hall</publisher>
    </book>
</library>
Here is the output.
Element Tag library
    Text:
    Comment- this is a comment
    Text:
    Element Tag book
        Text:
        Element Tag title
            Text: Unix Network Programming
        Text:
        Element Tag author
            Text: W. Richard Stevens
        Text:
        Element Tag publisher
            Text: Addison Wesley
        Text:
    Text:
    Element Tag book
        Text:
        Element Tag title
            Text: Modern Operating Systems
        Text:
        Element Tag author
            Text: Andrew S. Tanenbaum
        Text:
        Element Tag publisher
            Text: Prentice Hall
        Text:
    Text:
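Notice that there are a lot of blank text children. This is because
the whitespace (line breaks and indentation) between elements is treated as text nodes by the parser.
You can prevent this by adding this line
factory.setIgnoringElementContentWhitespace(true);
after you create the factory. This setting relies on the DTD to tell which whitespace can be ignored, so it only has an effect when the parser is validating, as it is here.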
The output now looks like this:
Element Tag library
    Comment- this is a comment
    Element Tag book
        Element Tag title
            Text: Unix Network Programming
        Element Tag author
            Text: W. Richard Stevens
        Element Tag publisher
            Text: Addison Wesley
    Element Tag book
        Element Tag title
            Text: Modern Operating Systems
        Element Tag author
            Text: Andrew S. Tanenbaum
        Element Tag publisher
            Text: Prentice Hall
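
To see the effect of setIgnoringElementContentWhitespace in isolation, here is a minimal sketch that counts the children of the root element with the setting off and on. It assumes a small made-up document whose DTD is written inline (an internal subset), so that it runs without a separate library.dtd file. The class name WhitespaceDemo and the helper countRootChildren are only for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class WhitespaceDemo {
    // Hypothetical sample data: a cut-down library with an inline DTD.
    static final String XML =
          "<!DOCTYPE library [\n"
        + "  <!ELEMENT library (book*)>\n"
        + "  <!ELEMENT book (#PCDATA)>\n"
        + "]>\n"
        + "<library>\n"
        + "  <book>Unix Network Programming</book>\n"
        + "  <book>Modern Operating Systems</book>\n"
        + "</library>";

    // Parses XML and returns how many child nodes the root element has.
    static int countRootChildren(boolean ignoreWhitespace) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // The parser needs the DTD to know that library holds only elements.
            factory.setValidating(true);
            factory.setIgnoringElementContentWhitespace(ignoreWhitespace);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(XML)));
            return doc.getDocumentElement().getChildNodes().getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("keeping whitespace:  " + countRootChildren(false));
        System.out.println("ignoring whitespace: " + countRootChildren(true));
    }
}
```

With the whitespace kept, the two book elements are surrounded by three whitespace text nodes, so the first count should be 5. With the setting on, only the two book elements should remain.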
The Element interface has a number of other member functions.
NodeList getElementsByTagName(String name) returns a NodeList of all descendants of the element with a given tag name. This allows you to find elements with a particular tag name without writing the tree traversal yourself.
A NodeList is a list of Nodes. It has only two member functions, int getLength() and Node item(int i).
String getAttribute(String attrName) returns the value of an attribute.
NamedNodeMap getAttributes() returns all of the attributes of an element. This has member functions getLength() and item().
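
Here is a minimal sketch of how these fit together. It looks up every book with getElementsByTagName and reads its isbn attribute and title. The class name FindTitlesDemo, the helper listBooks and the inline sample data (a cut-down version of the library file) are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class FindTitlesDemo {
    // Hypothetical sample data, written without whitespace between elements.
    static final String XML = "<library>"
        + "<book isbn=\"123456\"><title>Unix Network Programming</title></book>"
        + "<book isbn=\"987654\"><title>Modern Operating Systems</title></book>"
        + "</library>";

    // Returns one "isbn: title" string per book element.
    static List<String> listBooks(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            List<String> result = new ArrayList<>();
            NodeList books = doc.getDocumentElement().getElementsByTagName("book");
            for (int i = 0; i < books.getLength(); i++) {
                Element book = (Element) books.item(i);
                // Search only below this book for its title element.
                Node title = book.getElementsByTagName("title").item(0);
                // The title's text is the value of its Text child node.
                result.add(book.getAttribute("isbn") + ": "
                    + title.getFirstChild().getNodeValue());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (String line : listBooks(XML)) {
            System.out.println(line);
        }
        // prints:
        // 123456: Unix Network Programming
        // 987654: Modern Operating Systems
    }
}
```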
The above program traversed the tree by getting the first child
of each node, then getting the next sibling, etc. An alternative
way of doing the same thing is to use the function
NodeList getChildNodes()
which returns a node list of all of the children. You can then
use the getLength and item members to access these.
Here is another sample program which traverses the tree using this method. It displays the attributes of each element and the number of children that each element has, so you can see the exact structure.
The main program is the same as in the first example apart from the call to PrintTree. The real difference is in the PrintTree function.
/* another dom parser */
import javax.xml.parsers.*;
import org.xml.sax.*;
import java.io.*;
import org.w3c.dom.*;
public class DomParser3 {
    public static void main(String argv[])
    {
        if (argv.length != 1) {
            System.err.println("Usage: java DomParser3 filename");
            System.exit(1);
        }
        DocumentBuilderFactory factory =
            DocumentBuilderFactory.newInstance();
        factory.setValidating(true);
        try {
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document document = builder.parse( new File(argv[0]) );
            // this could also read from a URL or any other stream
            Element root = document.getDocumentElement();
            if (root != null) PrintTree(root);
        } catch (SAXException sxe) {
            // Error generated during parsing
            System.err.println(sxe);
        } catch (ParserConfigurationException pce) {
            // Parser with specified options can't be built
            System.err.println(pce);
        } catch (IOException ioe) {
            ioe.printStackTrace();
        }
    } // end of main
    // PrintTree - prints the attributes and the child count of an
    // element, then does the same for each child element
    private static void PrintTree(Element e)
    {
        int i;
        Node n,child;
        NodeList children;
        NamedNodeMap attributes;
        attributes = e.getAttributes();
        System.out.println("Element " + e.getNodeName() + " has " +
            attributes.getLength() + " attributes");
        for(i=0;i<attributes.getLength();i++) {
            n = attributes.item(i);
            System.out.println(n.getNodeName() + "=" + n.getNodeValue());
        }
        children = e.getChildNodes();
        System.out.println("Element " + e.getNodeName() + " has " +
            children.getLength() + " children");
        for (i=0;i<children.getLength();i++) {
            child = children.item(i);
            if (child.getNodeType()==Node.ELEMENT_NODE) PrintTree((Element)child);
        }
    }
}
Here is the output when this was run with library.xml.
Element library has 0 attributes
Element library has 7 children
Element book has 1 attributes
isbn=123456
Element book has 7 children
Element title has 0 attributes
Element title has 1 children
Element author has 0 attributes
Element author has 1 children
Element publisher has 0 attributes
Element publisher has 1 children
Element book has 1 attributes
isbn=987654
Element book has 7 children
Element title has 0 attributes
Element title has 1 children
Element author has 0 attributes
Element author has 1 children
Element publisher has 0 attributes
Element publisher has 1 children
Notice that the element book has seven children. There are the three obvious children, title, author and publisher, but also four text children, made up of the whitespace before, between and after these. If the XML file were all scrunched together without whitespace, these would go away, and it would still be syntactically correct, but hard for humans to read.
You can also solve this problem (if you see it as a problem) with
this line, after you create the factory as discussed above
factory.setIgnoringElementContentWhitespace(true);
Using the DOM Parser to write an XML file
There are also member functions which can be used to write an XML file or to modify a file. This is useful for converting data from a database into XML.
You can create a new document with the DocumentBuilder newDocument() member function. Once this has been done, Document has member functions such as the following.
Element createElement(String tagname)
Comment createComment(String commentstring)
Text createTextNode(String data)
The Node interface has a member
Node appendChild(Node newChild) which adds a new child to a node
The Element interface has a member
void setAttribute(String name, String value)
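Using these, you can create a new XML document from scratch. Once this has been created, you can write it to a file like this.
Transformer t =
    TransformerFactory.newInstance().newTransformer();
t.transform(new DOMSource(doc), new StreamResult(new File(filename)));
where doc is the Document and filename is the name of the file. (StreamResult also has a constructor that takes a String, but that String is meant to be a system ID in URI form rather than a plain file name, so wrapping the name in a File is the safer choice.)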
Here is a short program which demonstrates this.
/* a program to write an xml file */
import javax.xml.parsers.*;
import javax.xml.transform.*;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import java.io.*;
import org.w3c.dom.*;
public class DomWriter {
    public static void main(String argv[])
    {
        if (argv.length != 1) {
            System.err.println("Usage: java DomWriter filename");
            System.exit(1);
        }
        DocumentBuilderFactory factory =
            DocumentBuilderFactory.newInstance();
        try {
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.newDocument();
            // create all of the nodes first
            Element bookElement = doc.createElement("book");
            Attr versionAttr = doc.createAttribute("version");
            versionAttr.setValue("draft");
            bookElement.setAttributeNode(versionAttr);
            Element titleElement = doc.createElement("title");
            Element authorElement = doc.createElement("author");
            Element firstnameElement = doc.createElement("firstname");
            Element lastnameElement = doc.createElement("lastname");
            Text firstnameText = doc.createTextNode("Suzy");
            Text lastnameText = doc.createTextNode("Creamcheese");
            Text titleText = doc.createTextNode
                ("Network Programming for Dummies");
            // then link them together into a tree
            doc.appendChild(bookElement);
            titleElement.appendChild(titleText);
            bookElement.appendChild(titleElement);
            bookElement.appendChild(authorElement);
            authorElement.appendChild(firstnameElement);
            authorElement.appendChild(lastnameElement);
            firstnameElement.appendChild(firstnameText);
            lastnameElement.appendChild(lastnameText);
            // the comment goes after the root element
            Comment theComment = doc.createComment("This is a comment");
            doc.appendChild(theComment);
            // write the tree to the output file
            Transformer t =
                TransformerFactory.newInstance().newTransformer();
            t.transform(new DOMSource(doc), new StreamResult(new File(argv[0])));
        }
        catch (TransformerConfigurationException tce) {
            tce.printStackTrace();
        } catch (TransformerException tf) {
            tf.printStackTrace();
        } catch (ParserConfigurationException pce) {
            pce.printStackTrace();
        }
    } // end of main
}
This program takes the name of the output file as an argument. The contents of the file look like this:
<?xml version="1.0" encoding="UTF-8"?> <book version="draft"><title>Network Programming for Dummies</title><author><firstname>Suzy</firstname><lastname>Creamcheese</lastname></author></book><!--This is a comment-->
Here is the file in a more human readable form:
<?xml version="1.0" encoding="UTF-8"?>
<book version="draft">
    <title>Network Programming for Dummies</title>
    <author>
        <firstname>Suzy</firstname>
        <lastname>Creamcheese</lastname>
    </author>
</book>
<!--This is a comment-->
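
If you want the Transformer itself to produce the more readable layout, one option is the standard INDENT output property. The following is a minimal sketch under that assumption. It writes to a string instead of a file so the result can be printed directly, and the class name IndentDemo and the helpers sampleDocument and toIndentedXml are made up for illustration. The exact amount of indentation depends on the Java version.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class IndentDemo {
    // Builds a cut-down version of the book document from the program above.
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            Element book = doc.createElement("book");
            book.setAttribute("version", "draft");
            Element title = doc.createElement("title");
            title.appendChild(doc.createTextNode("Network Programming for Dummies"));
            book.appendChild(title);
            doc.appendChild(book);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Serializes the document with line breaks between elements.
    static String toIndentedXml(Document doc) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            // Ask the serializer to add line breaks and indentation.
            t.setOutputProperty(OutputKeys.INDENT, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toIndentedXml(sampleDocument()));
    }
}
```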
There are also functions to modify an existing document. Here are some examples. The first is a member of Element, and the other two are members of Node.
void removeAttribute(String name)
Node removeChild(Node oldchild)
Node replaceChild(Node newchild, Node oldchild)
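
Here is a minimal sketch that uses all three on a small document. The class name ModifyDemo, the helper editedBook and the sample data are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class ModifyDemo {
    // Parses a small book document, edits it in memory and returns the root.
    static Element editedBook() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(
                    "<book version=\"draft\"><title>Old Title</title>"
                    + "<author>Suzy</author></book>")));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        Element book = doc.getDocumentElement();
        // Drop the version attribute.
        book.removeAttribute("version");
        // Remove the author element, which is the last child here.
        Node author = book.getLastChild();
        book.removeChild(author);
        // Swap the old title for a newly created one.
        Element newTitle = doc.createElement("title");
        newTitle.appendChild(doc.createTextNode("New Title"));
        book.replaceChild(newTitle, book.getFirstChild());
        return book;
    }

    public static void main(String[] args) {
        Element book = editedBook();
        System.out.println(book.hasAttribute("version"));          // prints "false"
        System.out.println(book.getChildNodes().getLength());      // prints "1"
        System.out.println(book.getFirstChild().getTextContent()); // prints "New Title"
    }
}
```

Note the argument order of replaceChild: the new node comes first and the node being replaced comes second.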
SAX Parsing
A second way to parse an XML document is with SAX (Simple API for XML). SAX is an event-driven, serial-access mechanism for reading XML documents. It is faster and less memory intensive than DOM, but harder to program, and it does not allow you to back up.
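
To give a feel for the event-driven style, here is a minimal sketch using the JAXP SAXParser with a DefaultHandler. The parser calls the handler methods as it reads through the document, and the handler simply records each event. The class name SaxDemo, the helper events and the sample document are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxDemo {
    // Parses XML and returns a log of the events the parser reported.
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                log.add("start " + qName);
            }
            @Override
            public void endElement(String uri, String localName, String qName) {
                log.add("end " + qName);
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                // Skip whitespace-only text. A parser may also deliver long
                // text in several pieces, which this sketch does not merge.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) log.add("text " + text);
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events("<book><title>Unix</title></book>")) {
            System.out.println(event);
        }
    }
}
```

No tree is built here. Once an event has been reported the parser moves on, which is why SAX uses less memory and why you cannot back up.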