XML
HTML is good for displaying documents for humans to read, but it is poor for describing semantics or allowing programs to search documents looking for information. Many people have written programs to do "screen scraping", parsing an HTML document to get information. This is hard to do and is usually based on assumptions about how an HTML document is configured.
For example, suppose that you want to write a program to do stock market analysis, so each night you send queries to a server that returns HTML files containing the closing price, trading volume, etc. for stocks. You have to write a program to extract the information that you want from the HTML, a non-trivial task. If the server changes the format of the HTML pages that it sends, you have to rewrite your program.
XML addresses this problem. XML is a cross-platform, software- and hardware-independent format for transmitting information. If the server sent an XML file (with a link to a separate style sheet file containing instructions for how to display the information), writing a program to extract the information that you want would be much easier.
XML stands for eXtensible Markup Language.
XML is a markup language much like HTML. Both are descended from SGML (Standard Generalized Markup Language), which grew out of the GML work begun at IBM in the late 1960s and became an ISO standard in 1986.
HTML is about displaying information, while XML is about describing information.
XML does not do anything by itself. It is a way of describing information.
XML can enable smarter searches. If you are looking for computer chips, a search that understands the markup need not return people named Chip or chocolate chip cookies.
Most people think that XML will be everywhere soon.
The syntax rules of XML are very simple and very strict. The rules are very easy to learn, and very easy to use.
<?xml version="1.0" encoding="ISO-8859-1"?>
<library>
<book isbn="123456">
<title>Unix Network Programming</title>
<author>W. Richard Stevens</author>
<publisher>Addison Wesley</publisher>
</book>
<book isbn="987654">
<title>Modern Operating Systems</title>
<author>Andrew S. Tanenbaum</author>
<publisher>Prentice Hall</publisher>
</book>
</library>
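Because the structure of the library document is explicit, a program can pull fields out of it without guessing at page layout. Here is a minimal sketch in Python, assuming the standard-library xml.etree.ElementTree parser is acceptable; the function name list_books is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# The library document from the notes, as bytes so the encoding declaration is honored.
LIBRARY_XML = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<library>
<book isbn="123456">
<title>Unix Network Programming</title>
<author>W. Richard Stevens</author>
<publisher>Addison Wesley</publisher>
</book>
<book isbn="987654">
<title>Modern Operating Systems</title>
<author>Andrew S. Tanenbaum</author>
<publisher>Prentice Hall</publisher>
</book>
</library>"""


def list_books(xml_bytes):
    """Return (isbn, title, author) for every book element under the root."""
    root = ET.fromstring(xml_bytes)
    return [
        (book.get("isbn"), book.findtext("title"), book.findtext("author"))
        for book in root.findall("book")
    ]


for isbn, title, author in list_books(LIBRARY_XML):
    print(isbn, title, author)
```

The program asks for elements and attributes by name, so it keeps working if the server reorders the child elements or adds new ones.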
An XML document has these basic building blocks.
We will see other components of an XML document below, but these are the most important.
Differences between XML and HTML
Superficially, XML looks a lot like HTML. This is because they share a common ancestor, SGML. There are some important differences.
XML is case sensitive: <title>, <Title>, and <TITLE> are all different elements.
Comments are the same as in HTML: <!-- this is a comment -->
Don't put comments inside tags or before the XML declaration.
An XML document is judged against two criteria: it must be well-formed, and it may also be required to be valid. Well-formed means that it obeys the syntax rules outlined above. Browsers are pretty tolerant of poorly formed HTML, but an XML parser must reject a document that is not well-formed.
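A parser enforces well-formedness mechanically. The sketch below assumes Python's standard xml.etree.ElementTree parser; the helper name well_formedness_error and the three small test documents are hypothetical. It returns the parser's complaint, or None when the document is well-formed.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(document):
    """Return the parser's error message, or None if the document is well-formed."""
    try:
        ET.fromstring(document)
    except ET.ParseError as error:
        return str(error)
    return None


GOOD = "<book><title>Modern Operating Systems</title></book>"
# The closing tag differs in case, so it does not match the opening tag.
BAD_CASE = "<book><title>Modern Operating Systems</Title></book>"
# The title element is never closed.
BAD_UNCLOSED = "<book><title>Modern Operating Systems</book>"

for sample in (GOOD, BAD_CASE, BAD_UNCLOSED):
    print(well_formedness_error(sample))
```

Unlike a browser reading HTML, the parser does not try to recover: the first error ends the parse.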
Valid means that the document is well-formed and also conforms to a Document Type Definition (DTD), which is a template for the document.
DTDs tell writers how to write documents. For example, the big news services have a DTD for news stories: anyone can write and submit a news story, but it has to conform to the DTD.
Here are some other examples.
A web site that searches the web for hard-to-find auto parts can share a DTD with all of its suppliers.
Genealogical Markup Language (GedML) and the Chemical Markup Language (CML) are examples of XML vocabularies. GedML describes ancestral data and CML describes chemical formulas and molecules.
XBRL (short for eXtensible Business Reporting Language) is an XML-based standard for handling corporate financial information.
XML is the basis for Web service protocols such as SOAP (Simple Object Access Protocol).
XHTML - a translation of HTML into XML
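<b><i>This text is bold and italic</b></i>
works in HTML browsers but not in XHTML, because the elements are not properly nested. The correct form is
<b><i>This text is bold and italic</i></b>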
These are OK in HTML, but not in XHTML, because the empty elements are never closed
This is a break<br> Here comes a horizontal rule:<hr> Here's an image <img src="happy.gif" alt="Happy face">
Here is the correct form for XHTML
This is a break<br /> Here comes a horizontal rule:<hr /> Here's an image <img src="happy.gif" alt="Happy face" />
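Since XHTML is XML, the difference between the two forms can be checked with an ordinary XML parser. This is a rough sketch in Python using the standard xml.etree.ElementTree module; the helper name and the wrapping div element are assumptions added so that each fragment has a single root.

```python
import xml.etree.ElementTree as ET


def fragment_is_well_formed(fragment):
    """Wrap a fragment in a root element and report whether it parses as XML."""
    try:
        ET.fromstring("<div>" + fragment + "</div>")
    except ET.ParseError:
        return False
    return True


HTML_STYLE = ('This is a break<br> Here comes a horizontal rule:<hr> '
              'Here\'s an image <img src="happy.gif" alt="Happy face">')
XHTML_STYLE = ('This is a break<br /> Here comes a horizontal rule:<hr /> '
               'Here\'s an image <img src="happy.gif" alt="Happy face" />')
MISNESTED = "<b><i>This text is bold and italic</b></i>"
NESTED = "<b><i>This text is bold and italic</i></b>"

for sample in (HTML_STYLE, XHTML_STYLE, MISNESTED, NESTED):
    print(fragment_is_well_formed(sample))
```

This checks well-formedness only. It does not check the fragment against the XHTML DTD.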
There are some mandatory elements defined in the XHTML DTD. The root element is html.
An XHTML document consists of three main parts (all required):
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
There are three types: Strict, Transitional, and Frameset.
The rules for writing a DTD
Here is a sample DTD for the library example
<!ELEMENT library (book*)>
<!ELEMENT book (title, author, publisher)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT publisher (#PCDATA)>
<!ATTLIST book isbn CDATA #REQUIRED>
The <!ELEMENT declaration contains the name of the element followed by its content model. This uses a regular-expression-like language. The asterisk indicates zero or more. The first line of the example says that a library consists of zero or more books.
A sequence of required elements is indicated by a list of element names delimited by commas. The second line of the example says that a book consists of one title, followed by one author, followed by one publisher.
An element which consists only of text is indicated by the keyword #PCDATA.
Attributes are indicated by an ATTLIST. This consists of the name of the element to which the attribute applies, the name of the attribute, the type of the attribute (a string is indicated by the keyword CDATA), and then a default declaration such as #REQUIRED (the attribute must be present) or #IMPLIED (the attribute may be omitted).
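To make the effect of #REQUIRED concrete, here is a small Python sketch of the check a validating parser would perform for the isbn attribute. It is not a DTD parser: the REQUIRED_ATTRIBUTES table is a hand-written, hypothetical stand-in for the ATTLIST line, and the function name is made up.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for: <!ATTLIST book isbn CDATA #REQUIRED>
REQUIRED_ATTRIBUTES = {"book": ["isbn"]}


def missing_attributes(xml_text):
    """Return (element, attribute) pairs for every required attribute that is absent."""
    problems = []
    for element in ET.fromstring(xml_text).iter():
        for name in REQUIRED_ATTRIBUTES.get(element.tag, []):
            if name not in element.attrib:
                problems.append((element.tag, name))
    return problems


WITH_ISBN = '<library><book isbn="123456"><title>T</title></book></library>'
WITHOUT_ISBN = "<library><book><title>T</title></book></library>"

print(missing_attributes(WITH_ISBN))
print(missing_attributes(WITHOUT_ISBN))
```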
The DTD is referenced from the XML document with a DOCTYPE declaration. If the DTD is in the same filesystem as the XML document, the DOCTYPE declaration has this syntax.
<!DOCTYPE root-element SYSTEM "filename">
If our DTD for the example was in the file library.dtd, this line would be added as the second line of the XML document
<!DOCTYPE library SYSTEM "library.dtd">
Here are some other operators.
The + indicates 1 or more
The ? indicates zero or 1
The | indicates choice (logical or)
Here are some other keywords.
EMPTY - an empty element
ANY - any content is acceptable
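The content-model notation really is close to regular expressions. As a rough illustration (not how a real validating parser is implemented), the Python sketch below turns a content model into a regular expression over child element names and checks a document against it. The function names and the hand-written model dictionaries are hypothetical, and the sketch ignores #PCDATA, EMPTY, ANY, and attributes.

```python
import re
import xml.etree.ElementTree as ET


def model_to_regex(model):
    """Translate e.g. '(title, author+, publisher?)' into a regex over 'name;' tokens."""
    # Each element name becomes a group matching that name followed by ';'.
    pattern = re.sub(r"[A-Za-z_][\w.-]*",
                     lambda m: "(?:" + re.escape(m.group(0)) + ";)", model)
    # A comma means "followed by", which is plain concatenation in a regex.
    pattern = re.sub(r"[\s,]+", "", pattern)
    return re.compile(pattern)


def invalid_elements(xml_text, models):
    """Return the tags of elements whose children do not match their content model."""
    bad = []
    for element in ET.fromstring(xml_text).iter():
        model = models.get(element.tag)
        if model is None:
            continue  # no rule for this element in the sketch
        children = "".join(child.tag + ";" for child in element)
        if not model_to_regex(model).fullmatch(children):
            bad.append(element.tag)
    return bad


# The models from the sample DTD, and a looser variant using + and ?.
STRICT = {"library": "(book*)", "book": "(title, author, publisher)"}
LOOSE = {"library": "(book*)", "book": "(title, author+, publisher?)"}

TWO_AUTHORS = ("<library><book><title>T</title><author>A</author>"
               "<author>B</author></book></library>")

print(invalid_elements(TWO_AUTHORS, STRICT))
print(invalid_elements(TWO_AUTHORS, LOOSE))
```

The *, +, ?, and | operators and the parentheses pass through unchanged because they mean the same thing in both notations.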
XML Schemas
DTDs have some limitations. For example, consider this XML document.
<?xml version="1.0" ?>
<country>
<name>France</name>
<population>59.7</population>
</country>
We could write a simple DTD for this.
<!ELEMENT country (name, population)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT population (#PCDATA)>
But there is no way to specify in a DTD that population is a number.
Here is a schema for this, country.xsd
<xs:schema
xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="country">
<xs:complexType>
<xs:sequence>
<xs:element name="name" type="xs:string"/>
<xs:element name="population" type="xs:decimal"/>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
Note that this schema is itself an XML document, with the root element schema.
Note also that it includes two different predefined data types, string and decimal
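Because the schema is ordinary XML, a program can read it with the same parser it uses for the data. The Python sketch below is not a schema validator. It assumes the schema uses the prefix xs exactly as written above, handles only the two types that appear in country.xsd, and uses made-up function names. It reads the declared types and checks the text of each child element against them.

```python
import re
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

SCHEMA = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="country">
<xs:complexType>
<xs:sequence>
<xs:element name="name" type="xs:string"/>
<xs:element name="population" type="xs:decimal"/>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>"""

# xs:decimal allows an optional sign, digits, and an optional fraction, but no exponent.
DECIMAL = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)")
CHECKS = {
    "xs:string": lambda text: True,
    "xs:decimal": lambda text: DECIMAL.fullmatch(text.strip()) is not None,
}


def declared_types(schema_bytes):
    """Map element name -> declared type for every xs:element that has a type attribute."""
    root = ET.fromstring(schema_bytes)
    return {e.get("name"): e.get("type")
            for e in root.iter(XS + "element") if e.get("type")}


def type_errors(document, types):
    """Return the names of child elements whose text does not fit the declared type."""
    errors = []
    for child in ET.fromstring(document):
        check = CHECKS.get(types.get(child.tag))
        if check is not None and not check(child.text or ""):
            errors.append(child.tag)
    return errors


TYPES = declared_types(SCHEMA)
GOOD = "<country><name>France</name><population>59.7</population></country>"
BAD = "<country><name>France</name><population>about sixty</population></country>"
print(type_errors(GOOD, TYPES))
print(type_errors(BAD, TYPES))
```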
Here are some of the built-in data types
<xs:attribute name="lang" type="xs:string" use="required"/>
In schemas you can set minimum and maximum values
<xs:element name="age">
<xs:simpleType>
<xs:restriction base="xs:integer">
<xs:minInclusive value="0"/>
<xs:maxInclusive value="100"/>
</xs:restriction>
</xs:simpleType>
</xs:element>
You can also define enumerated types
<xs:element name="car">
<xs:simpleType>
<xs:restriction base="xs:string">
<xs:enumeration value="Audi"/>
<xs:enumeration value="Golf"/>
<xs:enumeration value="BMW"/>
</xs:restriction>
</xs:simpleType>
</xs:element>
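The two restrictions above (a numeric range and an enumeration) can be read and applied by a short program. This Python sketch is only an illustration of what a validator does with these facets: the surrounding xs:schema wrapper, the function names, and the assumption that range facets apply to integers are all added here, and no other facets are handled.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

SCHEMA = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="age">
<xs:simpleType>
<xs:restriction base="xs:integer">
<xs:minInclusive value="0"/>
<xs:maxInclusive value="100"/>
</xs:restriction>
</xs:simpleType>
</xs:element>
<xs:element name="car">
<xs:simpleType>
<xs:restriction base="xs:string">
<xs:enumeration value="Audi"/>
<xs:enumeration value="Golf"/>
<xs:enumeration value="BMW"/>
</xs:restriction>
</xs:simpleType>
</xs:element>
</xs:schema>"""


def load_restrictions(schema_bytes):
    """Map element name -> facets found in its xs:restriction."""
    restrictions = {}
    for element in ET.fromstring(schema_bytes).iter(XS + "element"):
        restriction = element.find(XS + "simpleType/" + XS + "restriction")
        if restriction is None:
            continue
        facets = {"enumeration": []}
        for facet in restriction:
            name = facet.tag[len(XS):]  # strip the namespace part of the tag
            if name == "enumeration":
                facets["enumeration"].append(facet.get("value"))
            else:
                facets[name] = facet.get("value")
        restrictions[element.get("name")] = facets
    return restrictions


def satisfies(facets, text):
    """Check a text value against enumeration and inclusive range facets."""
    if facets["enumeration"] and text not in facets["enumeration"]:
        return False
    if "minInclusive" in facets or "maxInclusive" in facets:
        try:
            value = int(text)
        except ValueError:
            return False
        if "minInclusive" in facets and value < int(facets["minInclusive"]):
            return False
        if "maxInclusive" in facets and value > int(facets["maxInclusive"]):
            return False
    return True


RESTRICTIONS = load_restrictions(SCHEMA)
print(satisfies(RESTRICTIONS["age"], "42"), satisfies(RESTRICTIONS["car"], "Volvo"))
```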
Namespaces
In large documents there may be name clashes.
Consider these two xml fragments
<table> <tr> <td>Apples</td> <td>Bananas</td> </tr> </table>
<table> <name>African Coffee Table</name> <width>80</width> <length>120</length> </table>
There are two elements named table with different structures and different meanings. To solve this problem, use the namespace declaration.
<h:table xmlns:h="http://www.w3.org/TR/html4/">
<h:tr>
<h:td>Apples</h:td>
<h:td>Bananas</h:td>
</h:tr>
</h:table>
<f:table xmlns:f="http://www.w3schools.com/furniture">
<f:name>African Coffee Table</f:name>
<f:width>80</f:width>
<f:length>120</f:length>
</f:table>
The attribute xmlns:h declares a namespace and binds it to the prefix h. The attribute value should be a unique name, conventionally a URL. Note that the XML processor does not ordinarily go to the URL named in the attribute value; it is there simply to ensure uniqueness.
Two namespaces are declared, with the prefixes h and f. Each element name is preceded by its namespace prefix, delimited by a colon. This ensures that f:table and h:table are treated as different elements by the XML processor.
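One way to see that the processor treats the two table elements as different is to parse them and look at the names it reports. The sketch below assumes Python's standard xml.etree.ElementTree, which expands each prefixed name to {namespace}localname; the wrapping root element is added here only so that both fragments fit in one document.

```python
import xml.etree.ElementTree as ET

DOC = """<root>
<h:table xmlns:h="http://www.w3.org/TR/html4/">
<h:tr><h:td>Apples</h:td><h:td>Bananas</h:td></h:tr>
</h:table>
<f:table xmlns:f="http://www.w3schools.com/furniture">
<f:name>African Coffee Table</f:name>
<f:width>80</f:width>
<f:length>120</f:length>
</f:table>
</root>"""

# Prefixes used in the searches below. Only the URIs have to match the document.
NS = {"h": "http://www.w3.org/TR/html4/", "f": "http://www.w3schools.com/furniture"}

root = ET.fromstring(DOC)

# The parser replaces each prefix with the full namespace name.
tags = [child.tag for child in root]
print(tags)

# Searching by namespace finds exactly one table of each kind.
furniture = root.find("f:table", NS)
print(furniture.findtext("f:name", namespaces=NS))
print([cell.text for cell in root.find("h:table", NS).iter("{http://www.w3.org/TR/html4/}td")])
```

The prefix is only shorthand. Two documents can use different prefixes for the same namespace and the processor will still treat their elements as the same.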
Required Reading
W3Schools has some wonderful tutorials. Here is their XML tutorial. Go through the entire Basic Tutorial and the Advanced Tutorial on Namespaces.
This is not required reading, but Sun Microsystems has a good XML tutorial as well.