EIW Fall 2003 Lecture Notes

XML and XHTML


Markup languages

A markup language is a set of rules that define the structure of an electronic document. It is important to establish such languages (rules) so that computer programs can make sense out of electronic documents. By "make sense", I mean that programs can display, generate and extract information from documents. Unless the programmer knows what kind of document his program should deal with, it would be impossible for the programmer to create a program that could do a good job at displaying or generating documents.

SGML: Standard Generalized Markup Language

SGML is a meta markup language: a markup language that is used to describe other markup languages. SGML is very useful, although very complex. HTML (4.01) is defined using SGML, as are many other markup languages. HTML is probably one of the simpler languages ever defined using SGML; it covers a "very simple class of report-style documents" (quoted from the XML FAQ). The "simplicity" of HTML does not imply anything about the range of content possible, but rather about the document structure. HTML supports a fixed, pre-defined set of tags. You can't make up your own tags - you have to use only those tags that are supported by HTML, or programs won't be able to make sense of your documents.

XML: Extensible Markup Language

XML is also a meta markup language, although it is much simpler than SGML (and also less powerful). XML was created, in part, to provide a formal mechanism for extending HTML without needing to use SGML. Using XML, anyone can create their own markup language (tags), along with definitions of how browsers should display documents written in it. A number of organizations have created new markup languages that are widely used, such as CML: Chemical Markup Language, MathML: Mathematical Markup Language and VML: Vector Markup Language.

XML Documents

An XML document must be structured according to the rules that define what markup is allowed and the structure of the markup. Like HTML, an XML document is a text document that contains plain text and markup. The markup is composed of tags and XML declarations. There is the notion of a "well-formed XML document": a document that obeys all of XML's syntax rules. Any XML parser can determine whether a document is well-formed. Programs called "validators" go one step further and check the document against a DTD; a document that passes this check is called "valid".

The "rules" for the document/markup structure of a particular document type are defined in a Document Type Definition (DTD). The XML specification defines the general rules that every XML document must follow; in addition, you can create your own DTD that defines your own XML document type (documents of that type are still subject to the general rules of XML).

Below are informal descriptions of some of the "rules" that define what is a valid XML document:

If you create your own set of rules (DTD), you have created your own markup language, although documents written in it are still XML documents. The languages mentioned above (CML, MathML) are examples. So a MathML document is an XML document that is subject to the rules of the MathML DTD; these rules define the set of acceptable tag names, attribute values and the overall structure of a document.

You don't have to create a DTD to use XML. Without one, your document is "just" an XML document (not a pre-defined special kind of document, like a MathML or CML document).
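The idea of a well-formed document is easy to try out with an XML parser. Below is a minimal sketch in Python, assuming the standard library's xml.etree.ElementTree module (these notes don't prescribe any particular tool; the helper name is_well_formed and the sample strings are made up for illustration). Keep in mind that this parser only checks well-formedness; it does not validate a document against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts text as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample strings (not taken from any real file).
good = "<student><first>Joe</first><last>Smith</last></student>"
mismatched = "<student><middle>X.</first></student>"   # wrong closing tag
unclosed = "<student><first>Joe</student>"             # <first> never closed
unquoted = "<courses semester=fall03></courses>"       # attribute value not quoted

for label, doc in [("good", good), ("mismatched", mismatched),
                   ("unclosed", unclosed), ("unquoted", unquoted)]:
    print(label, "->", "well-formed" if is_well_formed(doc) else "NOT well-formed")
```

Only the first string should be accepted; the other three each break one of the basic syntax rules (matching tags, closed tags, quoted attribute values).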

Sample XML Document

Below is a sample XML document (without any reference to a DTD). This document might represent part of a database of student records. The document itself contains information related to a single student.

<?xml version="1.0"?>

<student>
  <rid>660012345</rid>
  <first>Joe</first>
  <middle>X.</middle>
  <last>Smith</last>
  <courses semester="fall03">
    <course>
      <name>Exploiting the Information World</name>
      <crn>12345</crn>
      <num>ITEC-2110</num>
    </course>

    <course>
     <name>XML DTD Creation</name>
     <crn>82828</crn>
    </course>
  </courses>

  <address>123 main street</address>
  <phone>555-2929</phone>
  <im>jsiscool</im>
</student>

You can load this document in your browser to see what your browser does with XML: stu.xml
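Because the document is XML, a program can also extract information from it without knowing anything about how it is displayed. Here is a rough sketch in Python, again assuming the standard xml.etree.ElementTree module; the function name summarize and the shortened copy of the student record embedded in the code are just for illustration (the closing tags in the copy are all matched, which the parser requires).

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample student record (illustrative only).
STUDENT_XML = """<?xml version="1.0"?>
<student>
  <rid>660012345</rid>
  <first>Joe</first>
  <middle>X.</middle>
  <last>Smith</last>
  <courses semester="fall03">
    <course>
      <name>Exploiting the Information World</name>
      <crn>12345</crn>
      <num>ITEC-2110</num>
    </course>
    <course>
      <name>XML DTD Creation</name>
      <crn>82828</crn>
    </course>
  </courses>
</student>
"""


def summarize(xml_text: str) -> dict:
    """Pull the name, semester and course list out of a student record."""
    root = ET.fromstring(xml_text)
    # Element text is read by tag name; a missing element yields "".
    name = " ".join(root.findtext(tag, default="")
                    for tag in ("first", "middle", "last"))
    courses = root.find("courses")
    semester = courses.get("semester")  # attribute on <courses>
    # One (name, crn, num) tuple per <course>; num is None when absent.
    course_list = [(c.findtext("name"), c.findtext("crn"), c.findtext("num"))
                   for c in courses.findall("course")]
    return {"name": name, "semester": semester, "courses": course_list}


summary = summarize(STUDENT_XML)
print(summary["name"], "-", summary["semester"])
for course_name, crn, num in summary["courses"]:
    print(" ", course_name, crn, num)
```

Note that the second course has no num element, so the program has to cope with a missing value; nothing in a DTD-less document forces every course to have the same children.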

CSS and XML

You can create CSS rules that tell the browser how you want to display the content of each XML tag. Here are some rules for the above XML document:

student { display:block;
          border: solid 2px black;
          background-color:wheat;
          width:50% }

first,last,middle { border:2; 
          font-family:sans-serif;
          font-weight: bold; }

address,phone,im { display:block; 
          font-family:sans-serif;
          font-weight: bold; }

name { font-family:sans-serif;
          color:green;
          display: block; 
          font-style:italic}

courses { margin-left: .3in;
          display: block;}

course {margin-top:.2in;
          display:block;
          border:solid 1px gray}
	
rid,crn { display: block; font-size: 14pt;
          font-weight: bold;
          font-family:Courier; 
          color: blue; }

You can load this version of the document in your browser to see what your browser does with the same XML file when associated with the above CSS file: stucss.xml

XSL: Extensible Stylesheet Language

Using CSS with an XML file is one way of telling a browser how to render the XML document. CSS is very limited, as all you can specify are rules for colors, fonts, position, etc. - you can't rearrange the document at all or do things like create a table of contents (in fact, CSS rules can create essentially no new content).

XSL is a completely different mechanism for specifying style: you can create an XSL stylesheet that includes transformations that allow you to rearrange and/or add content. An XSL stylesheet is itself an XML document; the individual elements in this document define transformation rules that are applied to the various elements of the original XML document. The result can be another XML document or an HTML document (actually XHTML).

Defining an XSLT (XSL Transformation) stylesheet is a little like writing a program, although the language used is based on XML. An example is provided below, but don't worry if it doesn't make sense (we are not treating XSLT as a topic to explore in this course) - the general idea of "what is possible" is the important thing.

Here is the actual XSLT stylesheet that is used by the browser to transform the stu.xml file into HTML:

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0" >
<xsl:output method="html" />

<xsl:template match="student">
  <html>
    <head>
     <title>Student Record</title>
     <style type="text/css">
       p {font-family:arial,sans-serif}
     </style>
    </head>
    <body>
      <h1 align="center">Student Record for 
          <xsl:value-of select="first" /> <xsl:text> </xsl:text>
          <xsl:value-of select="middle" /><xsl:text> </xsl:text> 
          <xsl:value-of select="last" /> 
      </h1>
      <hr />

      <p>Name: <xsl:value-of select="first" /> <xsl:text> </xsl:text>
          <xsl:value-of select="middle" /> <xsl:text> </xsl:text>
          <xsl:value-of select="last" /> </p>

      <p>RPI ID#: <xsl:value-of select="rid" /></p>
      <p>Address: <xsl:value-of select="address" /></p>
      <p>Phone: <xsl:value-of select="phone" /></p>
      <p>Instant Messenger ID: <xsl:value-of select="im" /></p>
      <xsl:apply-templates select="courses" />
     </body> 
   </html>
</xsl:template>

<xsl:template match="courses">
  <p>Courses for <xsl:value-of select="@semester"  />:</p>
   <xsl:for-each select="course">
     <div style="margin-left:.3in">
     <p>Name: <xsl:value-of select="name" /><br  />
        CRN:  <xsl:value-of select="crn" /><br  />
        Course #:  <xsl:value-of select="num" />
     </p>
     </div>
   </xsl:for-each>      
</xsl:template>

</xsl:stylesheet>

Here is a link to the file that includes a reference to the above XSL stylesheet: stuxsl.xml.
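To get a feel for what the "courses" template is doing, here is the same rearrangement written by hand in Python. This is not XSLT and not how a browser processes the stylesheet; it is only a sketch, assuming the standard xml.etree.ElementTree and html modules, with a made-up function name (render_courses) and a trimmed copy of the courses element.

```python
import xml.etree.ElementTree as ET
from html import escape

# Trimmed copy of the <courses> element from the sample record.
COURSES_XML = """<courses semester="fall03">
  <course>
    <name>Exploiting the Information World</name>
    <crn>12345</crn>
    <num>ITEC-2110</num>
  </course>
  <course>
    <name>XML DTD Creation</name>
    <crn>82828</crn>
  </course>
</courses>"""


def render_courses(courses: ET.Element) -> str:
    """Build an HTML fragment, mimicking the template that matches 'courses'."""
    # Like <xsl:value-of select="@semester" />
    parts = [f"<p>Courses for {escape(courses.get('semester', ''))}:</p>"]
    # Like <xsl:for-each select="course">
    for course in courses.findall("course"):
        # A missing child produces an empty string, as value-of does.
        name = escape(course.findtext("name", default=""))
        crn = escape(course.findtext("crn", default=""))
        num = escape(course.findtext("num", default=""))
        parts.append(
            f'<div style="margin-left:.3in"><p>Name: {name}<br />'
            f"CRN: {crn}<br />Course #: {num}</p></div>"
        )
    return "\n".join(parts)


fragment = render_courses(ET.fromstring(COURSES_XML))
print(fragment)
```

The point of the comparison is that the stylesheet generates new content (the "Name:" and "CRN:" labels, the wrapping div) around values pulled from the source document, which is something CSS rules alone are not designed to do.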

XHTML: Extensible Hypertext Markup Language

XHTML is an XML-ized version of HTML 4.01. The general idea is to develop an XML markup language that matches HTML as closely as possible. The W3C (which is responsible for the standards for HTML, XML and XHTML) hopes that XHTML will replace HTML as the commonly used language on the WWW. XHTML makes it possible for programs that deal with XML to handle web pages as well. HTML is not XML, but XHTML is.

XHTML document types

The standard for XHTML includes 3 different DTDs (each describing a slightly different markup language). When you create an XHTML document you must select the appropriate DTD and then obey the rules enforced by that DTD. The three DTDs are:

  1. strict: The document must not include any tags or attributes that have been deprecated in HTML 4.01. For example, you can't use FONT tags or the ALIGN attribute (these effects need to be specified using CSS).

  2. transitional: The document type definition supports the common usage of HTML 4.01, so older tags and attributes that are still supported by browsers (but have been deprecated and are excluded from the strict DTD) are OK. This is typically the DTD you would use to create XHTML documents.

  3. frameset: The document type definition supports everything supported by the transitional DTD, and also supports frames.

XHTML documents are XML, so they must include information about what version of XML is being used (1.0 is the current version), and what DTD to use. Here is what the top of your documents should look like:

<?xml version="1.0"?>
<!DOCTYPE html
     PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

XHTML Documents

XHTML documents look very much like HTML documents. The primary differences are the XML declaration and DTD specification (shown above) and the additional rules on tag structure imposed by XML. The basic rules are:

The Future of HTML, XHTML and XML

It's hard to predict what will happen in the future, but it is generally agreed that learning/using XHTML is not a waste of time. It is probably true that HTML documents will be around for a very long time, but it is certainly possible that the more formal structure of XHTML documents will be used by many programs other than browsers (it is much easier to write a program that can handle XHTML than having to support all the oddities of HTML).
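The claim that XHTML is easier for programs to handle can be illustrated with an ordinary XML parser. The sketch below is in Python and assumes the standard xml.etree.ElementTree module; the two sample pages and the helper names (page_title, parses) are invented for this example. The xmlns attribute on the html element is the standard XHTML namespace, which is why the tag names are looked up with that namespace attached.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# A small, made-up XHTML page: every tag is closed and properly nested.
XHTML_PAGE = """<?xml version="1.0"?>
<!DOCTYPE html
     PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Student Record</title></head>
  <body>
    <p>Name: Joe X. Smith<br />
       Phone: 555-2929</p>
  </body>
</html>
"""

# The kind of HTML browsers tolerate: <br> and <p> are never closed.
SLOPPY_HTML = "<html><body><p>Name: Joe X. Smith<br>Phone: 555-2929</body></html>"


def page_title(text: str) -> str:
    """Return the text of the <title> element of an XHTML page."""
    root = ET.fromstring(text)
    return root.findtext(f"{{{XHTML_NS}}}head/{{{XHTML_NS}}}title")


def parses(text: str) -> bool:
    """Return True if a plain XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print("XHTML title:", page_title(XHTML_PAGE))
print("XHTML parses as XML:", parses(XHTML_PAGE))
print("Sloppy HTML parses as XML:", parses(SLOPPY_HTML))
```

A program that wanted to handle the second page would need an HTML-specific parser that knows which tags may be left open; for the XHTML page, any XML library will do.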

XML is widely accepted and used as an intermediate representation for data. Lots of systems can import and export data in XML format - this makes the exchange of data between applications and across networks simple and easy to implement (there are lots of XML libraries for lots of programming languages). It's not clear whether XML will ever replace HTML/XHTML as the language used to describe web documents. XHTML has some room for expansion (you can develop "modules" that extend XHTML; for example, there is a MathML module). It appears likely that these modules, rather than plain XML, will provide the mechanism for extending web pages in the future.