EIW Fall 2003 Lecture Notes
A markup language is a set of rules that define the structure of an electronic document. It is important to establish such languages (rules) so that computer programs can make sense out of electronic documents. By "make sense", I mean that programs can display, generate and extract information from documents. Unless a programmer knows what kind of document a program has to deal with, it is impossible to write a program that does a good job of displaying or generating documents.
SGML is a meta markup language: a markup language that is used to describe other markup languages. SGML is very useful, although very complex. HTML (4.01) is defined using SGML, as are many other markup languages. HTML is probably one of the simpler languages ever defined using SGML; it covers a "very simple class of report-style documents" (quoted from the XML FAQ). The "simplicity" of HTML does not imply anything about the range of content possible, but rather about the document structure. HTML supports a fixed, pre-defined set of tags. You can't make up your own tags - you have to use only those tags that are supported by HTML, or programs won't be able to make sense of your documents.
XML is also a meta markup language, although much simpler than SGML (and also less powerful). XML was created, in part, to provide a formal mechanism for extending HTML without needing to use SGML. Using XML, anyone can create their own markup language (tags) and definitions of how browsers should display these documents. A number of organizations have created new markup languages that are widely used, such as CML: Chemical Markup Language, MathML: Mathematical Markup Language and VML: Vector Markup Language.
An XML document must be structured according to the rules that define what markup is allowed and the structure of the markup. Like HTML, an XML document is a text document that contains plain text and markup. The markup is composed of tags and XML declarations. There is the notion of a "well-formed XML document": a document that obeys all of the basic XML syntax rules. Any XML parser can tell you whether a document is well-formed. Programs called "validators" go one step further and check whether a well-formed document also obeys the rules of a particular document type (a "valid" document).
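As a rough illustration of a well-formedness check, here is a minimal sketch using the XML parser in Python's standard library (xml.etree.ElementTree). The two sample documents and the helper name is_well_formed are made up for this example. The check only covers well-formedness; it does not compare the document against any DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts `text` as XML, False if it rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical documents, invented for this example.
good = "<note><to>Dave</to><body>Hello</body></note>"
bad = "<note><to>Dave</body></note>"  # <to> is closed by the wrong end tag

print(is_well_formed(good))  # True
print(is_well_formed(bad))   # False
```

The parser stops at the first problem it finds, so a rejected document may contain more than one error.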
The "rules" for the markup and structure of a particular type of document are defined in a Document Type Definition (DTD). The general rules that every XML document must follow are fixed by the XML specification itself; in addition you can create your own DTD that defines your own XML document type (although documents of that type will still be subject to the general XML rules).
Below are informal descriptions of some of the "rules" that define what is a valid XML document:
An element with no content can be written as a single tag with a / before the > that marks the end of the tag. For example: <BR/>.
If you create your own set of rules (DTD), you have created your own markup language, although yours is a subset of XML. The languages mentioned above (CML, MathML) are examples. So a MathML document is an XML document that is subject to the rules of the MathML DTD; these rules define the set of acceptable tag names, attribute values and the overall structure of a document.
You don't have to create a DTD to use XML. Without one, your document is "just" an XML document (not a pre-defined special kind of document, like a MathML or CML document).
Below is a sample XML document (without any reference to a DTD). This document might represent part of a database of student records. The document itself contains information related to a single student.
You can load this document in your browser to see what your browser does with XML: stu.xml
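To give a feel for how a program might extract information from a student record like this, here is a small Python sketch using the standard xml.etree.ElementTree module. The record and its tag names (student, name, major, courses, course) are invented for this example and are not the contents of stu.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical student record; the tag names and values are assumptions
# made for illustration only.
STUDENT_XML = """<?xml version="1.0"?>
<student id="s001">
  <name>Pat Example</name>
  <major>Computer Science</major>
  <courses>
    <course>EIW</course>
    <course>Networks</course>
  </courses>
</student>
"""

root = ET.fromstring(STUDENT_XML)

# Pull individual pieces of information out of the tree.
student_id = root.get("id")            # attribute on the root element
name = root.findtext("name")           # text of the <name> child
major = root.findtext("major")
courses = [c.text for c in root.iter("course")]

print(student_id, name, major)
print(courses)
```

Because the tags describe what the data is (a name, a major, a course) rather than how it should look, the program can pick out exactly the fields it needs.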
You can create CSS rules that tell the browser how you want to display the content of each XML tag. Here are some rules for the above XML document:
You can load this version of the document in your browser to see what your browser does with the same XML file when associated with the above CSS file: stucss.xml
Using CSS with an XML file is one way of telling a browser how to render the XML document. CSS is very limited, as all you can specify are rules for colors, fonts, position, etc. - you can't rearrange the document at all or do things like create a table of contents (CSS rules have very little ability to create content of their own).
XML includes a completely different mechanism for specifying style: you can create an XSL stylesheet that includes transformations that allow you to rearrange and/or add content. An XSL stylesheet is itself an XML document; the individual elements in this document define transformation rules that are applied to the various elements of the original XML document. The result can be another XML document or an HTML document (actually XHTML).
Defining an XSLT (XSL Transformation) stylesheet is a little like writing a program, although the language used is based on XML. An example is provided below, but don't worry if it doesn't make sense (we are not treating XSLT as a topic to explore in this course) - the general idea of "what is possible" is the important thing.
Here is the actual XSLT stylesheet that is used by the browser to transform the stu.xml file into HTML:
Here is a link to the file that includes a reference to the above XSL stylesheet: stuxsl.xml.
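To make the idea of "rearranging and adding content" concrete, here is a rough Python sketch that performs the same kind of transformation by hand. This is not XSLT (Python's standard library has no XSLT processor); it is a stand-in written with xml.etree.ElementTree, and the input record, the tag names and the function name to_html are all made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical input document; not the contents of stu.xml.
SOURCE = (
    "<student><name>Pat Example</name>"
    "<course>Networks</course><course>EIW</course></student>"
)


def to_html(student: ET.Element) -> ET.Element:
    """Build a new HTML tree from a <student> element."""
    html = ET.Element("html")
    body = ET.SubElement(html, "body")

    # New content that does not appear in the source document: a heading.
    heading = ET.SubElement(body, "h1")
    heading.text = "Record for " + student.findtext("name", default="unknown")

    # Rearranged content: the courses, sorted alphabetically, as a list.
    course_list = ET.SubElement(body, "ul")
    for title in sorted(c.text or "" for c in student.iter("course")):
        ET.SubElement(course_list, "li").text = title
    return html


result = ET.tostring(to_html(ET.fromstring(SOURCE)), encoding="unicode")
print(result)
```

An XSLT stylesheet expresses the same kind of rules declaratively, as XML elements, instead of as program statements.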
XHTML is an XML-ized version of HTML 4.0. The general idea is to develop an XML markup language that matches HTML as closely as possible. The W3C (which is responsible for the standards for HTML, XML and XHTML) hopes that XHTML replaces HTML as the commonly used language on the WWW. XHTML makes it possible for programs that deal with XML to handle web pages as well. HTML is not XML, but XHTML is.
The standard for XHTML includes 3 different DTDs (each describing a slightly different markup language). When you create an XHTML document you must select the appropriate DTD and then obey the rules enforced by that DTD. The three DTDs are:
strict: The document must not include any tags or attributes that have been deprecated in HTML 4.01. For example, you can't use FONT tags or the ALIGN attribute (these need to be specified using CSS).
transitional: The document type definition supports the common usage of HTML 4.01, so old tags and attributes that are still supported by browsers (but are not part of the actual HTML 4.0 standard) are OK. This is typically the DTD you would use to create XHTML documents.
frameset: The document type definition supports everything supported by the transitional DTD, and also supports frames.
XHTML documents are XML, so they must include information about what version of XML is being used (1.0 is the current version), and what DTD to use. Here is what the top of your documents should look like:
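As a sketch of what such a document header looks like to an XML program, the Python snippet below parses a tiny page that starts with an XML declaration and a DOCTYPE. It assumes the transitional DTD and uses the standard public identifier and URL for XHTML 1.0 Transitional; the page content itself is hypothetical. Note that the standard-library parser reads the DOCTYPE but does not download the DTD or validate against it.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML page. The XML declaration and DOCTYPE are the standard
# ones for XHTML 1.0 Transitional; the head and body are made up.
XHTML_DOC = """<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Hypothetical page</title></head>
  <body><p>Hello</p></body>
</html>
"""

# XHTML elements live in the XHTML namespace, so lookups must name it.
NS = {"x": "http://www.w3.org/1999/xhtml"}

root = ET.fromstring(XHTML_DOC)
title = root.findtext("x:head/x:title", namespaces=NS)

print(root.tag)
print(title)
```

The same code would reject an ordinary HTML page that leaves tags unclosed, which is the practical difference between the two languages for a program.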
XHTML Documents

XHTML documents look very much like HTML documents. The primary differences are the XML and DTD specifications (shown above) and the additional rules on tag structure imposed by XML. The basic rules are:

Nesting of tags must be correct. For example, the following is not allowed:

    <b><i>This is a bold, italic sentence.</b></i>

The above will work in an HTML document, but not in XHTML, since the nesting of the b and i tags is not correct. A correct version would look like this:

    <b><i>This is a bold, italic sentence.</i></b>

All start tags must have a corresponding end tag. The following is not legal in XHTML (although it's fine in HTML):

    <p>here is my list:</p>
    <ul>
    <li>one
    <li>two
    <li>three
    </ul>

The problem is that the LI tags have no corresponding end tag. Here is a valid XHTML version of the above:

    <p>here is my list:</p>
    <ul>
    <li>one</li>
    <li>two</li>
    <li>three</li>
    </ul>

Some HTML elements do not actually have any content. For example, the line break tag <BR> tells the browser to start a new line, but you never see anything like this: <BR>Hi Dave</BR>. The BR element was not intended to contain any text. Since every start tag must have an end tag, in XML you would need to use something like: <br></br>. Fortunately there is a shorthand notation for empty elements like this: you use a single tag that is both start and end tag, <br />. This applies to many commonly used HTML tags, including IMG, LINK, FRAME and HR. Here is what part of a valid XHTML document that includes some of these kinds of elements should look like:

    <p>Here are some images:</p>
    <img src="foo.gif" />
    <br />
    <hr />
    <img src="another.jpg" />
    <br />

All attributes must have a value, and the value must be quoted. HTML supports attributes with no value (for example the CHECKED attribute of a radio form field), and doesn't care if you quote values unless there is a space in the value. Here is some valid HTML that won't work in XHTML:

    <form method=GET action=foo.cgi>
    red: <input type=radio name=color value="red"><br>
    green: <input type=radio checked name=color value="green"><br>
    blue: <input type=radio name=color value=blue><br>
    </form>

Here is the corresponding XHTML:

    <form method="get" action="foo.cgi">
    red: <input type="radio" name="color" value="red" /><br />
    green: <input type="radio" checked="checked" name="color" value="green" /><br />
    blue: <input type="radio" name="color" value="blue" /><br />
    </form>

The Future of HTML, XHTML and XML

It's hard to predict what will happen in the future, but it is generally agreed that learning/using XHTML is not a waste of time. It is probably true that HTML documents will be around for a very long time, but it is certainly possible that the more formal structure of XHTML documents will be used by many programs other than browsers (it is much easier to write a program that can handle XHTML than one that has to support all the oddities of HTML). XML is widely accepted and used as an intermediate representation for data. Lots of systems can import and export data in XML format - this makes the exchange of data between applications and across networks simple and easy to implement (there are lots of XML libraries for lots of programming languages). It's not clear whether XML will ever replace HTML/XHTML as the language used to describe web documents. XHTML has some room for expansion (you can develop "modules" that extend XHTML, for example there is a MathML module). It appears likely that this, rather than XML, will provide the mechanism for extending web pages in the future.
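The point that an ordinary XML library can process XHTML but not loosely written HTML can be tried out directly. The sketch below feeds shortened versions of the HTML and XHTML examples from these notes to the XML parser in Python's standard library. The helper name parses_as_xml and the wrapping div element are assumptions made for this example, and the check covers well-formedness only, not validity against an XHTML DTD.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(fragment: str) -> bool:
    """Wrap a fragment in one root element and try to parse it as XML."""
    try:
        ET.fromstring("<div>" + fragment + "</div>")
        return True
    except ET.ParseError:
        return False


# Pairs of HTML-style and XHTML-style markup, one pair per rule.
examples = {
    "bad nesting": "<b><i>This is a bold, italic sentence.</b></i>",
    "good nesting": "<b><i>This is a bold, italic sentence.</i></b>",
    "missing end tags": "<ul><li>one<li>two<li>three</ul>",
    "end tags present": "<ul><li>one</li><li>two</li><li>three</li></ul>",
    "unclosed empty element": "<p>line one<br>line two</p>",
    "closed empty element": "<p>line one<br />line two</p>",
    "bare attributes": "<input type=radio checked name=color>",
    "quoted attributes": '<input type="radio" checked="checked" name="color" />',
}

results = {label: parses_as_xml(text) for label, text in examples.items()}
for label, ok in results.items():
    print(f"{label}: {'accepted' if ok else 'rejected'}")
```

Every HTML-style fragment is rejected and every XHTML-style fragment is accepted, with no HTML-specific knowledge in the program at all.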