EIW Fall 2004 Lecture Notes

The World Wide Web and HTML


Hypertext History

In 1945 Vannevar Bush published an essay titled "As We May Think" in The Atlantic Monthly that described the idea of linking documents together to make it easier to keep track of the relationships between them. Although Bush's description was primarily that of a "personal system" (instead of a global system linking documents from many sources), it is often credited as being the earliest description of what we now call hypertext. The term "hypertext" was coined in 1965 by Ted Nelson, who went on to provide (along with Douglas Engelbart, the inventor of the mouse) a crude implementation of hypertext. Nelson also described and designed a system called "Xanadu" that was meant to put the world's entire literary content online. Work on Xanadu was started in 1979 and still continues...

HTML and HTTP history

In 1989, while working at the European Particle Physics Lab (CERN), Tim Berners-Lee designed a system that would allow scientists to easily share scientific findings over the Internet. The initial implementation included a text-mode browser and a browser written for the NeXTSTEP operating system that provided access to hypertext files based on HTML and to USENET newsgroups. HTML (Hypertext Markup Language) was developed by Berners-Lee as an application of SGML (Standard Generalized Markup Language). The protocol for retrieval of HTML documents was named HTTP (HyperText Transfer Protocol).

In 1993 the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign developed a browser named Mosaic that ran on Unix systems running the X Window System (they also developed a version for the Macintosh). The staff at NCSA also provided an extended version of HTML that included support for images (the IMG tag). Although the IMG tag was eventually incorporated into HTML, the programmers at NCSA didn't wait for the HTML standards organization (headed by Berners-Lee) to incorporate the change into the standard. Netscape (which was co-founded by the principals at NCSA) and Microsoft continue this practice of supporting additions to HTML before standards committees get around to formalizing updates.

HTML

From the book "HTML: The Definitive Guide" (page 8):

HTML is a document-layout and hyperlink-specification language. It defines the syntax and placement of special, embedded directions that aren't displayed by the browser, but tell it how to display the contents of the document, including text, images, and other support media. The language also tells you how to make a document interactive through special hypertext links, which connect your document with other documents -- on either your computer or someone else's, as well as with other Internet resources, like FTP.

It is important to notice that HTML is not a word processing tool; it is a means of describing the structure of documents. By defining the structure of a document, it is possible to let the browser decide the most appropriate way to render the document contents. Although HTML includes features that are related to page layout (frames, tables, etc.), you should keep in mind that HTML is meant to reflect document structure more than appearance. It is often possible to tune an HTML document to a particular browser to get a desired effect, only to find out that the page looks completely different when rendered by a different browser (or with a different window width, etc.).

HTML Tags

HTML documents contain content that is to be displayed and tags that define the structure of the document (and, in a few cases, specify formatting instructions). These tags are used by a browser to decide how to display the content; the tags themselves are not displayed. HTML documents are simple text files that can be created with any text editor; the tags are just special character sequences that are interpreted by the browser. If you want to create a document that includes some content that looks like HTML tags, you need to do something special (more on this later).
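
As a quick preview of that "something special": one common approach is to write the less-than and greater-than characters as the character entities &lt; and &gt;, so that the browser displays them instead of treating them as the start and end of a tag. A minimal sketch (the sentence is made up for illustration):

```html
<!-- &lt; is displayed as "<" and &gt; is displayed as ">" -->
<P>To make text bold, surround it with &lt;B&gt; and &lt;/B&gt; tags.</P>
```

A browser should display the tag names literally in this paragraph rather than turning on boldface.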

HTML tags are always bracketed within a less-than (<) and greater-than (>) character. Every tag has a name that indicates to the browser some information about document structure, and some tags can have attributes that provide additional information to the browser.

Start and End tags

Most HTML tags are used to mark the beginning and end of a region of a document (for example the beginning and end of a paragraph). These tags always come in pairs in a document - the leading tag is called the start tag and the trailing tag is called the end tag. Both tags have the same name; the only syntactic difference is that the end tag includes a "/" before the tag name. Here is an example that uses the P tag to mark the beginning and end of a paragraph:

<p>Here is a paragraph.
This paragraph is not very interesting.
This paragraph has a start tag and an end tag.
This paragraph has a bunch of lines. 

I can put a blank line in the middle of a paragraph, but whitespace
is removed by the browser - so everything between the tags is rendered 
as a single paragraph.
</p>

NOTE: In these lecture notes boxes like the one shown above contain HTML code as you would type it using a text editor to create a web page. Boxes inside a narrow dark border show what the HTML code will look like when the browser renders the HTML document.

When rendered by a browser, the above HTML will look like this:

Here is a paragraph. This paragraph is not very interesting. This paragraph has a start tag and an end tag. This paragraph has a bunch of lines. I can put a blank line in the middle of a paragraph, but whitespace is removed by the browser - so everything between the tags is rendered as a single paragraph.

HTML Document Structure

Every HTML document should start with the tag <HTML> and end with the tag </HTML>; this tells the browser that this is an HTML document. Although these tags are required, most browsers work fine without them (although this may change!).

Each HTML document includes a head and a body. The head includes information about the document (possibly the title, author, date of creation, software used to create the document) and the body contains the content of the document. There are tags used to identify these sections:

<HEAD> </HEAD> these tags surround the head of the document and come first (before the body tags).

<BODY> </BODY> these tags surround the content of the document.

The head and body tags are actually required by the latest version of HTML, although most browsers will work fine without them (they interpret everything as the body of the document). It is possible that in the future all documents will need to have an explicit head, so it's best to use the head and body tags on anything you create.

Within the document head there is one required element - the title, marked with the <TITLE> and </TITLE> tags. The text between the title tags should be the document title - this title is typically shown in the title bar of the browser window. Document titles should convey something useful about the content of the document.

These required tags now give us the following document structure:

<HTML>
<HEAD>
<TITLE> Document Title Goes Here </TITLE>
</HEAD>

<BODY>
document body goes here
</BODY>
</HTML>
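
The head can also carry the kind of descriptive information mentioned above (author, software used to create the document, and so on). One common way to do this is with META tags. The sketch below assumes the conventional names "author" and "generator"; the title and the values shown are invented for illustration:

```html
<HTML>
<HEAD>
<TITLE>EIW Lecture Notes: HTML</TITLE>
<!-- META tags describe the document; they are not displayed in the page -->
<META NAME="author" CONTENT="Dave">
<META NAME="generator" CONTENT="a plain text editor">
</HEAD>

<BODY>
document body goes here
</BODY>
</HTML>
```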

Most HTML tags can include modifiers called attributes that provide the browser with additional information about how to render the document. These attributes are included as name=value pairs within the start tag. For example, a very useful attribute of the <BODY> tag is BGCOLOR. When found in a BODY tag, the BGCOLOR attribute specifies the background color for the entire document. For example, the tag <BODY BGCOLOR=PINK> tells the browser that we are starting the body of the document and that it should be rendered with a pink background. There are lots of different attributes that can be used in the BODY tag, including ones to tell the browser about document margins, default text color, link color and actions to take when various mouse events are detected. See any HTML reference for details (for example: The HTML 4.0 specification).
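
As a rough sketch of the name=value syntax, here is a BODY start tag that sets several attributes at once. TEXT, LINK and VLINK are the HTML 4 attributes for the default text color, the color of unvisited links and the color of visited links; the particular colors are arbitrary choices for illustration:

```html
<!-- Several attributes in one start tag, separated by whitespace -->
<BODY BGCOLOR="white" TEXT="black" LINK="blue" VLINK="purple">
<P>This page has black text on a white background.</P>
</BODY>
```

Quotes around the value are optional for simple values like PINK, but they are needed when the value contains characters such as "#" or "/", so always quoting is a reasonable habit.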

Formatting tags: B and I

Although HTML is a document structuring language, there are a few tags that convey specific formatting information to the browser. For example, the tag <B> is used to turn on boldface. To turn off boldface you use the end tag </B>. These tags are embedded within the document content and used by the browser as control information. Here is part of an HTML document that includes a word in boldface:

Hello <B>World</B>

When rendered by a browser this document might look like this (notice the word "World" is shown in boldface):

Hello World

Similarly, the <I> and </I> tags are used to indicate italics:

Hello <I>Cruel</I> <B>World</B>

would look like this:

Hello Cruel World

Many tags are applied to some region of a document. The tags for bold and italics only affect the document content found between the start tag (<B> or <I>) and the end tag (</B> or </I>). The start tag turns on some attribute and the end tag turns it off. In general, HTML end tags look just like the corresponding start tags with the addition of the "/".

The bold and italics tags are examples of tags that tell the browser how to render part of a document. Most of the tags supported by HTML instead tell the browser about the contents of a region and let the browser decide what attribute(s) to apply when rendering text. For example, the <EM> and </EM> tags tell the browser to emphasize a region, although they don't explicitly tell the browser how to do this. Typically a browser will put everything between <EM> and </EM> tags in italics, so (for now) the result is usually the same as using an italics tag. However, using the italics tag takes control away from the browser (and the browser user). Browsers allow users to establish how they would like to view emphasized text, but anything marked to be rendered in italics always means "use a slanted font".

To understand the difference between using an italics tag and an emphasis tag, consider a blind computer user who has voice generation software that can read web pages out loud. When this software reads a section of text that is marked to be in italics, the best (correct) thing the software could do is to state that the next sentence or word is in italics. For those sections that are tagged as emphasized, the software could change the pitch of the speaking voice, change to a different voice, or use any other option that the user desires. Keep in mind that HTML was originally designed to communicate the structure of documents - using an <EM> tag indicates something about structure (a section is important and should be emphasized), while the <I> tag indicates something about the author's personal preference as to how a section should look.

One final thought about structuring tags (like <EM> ) vs. formatting tags (like <I> ): It has always been envisioned that the tags within HTML documents can be used to group and search documents. For example, a search might look for all emphasized passages in a collection of web pages - in this case the search would involve only sections of text between the <EM> and </EM> tags. Since individual authors might have different preferences as to how emphasized text should look, if they each use different formatting tags the search would be impossible (some might use italics, some might use bold, etc).
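
To make the distinction concrete, here is a small sketch that marks up the same sentence both ways. STRONG is the structuring tag for strong emphasis (typically rendered in boldface); the sentence itself is made up for illustration:

```html
<!-- Structuring tags: say what the text is -->
<P>You <EM>must</EM> save the file <STRONG>before</STRONG> closing the editor.</P>

<!-- Formatting tags: say only how the text should look -->
<P>You <I>must</I> save the file <B>before</B> closing the editor.</P>
```

Most graphical browsers will probably render the two paragraphs identically, but only the first tells voice software or a search tool which words the author considers important.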

HTML and text

Typically the body of an HTML document will include a number of text elements such as paragraphs, tables and lists. When rendering a paragraph a browser will wrap each line so that no word is split between lines, using the entire width of the browser window. White space within an HTML document (including spaces, tabs and linefeeds) is used to delimit words (or tags) but is not otherwise rendered. If you put 2 spaces between words in an HTML document, the browser will ignore this and display a single space (you can actually put 10 blank lines between words and the browser will still put a single space between them).

To separate individual paragraphs within a document you use the <P> and </P> tags to surround each paragraph. You can also use the <BR> tag (line break) to tell the browser to start a new line (without starting a new paragraph). NOTE: Unlike the other tags we've seen - the <BR> tag has no corresponding end tag.
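
For instance, a line break is handy for things like addresses or poetry, where the lines belong to one paragraph but must not run together. A small sketch (the address is made up):

```html
<!-- Each BR forces a new line inside the same paragraph -->
<P>Dave's Cookie Shop<BR>
123 Main Street<BR>
Troy, NY</P>
```

This should render as three separate lines, without the extra vertical space a browser usually puts between paragraphs.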

HTML also supports a <DIV> tag that can be used to divide the document into discrete, named sections. The major benefits of doing this are that it provides an organizational tool for authors, and that a style (text styles, margins, colors, etc.) can be applied to an entire division (section) at once.
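
As a hedged sketch of applying a style to a whole division: one way to do it is the STYLE attribute, which holds style sheet (CSS) properties. The ID value and the particular properties below are arbitrary examples, not something these notes prescribe:

```html
<!-- The text color is inherited by every paragraph in the division;
     the left margin indents the division as a whole -->
<DIV ID="summary" STYLE="color: navy; margin-left: 2em">
<P>This paragraph is indented and shown in navy text.</P>
<P>So is this one, because it is in the same division.</P>
</DIV>
```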

According to the HTML standard, you should use <DIV> and <P> tags with their corresponding end tags. Here is what a correct document might look like:

<DIV ID="section">
<P>This is the first paragraph in this document.
This is the first paragraph in this document.</P>

<P>This is the second paragraph in this document.
This is the second paragraph in this document.</P>

<P>This is the third paragraph in this document. 
By now you aren't even reading the sentences, are you?</P>
</DIV>

However, some people used to put a single <P> between paragraphs, and browsers understood what to do (and still do). For example, consider the following document body with 3 paragraphs:

This is the first paragraph in this document.
This is the first paragraph in this document.<P>

This is the second paragraph in this document.
This is the second paragraph in this document.<P>

This is the third paragraph in this document. 
By now you aren't even reading the sentences, are you?

and here is what this looks like when rendered by the browser:

This is the first paragraph in this document. This is the first paragraph in this document.

This is the second paragraph in this document. This is the second paragraph in this document.

This is the third paragraph in this document. By now you aren't even reading the sentences, are you?

According to the standard each paragraph should start with <P> and end with </P>, but the above style is used in lots of older documents on the WWW. I'd suggest using both start and end tags to make sure your documents will look right in the future.

HTML Headings

A number of tags are defined to be used to indicate section headings within a document. Typically a document contains a number of sections (chapters), and within each section are subsections, and within subsections are sub-subsections, and so on. The heading tags surround some text that is rendered by a browser, typically a section name (or subsection, etc). The heading tags are <H1>, <H2>, <H3>, ... <H6>, with H1 being the highest level heading (usually rendered the largest) and H6 the lowest level heading. For example - the following HTML:

<H1>Section 1: The Meaning of Life</H1>

<H2>Section 1.1: Nirvana and RPI</H2>

<H2>Section 1.2: Searching for Truth </H2>

<H3>Section 1.2.1: Using AltaVista in the search for truth</H3>

<H4>Section 1.2.1.1: Combining search terms</H4>

<H3>Section 1.2.2: Using GoTo.com in the search for truth</H3>

<H1>Section 2: The Meaning of HTML</H1>

might look like this when rendered:

Section 1: The Meaning of Life

Section 1.1: Nirvana and RPI

Section 1.2: Searching for Truth

Section 1.2.1: Using AltaVista in the search for truth

Section 1.2.1.1: Combining search terms

Section 1.2.2: Using GoTo.com in the search for truth

Section 2: The Meaning of HTML

The <H1>, ... <H6> tags can include a number of attributes, including the ALIGN attribute that tells the browser how to align the heading. The valid choices are ALIGN="CENTER", ALIGN="LEFT", ALIGN="RIGHT" and ALIGN="JUSTIFY". The justify value is not widely supported by browsers, but centering, left and right alignment work fine.
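
As a short sketch, here is how the ALIGN attribute might be applied to a few headings (the heading text is reused from the example above):

```html
<!-- Centered top-level heading -->
<H1 ALIGN="CENTER">Section 1: The Meaning of Life</H1>

<!-- Right-aligned subsection heading -->
<H2 ALIGN="RIGHT">Section 1.1: Nirvana and RPI</H2>

<!-- No ALIGN attribute: the browser uses its default alignment -->
<H2>Section 1.2: Searching for Truth</H2>
```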

HTML Lists

HTML supports ordered (numbered) and unordered lists. Each list can include a number of list items; the browser renders these list items in a way that (hopefully) appears as a list.

Unordered lists are contained within the tags <UL> and </UL>. Ordered lists are contained within the <OL> and </OL> tags. In both cases each individual list item is contained within the <LI> and </LI> tags. Below are a few examples:


Dave's   favorite cookies:
<UL>
  <LI>  Chocolate Chip </LI>
  <LI>  Chocolate Chocolate Chip </LI>
  <LI>  Chunky Chocolate Chip </LI>
  <LI>  Oatmeal </LI>
  <LI>  Oreo </LI>
</UL>

Which looks like this:

Dave's favorite cookies:

  • Chocolate Chip
  • Chocolate Chocolate Chip
  • Chunky Chocolate Chip
  • Oatmeal
  • Oreo

Here is an ordered list:


Top 5 reasons to come to class:
<OL>
  <LI> Dave might bring cookies </LI>
  <LI> You might learn how to make an HTML list </LI>
  <LI> There is nothing on TV from 2:00-4:00 PM</LI>
  <LI> You can hide behind a pillar and sleep </LI>
  <LI> There might be a test</LI>
</OL>

Which might look like this:

Top 5 reasons to come to class:

  1. Dave might bring cookies
  2. You might learn how to make an HTML list
  3. There is nothing on TV from 2:00-4:00 PM
  4. You can hide behind a pillar and sleep
  5. There might be a test

HTML Tables

HTML supports the display of tabular data using tables. Tables are also used to manage document layout (probably more often than to display tabular data). The HTML table model includes three basic elements - the table (<TABLE> and </TABLE> tags), the table row (<TR> and </TR> tags) and the table cell (using either <TH> and </TH> or <TD> and </TD> tags). The general structure supported by HTML is shown below. The idea is that you build a table from table rows, and that you build table rows from table cells.

<TABLE>
  <TR>
     <TD> This is the first cell </TD>
     <TD> This is the second cell (still on the first row) </TD>
  </TR>
  <TR>
     <TD>  New row! </TD>
     <TD>  Another cell in the second row </TD>
  </TR>
</TABLE>

which will look like this:

This is the first cell    This is the second cell (still on the first row)
New row!                  Another cell in the second row

The <TH> tag is used for table headings (TD stands for table data, TH for table heading) and simply changes the default text style used to display the contents of the cell. The contents of a <TH> cell are usually rendered in boldface.

Below is a table that includes some attributes to alter the display of the table, including borders, background colors and multicolumn cells. There are other useful attributes that can be used to alter the display of a table - you can change the spacing between cells, the alignment of text in the cells, etc. Check any HTML reference for the details.

<TABLE BORDER=2 BGCOLOR=wheat>
  <TR BGCOLOR=WHITE>
     <TH colspan=3>Table Attributes</TH>
  </TR>
  <TR>
     <TH>Attribute Name</TH>
     <TH>Values</TH>
     <TH>Use</TH>
  </TR>
  <TR>
     <TD>BGCOLOR</TD>
     <TD><EM>any color name</EM></TD>
     <TD>Sets background color</TD>
  </TR>
  <TR>
     <TD>BORDER</TD>
     <TD><EM>border width in pixels</EM></TD>
     <TD>width of grid lines between cells</TD>
  </TR>
  <TR>
     <TD>CELLPADDING</TD>
     <TD><EM>Distances (1pt, 1in)</EM></TD>
     <TD>set space between cell edge and cell contents</TD>
  </TR>
</TABLE>

which looks like this:

Table Attributes
Attribute Name   Values                   Use
BGCOLOR          any color name           Sets background color
BORDER           border width in pixels   width of grid lines between cells
CELLPADDING      Distances (1pt, 1in)     set space between cell edge and cell contents


Using Tables for Page Layout

Tables are often used to establish the layout for an entire page, for example to provide a menu on one side of the page and text on the other. The table below shows an example of this, but you can easily find better examples by viewing the HTML source of most web pages.

<TABLE BORDER="0" CELLSPACING="0" CELLPADDING="10">
  <!-- Gray bar across the top -->
  <TR BGCOLOR="#808080"><TD COLSPAN="3"> </TD></TR>
  <TR>
    <!-- Left cell: the menu, built as a nested table -->
    <TD BGCOLOR="WHEAT" VALIGN="TOP">

      <TABLE BORDER="0">
        <TR ALIGN="CENTER">
          <TH BGCOLOR="WHITE">Sites with stock quotes</TH>
        </TR>
        <TR ALIGN="CENTER">
          <TD BGCOLOR="#8080FF">
            <A HREF="http://finance.yahoo.com">Yahoo finance</A>
          </TD>
        </TR>
        <TR ALIGN="CENTER">
          <TD BGCOLOR="#80FFFF">
            <A HREF="http://www.ragingbull.com">Raging Bull</A>
          </TD>
        </TR>
        <TR ALIGN="CENTER">
          <TD BGCOLOR="#FF80FF">
            <A HREF="http://www.etrade.com">eTrade</A>
          </TD>
        </TR>
        <TR ALIGN="CENTER">
          <TD BGCOLOR="#80FF80">
            <A HREF="http://www.eschwab.com">Charles Schwab</A>
          </TD>
        </TR>
      </TABLE>
    </TD>

    <!-- Middle cell: the page text -->
    <TD>
      <P>The sites shown in the menu all contain information about
      stocks and provide stock quotes. Some of these sites also
      support on-line trading of stocks, although you need to
      establish an account before you can start losing money.</P>

      <P>This is really just filler text to show that you can use a
      table to do page layout. The text in this cell is treated just
      as if it was itself a page. You can include anything in a cell
      you would put anywhere in the body of an HTML document.</P>

      <H3>Here is a heading!</H3>

      <P>Did you notice that the other cell of this table contains
      a table within the cell? </P>
    </TD>
    <!-- Right cell: gray bar down the side -->
    <TD BGCOLOR="#808080"> </TD>
  </TR>
  <!-- Gray bar across the bottom -->
  <TR BGCOLOR="#808080"><TD COLSPAN="3"> </TD></TR>
</TABLE>

This rather complicated table will end up looking like this:

 
Sites with stock quotes
Yahoo finance
Raging Bull
eTrade
Charles Schwab

The sites shown in the menu all contain information about stocks and provide stock quotes. Some of these sites also support on-line trading of stocks, although you need to establish an account before you can start losing money.

This is really just filler text to show that you can use a table to do page layout. The text in this cell is treated just as if it was itself a page. You can include anything in a cell you would put anywhere in the body of an HTML document.

Here is a heading!

Did you notice that the other cell of this table contains a table within the cell?

 
 

HTML Hyperlinks

A hyperlink is created with the <A> and </A> tags. The text between the <A> and </A> tags becomes the link - when a user clicks on this text the browser opens a new document. The location and name of the new document (the destination of the link) is included in the <A> tag as the value of the HREF attribute. This value is specified as a URL. A simple example:


<A HREF="http://www.cs.rpi.edu/~hollingd/eiw">
This is a link to the course home page. </A>
This is not a link.

When rendered by a browser:

This is a link to the course home page. This is not a link.

In the example above the destination of the link is the URL http://www.cs.rpi.edu/~hollingd/eiw. This is a fully specified URL, since it includes a protocol (http), a hostname (www.cs.rpi.edu) and a resource (/~hollingd/eiw). If you want to create a link to another document that is on the same web server as the one that provided the page containing the link, you can skip the protocol and hostname parts and use a relative URL. For example, you could use the relative URL /~hollingd/eiw on any page that is stored on the HTTP server running on www.cs.rpi.edu.

Here is a relative URL in a hyperlink:
<A HREF="/foo">Grandpa is a relative</A>

If you want to specify the name of a file that is in the same directory as the current page, you can just use the file name itself (without any leading "/").

Here is a link to the file "blah" that can be
found in the same directory as this page:
<A HREF="blah">press for blah</A>

In general it is a good idea to use relative URLs whenever possible. This makes it easy to move an entire web site - all the URLs refer to files in the same directory, and are independent of the specific server hosting the web site. We will talk more about this when we look at building web sites.
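
To see how relative URLs are resolved, suppose (purely as an assumption for illustration) that the page containing the links below was retrieved from http://www.cs.rpi.edu/~hollingd/eiw/notes/index.html. The "notes" directory and the file names are hypothetical; the comments show the absolute URL each link should resolve to:

```html
<!-- Same directory:
     http://www.cs.rpi.edu/~hollingd/eiw/notes/blah -->
<A HREF="blah">press for blah</A>

<!-- Subdirectory of the current directory:
     http://www.cs.rpi.edu/~hollingd/eiw/notes/images/logo.gif -->
<A HREF="images/logo.gif">course logo</A>

<!-- Parent directory (".." means one level up):
     http://www.cs.rpi.edu/~hollingd/eiw/syllabus.html -->
<A HREF="../syllabus.html">syllabus</A>

<!-- A leading "/" starts from the top of the same server:
     http://www.cs.rpi.edu/foo -->
<A HREF="/foo">Grandpa is a relative</A>
```

The first three links keep working if the whole directory tree is moved to another location or another server; the last one depends on where the target sits on the server, but still not on the server's name.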