Web Science Workshop

Position Paper

Henry S. Thompson
1 Sep 2005

1.   An important distinction

The 'old-fashioned' Web (OFW) is over 10 years old; it is ubiquitous; it is measured in billions of pages and uncountable numbers of HTTP GETs. The Semantic Web (SW) is half that age and a tiny fraction of that size.

It's commonly believed that a key reason the OFW grew so fast in its early days was that anybody could produce a web page about their interests or areas of expertise, and anyone else (well, anyone else who could read English :-) could find it, benefit from it, link to it, and contribute their own information.

One way of summarising (trivialising?) the SW is to say that its goal is to do the same thing for mechanical/non-human users.

There is a lot of opinion, and some science, available with respect to the question of what properties of the OFW technology account for its success; see for example The Architecture of the World Wide Web. I think it's fair to say that we have nowhere near enough experience with the SW yet to be able to tell which, if any, of the things which make the OFW work are needed for the SW.

Another way to look at this: some of the obvious antecedents of the OFW were FTP, Gopher and Hypertext. We have a pretty good idea of which of the ways in which the founding technologies of the OFW (HTTP and HTML) differed from those antecedents were implicated in its leaving them in the dust. Some of the antecedents of the SW are to be found in Information Retrieval, Database Theory (ER modelling) and AI (Semantic Nets, CYC, etc.). What aspects of the differences between the SW and those antecedents should we see as fundamental to its eventual success?

2.   A problem for the OFW

We need identifiers for, and machine-exploitable descriptions of, languages on the Web, their evolution and their relationships. We need an understanding of how names are scoped by namespaces and defined by languages.

An XML language is a collection of names for elements, attributes and other things (a vocabulary), a grammar which specifies how they can be combined to form documents, and a specification of the meaning(s) or function(s) of each name in the vocabulary. The names in an XML language are local to that language, and we don't have a systematic way of associating the Web's generic universal names, that is, URIs, with any such local name or name/definition pair.
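To make the gap concrete, here is a minimal sketch using Python's standard library (the XHTML snippet is invented for illustration): a namespace-aware parser exposes each name as an expanded (namespace URI, local name) pair, but that pair is a disambiguated token, not a URI which identifies the name or its definition.

    # Namespace-aware parsing gives each element an "expanded name"
    # (namespace URI plus local name), rendered by ElementTree in Clark
    # notation: {ns-uri}local. Note that the result is not itself a URI
    # one could dereference for the name's definition.
    import xml.etree.ElementTree as ET

    doc = """<html xmlns="http://www.w3.org/1999/xhtml">
      <body><p>Hello</p></body>
    </html>"""

    for element in ET.fromstring(doc).iter():
        print(element.tag)
    # {http://www.w3.org/1999/xhtml}html
    # {http://www.w3.org/1999/xhtml}body
    # {http://www.w3.org/1999/xhtml}p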

A major issue with respect to names on the Web is that names within languages are not static. As languages evolve, names appear and disappear and their use changes. Indeed, versioning has been a topic of concern for XML language designers and users almost as long as XML has existed. Precisely because we have formal mechanisms for defining XML vocabularies and languages, e.g. DTDs, W3C XML Schema and RELAX NG schemas, the pressure to provide concrete support for versioning is very strong.

We need a means to define languages and provide for their evolution. This will involve making it possible to name a language, to distinguish clearly a language from a version of that language, and to describe in a constrained and explicit way the relation between one version of a language and another.
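As a sketch of what such definitions might record (the class names, relation vocabulary and URI below are all invented for illustration, not drawn from any specification), one might separate the name of a language from its versions, and constrain the relations between versions to a small closed vocabulary:

    from dataclasses import dataclass, field

    # A deliberately small, closed vocabulary of inter-version relations.
    RELATIONS = {"adds-names", "removes-names", "changes-meaning",
                 "compatible-subset"}

    @dataclass
    class Version:
        language: "Language"
        label: str                        # e.g. "1.0", "1.1"
        names: set[str] = field(default_factory=set)

    @dataclass
    class Language:
        uri: str                          # the name of the language itself
        versions: dict[str, Version] = field(default_factory=dict)

        def add_version(self, label: str, names: set[str]) -> Version:
            v = Version(self, label, names)
            self.versions[label] = v
            return v

    def relate(old: Version, new: Version) -> set[str]:
        """A crude approximation of the relation between two versions."""
        rels = set()
        if new.names - old.names:
            rels.add("adds-names")
        if old.names - new.names:
            rels.add("removes-names")
        if old.names <= new.names:
            rels.add("compatible-subset")
        assert rels <= RELATIONS
        return rels

    lang = Language("http://example.org/languages/toy")   # invented URI
    v1 = lang.add_version("1.0", {"a", "b"})
    v2 = lang.add_version("1.1", {"a", "b", "c"})
    print(relate(v1, v2))   # {'adds-names', 'compatible-subset'} (order varies)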

XML namespaces will obviously play a part in this story, but we cannot assume that (versions of) languages are always identified with namespaces:
- Some widely used languages don't use namespaces at all.
- Some use more than one.
- Some widely used languages use namespaces and change them with (major) version changes.
- And some widely used languages don't change namespaces across even major versions (XSLT kept the same namespace from 1.0 to 2.0, as did XHTML from 1.0 to 1.1).
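The practical upshot is that a namespace URI alone underdetermines which version of a language a document is written in. A minimal sketch of the ambiguity (the table records only the two facts cited above):

    # Why a namespace URI cannot serve as a language-version identifier:
    # the same namespace can be shared by several versions.
    KNOWN_LANGUAGES = {
        "http://www.w3.org/1999/XSL/Transform": ["XSLT 1.0", "XSLT 2.0"],
        "http://www.w3.org/1999/xhtml": ["XHTML 1.0", "XHTML 1.1"],
    }

    def candidate_versions(namespace: str) -> list[str]:
        """All the language versions a namespace URI might indicate."""
        return KNOWN_LANGUAGES.get(namespace, [])

    print(candidate_versions("http://www.w3.org/1999/XSL/Transform"))
    # ['XSLT 1.0', 'XSLT 2.0'] -- the namespace alone cannot tell us which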

It is a fundamental principle of Web architecture that if something is of any significance on the Web, it should have a URI. This is not currently the case for most language-defined names. We would like, for example, to have a URI for "the 'P' element in HTML" and, indeed, for "the 'P' element as defined in the Transitional dialect of XHTML 1.0 of 26 January 2000", but we don't. This is a bug, and it needs to be fixed.
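To be clear about what is being asked for, here is what minting such URIs might look like. The URI pattern below is pure invention (no such scheme exists today; that absence is exactly the bug):

    from typing import Optional
    from urllib.parse import quote

    def name_uri(language: str, version: Optional[str], local_name: str) -> str:
        """Mint an (invented) URI for a name as defined by a language version."""
        uri = f"http://example.org/defs/{quote(language)}"
        if version is not None:
            uri += f"/{quote(version)}"
        return f"{uri}#{quote(local_name)}"

    # "the 'P' element in HTML", version-independent:
    print(name_uri("html", None, "p"))
    # http://example.org/defs/html#p

    # "the 'P' element as defined in XHTML 1.0 Transitional":
    print(name_uri("xhtml", "1.0-transitional", "p"))
    # http://example.org/defs/xhtml/1.0-transitional#p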

3.   The Statistical Semantic Web

One take on the history of AI over the last 30 years sees statistical processes which deliver probably nearly correct answers slowly displacing all but the most narrowly-constrained uses of traditional KR+Inference approaches.

Is there a parallel here for the SW? In a talk I gave recently at a celebration of the work of Karen Sparck Jones [very crude HTML, please do not cite or circulate], I suggested that perhaps there was. Certainly much of the early SW rhetoric about metadata and global synergy has been overtaken by the reality of success stories which are tightly focussed on specific application domains, using ontologies agreed in advance by all the players.

Together with Harry Halpin, a PhD student in Edinburgh, I've been working on an approach to naming for the SW, called Web Proper Names, which we think points towards the alternative suggested by the AI parallel: instead of focussing on the logic-based exact-answer goal, focus on the global synergy goal, accepting that the results will only probably be nearly correct.

The fundamental challenge here is that of identity of reference: achieving useable synergy between independently developed metadata requires a means for determining when two URIs are, as near as makes no nevermind, being used for the same subject or property.
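Operationally, one family of answers compares the descriptions in which two URIs participate and declares probable co-reference above a similarity threshold. The sketch below (Jaccard overlap of property-value pairs, an arbitrary threshold, invented data) is a toy illustration of the general idea, not the Web Proper Names mechanism itself:

    def jaccard(a: set, b: set) -> float:
        """Set overlap: |a & b| / |a | b|, taken as 0.0 for two empty sets."""
        return len(a & b) / len(a | b) if (a | b) else 0.0

    def probably_same(desc: dict, uri1: str, uri2: str,
                      threshold: float = 0.6) -> bool:
        """Judge two URIs probably co-referent if their descriptions overlap."""
        return jaccard(desc[uri1], desc[uri2]) >= threshold

    # Hypothetical metadata from two independently developed sources:
    descriptions = {
        "http://example.org/a#Edinburgh": {("type", "City"),
                                           ("country", "Scotland"),
                                           ("population", "450000")},
        "http://example.net/b#edinburgh": {("type", "City"),
                                           ("country", "Scotland"),
                                           ("founded", "c600")},
    }

    print(probably_same(descriptions,
                        "http://example.org/a#Edinburgh",
                        "http://example.net/b#edinburgh"))
    # False at threshold 0.6 (overlap is 2/4 = 0.5): the answer is only
    # probably nearly correct, and tunable via the threshold.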

There's even an interesting convergence of statistical NLP with this idea -- preliminary results from work by Johan Bos and others at Edinburgh suggest that useable shallow local ontologies can be derived from HTML web pages fully automatically. It's a short step to using such ontologies to automatically generate metadata using Web Proper Names...
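By way of a toy only (this is emphatically not the Edinburgh method, whose details are not given here): even crude extraction from a page's headings yields candidate terms around which metadata could be minted automatically.

    # Toy term extraction from HTML headings, using only the standard library.
    from html.parser import HTMLParser

    class HeadingTerms(HTMLParser):
        """Collect text occurring inside h1-h3 elements as candidate terms."""
        def __init__(self):
            super().__init__()
            self.in_heading = False
            self.terms: list[str] = []

        def handle_starttag(self, tag, attrs):
            if tag in ("h1", "h2", "h3"):
                self.in_heading = True

        def handle_endtag(self, tag):
            if tag in ("h1", "h2", "h3"):
                self.in_heading = False

        def handle_data(self, data):
            if self.in_heading and data.strip():
                self.terms.append(data.strip().lower())

    parser = HeadingTerms()
    parser.feed("<html><body><h1>Web Science</h1><p>...</p>"
                "<h2>Naming</h2></body></html>")
    print(parser.terms)   # ['web science', 'naming']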