Old Thesis: 1994-1995

My latest idea (I'm excited about it, and Martin has agreed that it at least has thesis potential) is that I could build a Parallel GIS Query System!

GIS Databases are read-mostly
GIS Databases in business environments are not SUPERHUGE (< 1 Gig)
GIS Databases suffer from severe performance problems when dealing with spatial queries
The cost of disk space is extremely low
The results of a GIS query are typically small (from a data content standpoint) compared to the size of the database. For example:

"Show me all the phone poles within 5 feet of a fire hydrant"

Given the above assumptions (especially the read-mostly part), replication on ordinary networked workstations can provide a powerful parallel database query system.

My proposal is the design of a parallel query system that takes advantage of my belief that GIS queries are disk I/O bound. To reduce the disk I/O that any single machine must perform, I am proposing a system composed of ordinary workstations networked together with ordinary hardware and software. A query is issued, and a query evaluator creates an execution plan based on the "type of query" and how parallelism can best be used to evaluate it. The query is then evaluated, in parallel, on ordinary workstations with replicated databases. Each workstation returns a partial but complete result (note: a finished piece of the answer, not an intermediate result), and these pieces are assembled to form the completed result.

An example of how this would work is the following query:

"Show me all the phone poles within 5 feet of a fire hydrant"

The query could be parallelized SPATIALLY into four quadrants. The search for phone poles in each quadrant is sent to a different workstation (for a total of four workstations), and each workstation's findings are sent to a central place and assembled for presentation to the user. Since each workstation has its own copy of the database, no specialized hardware is needed (for dealing with things like intermediate joins, etc.).
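The quadrant split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a design for the real system: the point lists, the EXTENT bounding box, and the function names are all hypothetical, the "workstations" are simulated by threads that share one in-memory copy of the data, and a brute-force distance test stands in for a real spatial access method. Each worker owns the phone poles in its quadrant but may check hydrants anywhere, which is possible only because every node holds a full replica.

```python
from concurrent.futures import ThreadPoolExecutor
from math import hypot

# Hypothetical data standing in for the replicated database on each workstation.
POLES = [(1.0, 1.0), (9.0, 9.0), (4.0, 6.0), (12.0, 3.0), (3.0, 17.0)]
HYDRANTS = [(2.0, 4.0), (11.0, 6.0), (18.0, 15.0)]
EXTENT = (0.0, 0.0, 20.0, 20.0)  # xmin, ymin, xmax, ymax of the whole map


def quadrant_of(point, extent):
    """Return 0..3, assigning every point to exactly one quadrant."""
    xmin, ymin, xmax, ymax = extent
    xmid, ymid = (xmin + xmax) / 2, (ymin + ymax) / 2
    x, y = point
    return (1 if x >= xmid else 0) + (2 if y >= ymid else 0)


def near_hydrant(pole, radius):
    """Brute-force test: is any hydrant within `radius` of this pole?"""
    return any(hypot(pole[0] - h[0], pole[1] - h[1]) <= radius for h in HYDRANTS)


def worker(quadrant, radius):
    """One simulated workstation: search only the poles in its own quadrant.

    Hydrants are NOT restricted to the quadrant, so a pole near a quadrant
    edge still matches a hydrant on the other side of the boundary.
    """
    return [p for p in POLES
            if quadrant_of(p, EXTENT) == quadrant and near_hydrant(p, radius)]


def parallel_query(radius):
    """Send one quadrant to each of four workers and assemble the pieces."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        partials = pool.map(worker, range(4), [radius] * 4)
        return sorted(p for part in partials for p in part)


def serial_query(radius):
    """Single-machine reference answer for comparison."""
    return sorted(p for p in POLES if near_hydrant(p, radius))


if __name__ == "__main__":
    print(parallel_query(5.0))
```

Because every pole belongs to exactly one quadrant, the four partial results are disjoint and can simply be concatenated. No worker needs anything from another worker, which is the "partial but complete" property the proposal relies on.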

In a nutshell, a lab full of ordinary workstations with some free disk space can become a massive parallel query engine.

Of course there are lots of questions that need to be answered:

Can I prove that my underlying assumptions are true?
How much better is this than a standard serial database?
Are certain types of queries more amenable to this type of parallelization? For example, a query that produces intermediate results that must be communicated to all of the database nodes might actually take longer in parallel than serially. Maybe the two ways of evaluating the query (serial and parallel) can RACE to completion: the fastest wins and the loser is shut down mid-processing.
Can a set of "types of queries" be defined, with each query type resulting in a different type of parallel execution plan?
Can these types of queries be identified?
Do the spatial access methods that exist need to be modified to take advantage of this parallelism?
Could a system that snoops & steals free disk space (like the ones that snoop and steal CPU cycles) be developed to maximize the disk usage of workstations in this replicated database environment?
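The RACE idea raised in the questions above (run the serial and the parallel evaluation side by side, keep whichever finishes first, and shut the other down) might look roughly like the following Python sketch. Everything here is an assumption made for illustration: the plans are fake workloads that just sleep in small steps, the names make_plan and race are hypothetical, and cancellation is cooperative, meaning each plan checks a shared flag between steps.

```python
import threading
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def make_plan(name, steps, step_seconds=0.01):
    """Build a fake execution plan that does `steps` small units of work."""
    def run(cancel):
        for _ in range(steps):
            if cancel.is_set():
                return None  # the loser is shut down mid-processing
            time.sleep(step_seconds)  # stand-in for one unit of query work
        return name
    return run


def race(plans):
    """Start every plan at once and return the result of the first to finish."""
    cancel = threading.Event()
    with ThreadPoolExecutor(max_workers=len(plans)) as pool:
        futures = [pool.submit(plan, cancel) for plan in plans]
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        cancel.set()  # tell the remaining plans to stop at their next check
        return next(iter(done)).result()


if __name__ == "__main__":
    winner = race([make_plan("serial", 2), make_plan("parallel", 200)])
    print("winner:", winner)
```

The cost of racing is that both plans consume resources until one of them finishes, so it only makes sense where spare workstations are available, as the proposal assumes.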