Old Thesis: 1994-1995
My latest idea (I'm excited about it, and Martin has at least agreed that it has thesis potential) is that I could build a Parallel GIS Query System!
The idea rests on a few assumptions:
GIS databases are read-mostly
GIS databases in business environments are not SUPERHUGE (< 1 GB)
GIS databases suffer from severe performance problems when dealing with spatial queries
The cost of disk is extremely low
The results of a typical GIS query are small (from a data content standpoint) compared to the size of the database. For example: "Show me all the phone poles within 5 feet of a fire hydrant"
Given the above assumptions (especially the read-mostly part), replicating the database on ordinary networked workstations can provide a powerful parallel database query system.
An example of how this would work is the following query:
"Show me all the phone poles within 5 feet of a fire hydrant"
The query could be parallelized SPATIALLY into four quadrants. Each quadrant's search for phone poles is sent to a different workstation (for a total of four workstations), and the results of each workstation's findings are sent to a central place and assembled for presentation to the user. Since each workstation has its own copy of the database, no specialized hardware is needed (for dealing with things like intermediate joins, etc.).
In a nutshell, a lab full of ordinary workstations with some free disk space can become a massive parallel query engine.
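To make the four-quadrant example concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the toy coordinates, the function names, the use of local threads as stand-ins for the four workstations, and treating "within 5 feet" as a distance of at most 5. The point it tries to show is that each worker only handles the poles in its own quadrant, but can still check them against every hydrant because it holds a full copy of the data.

```python
from concurrent.futures import ThreadPoolExecutor
from math import hypot

# Hypothetical toy data: (x, y) positions in feet. In the real system each
# workstation would read these from its own replica of the GIS database.
POLES = [(10, 10), (90, 15), (20, 80), (75, 75), (50, 52)]
HYDRANTS = [(12, 13), (60, 60), (49, 49), (95, 90)]


def quadrant_of(point, xmid, ymid):
    """Return 0..3, so every point belongs to exactly one quadrant."""
    x, y = point
    return (1 if x >= xmid else 0) + (2 if y >= ymid else 0)


def search_quadrant(q, xmid, ymid, radius):
    """One workstation's share: poles in quadrant q that are near a hydrant.

    Poles are restricted to the quadrant; hydrants are not, because a
    hydrant just across the boundary can still be within `radius` of a pole.
    """
    return [
        p for p in POLES
        if quadrant_of(p, xmid, ymid) == q
        and any(hypot(p[0] - h[0], p[1] - h[1]) <= radius for h in HYDRANTS)
    ]


def parallel_query(xmid=50, ymid=50, radius=5):
    """Fan the query out to four workers and assemble the results centrally."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        parts = pool.map(lambda q: search_quadrant(q, xmid, ymid, radius), range(4))
        return sorted(p for part in parts for p in part)


if __name__ == "__main__":
    print(parallel_query())
```

Note that the pole at (50, 52) and its nearby hydrant at (49, 49) fall in different quadrants. A partitioned (non-replicated) database would need some cross-node communication to find that pair; with full replication each worker can find it on its own.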
Of course there are lots of questions that need to be answered:
Prove that my underlying assumptions are true
How much better is this than a standard serial database?
Are there certain types of queries that are more amenable to this type of parallelization? For example, a query that produces intermediate results that must be communicated to all of the database nodes might actually take longer than doing it serially. Maybe the two versions of the query (parallel and serial) can RACE to completion: the fastest wins and the loser is shut down mid-processing.
Can a classification of "types of queries" be created, so that each query type gets its own type of parallel execution plan?
Can these types of queries be identified?
Do the spatial access methods that exist need to be modified
to take advantage of this parallelism?
Could a system that snoops & steals free disk space (like ones that snoop and steal CPU cycles) be developed to maximize the disk usage of workstations in this replicated database environment?
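The RACE idea from the questions above (start both versions of a query, keep whichever finishes first, shut the other down) might look something like the following Python sketch. It is only an illustration under assumptions: `run_plan` and `race` are made-up names, threads stand in for the serial and parallel executions, and `time.sleep` stands in for real query work. A real system would have to send some kind of cancel message to the remote nodes instead of setting a local flag.

```python
import threading
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_plan(name, steps, step_seconds, stop):
    """Hypothetical stand-in for one execution plan.

    Works in small steps and checks the `stop` flag between them, so the
    losing plan can be shut down mid-processing.
    """
    for _ in range(steps):
        if stop.is_set():
            return None           # abandoned: another plan already won
        time.sleep(step_seconds)  # pretend to do one unit of query work
    return name


def race(plans):
    """Start every plan at once; return the first finisher and stop the rest."""
    stop = threading.Event()
    with ThreadPoolExecutor(max_workers=len(plans)) as pool:
        futures = [pool.submit(run_plan, name, steps, secs, stop)
                   for name, steps, secs in plans]
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        stop.set()                # losers quit at their next flag check
        return next(iter(done)).result()


if __name__ == "__main__":
    # Here the serial plan needs far fewer steps, so it should win the race.
    print(race([("serial", 3, 0.01), ("parallel", 50, 0.01)]))
```

The cost of racing is that the losing plan's work is thrown away, so it only makes sense when spare workstations are otherwise idle, which is the situation these notes assume.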