\documentclass{article}
\usepackage{hyperlatex}
\newcommand{\hlink}[3]{\xlink{#1}{#2}{#3}\texonly{\footnote{#2}}}
\newcommand{\tild}{\~{}}
%\input{techexp}
\texonly{\textheight 9in}
\texonly{\topmargin -0.5in}
\texonly{\oddsidemargin 0.5in}
\texonly{\textwidth 5.5in}
\newcommand{\HlxIcons}{}
\newcommand{\base}{http://www.cs.rpi.edu/courses/spring00/dsa/projdir/project3/}

\htmltitle{CSCI 2300 --- Data Structures and Algorithms --- Project 3}
\htmladdress{~}
\htmlattributes{BODY}{BGCOLOR="#ffffe6"}
\setcounter{htmlautomenu}{2}
\htmlonly{\setcounter{secnumdepth}{0}}
\setcounter{htmldepth}{0}

\title{CSCI 2300 --- Data Structures and Algorithms\\
       Project 3 --- Jog Phone Company}

\author{~}
\begin{document} \maketitle

\subsection*{Submission Deadline: Friday, March 31 at 11:59:59~pm}

Many applications require fast storage and retrieval of data of some
type {\tt T} based on {\em keys} of some other type {\tt Key}.  If the
set of possible keys is a reasonably small range of integers then the
table could simply be an array indexed by the keys.  Often though the
number of possible keys is much too large for an array, as for example
when the keys are people's names or social security numbers.  (Since
social security numbers are 9 digits long, the array would have to
have room for one billion entries.) So instead we must store both keys
and their associated data values, in an abstract data type called a
{\tt table} or {\tt dictionary}.  When we need to retrieve the value
associated with a given key $k$, we have to search for $k$ among the
stored keys.  We want to arrange the table so that such searches can
be done quickly, and we assume that it's also required that inserting
or deleting (key, data) pairs must be efficient operations also.  The
first solution we might think of is an array or vector of (key, data)
pairs, sorted based on the keys, so that keys can be looked up quickly
using a binary search.  (Why is a vector to be preferred over a plain 
array?)  In order to keep the table sorted, a new entry must be made
at the correct place in the table (we can't just put it at the
beginning or end).  In an array, such insertions take linear time
(time proportional to the size of the table), which can be very slow
for a large table.  Thus, using an array or vector is satisfactory
only if the number of insertions and deletions is very small compared
to the number of lookup operations.

The STL library provides {\em sorted associative containers}, which
are a solution to this problem.  With these containers, insertion,
deletion, and lookup are all fast operations---they work in
time proportional to the {\em logarithm} of the size of the container.
These containers are implemented with {\em balanced binary search
trees}, which, as we've seen in Chapter~4 of the textbook, 
does support these operations with a logarithmic bound
on their computing time.  In this project, you are to write a couple
of short programs that construct tables of (key, data) pairs.  One
program should use the STL associative containers, {\tt map}.  
However, it's possible to do better than logarithmic time for insertion,
deletion, and lookup, on average, by using a {\em hash table}, which is another
type of associative container.  Hash tables have average case {\em
constant} time bounds on insertion, deletion, and lookup.  Their
disadvantage is that their worst case behavior can be very bad
compared to balanced binary search trees---linear time instead of
logarithmic.  Achieving the optimal constant time behavior requires
careful choice of the hash function and other parameters.  In this project
you will experiment with these factors and make some measurements of
the performance of your programs using timing functions that are
provided in another of the standard C/C++ libraries, {\tt time.h}.

\section*{An application of associative containers}

The application to be programmed and timed is as follows. The Jog
Phone Company keeps track of phone connections in a table.  The key of
an entry in the table is the calling phone number and the associated
data is the called number and the time the call started.  Insertions
and deletions in this table are made in response to {\em call
events}.  A call event is either a connection or
a disconnection.

The program operates in either a verbose mode, in which it outputs
a commentary on the events and its actions, or a silent mode, in which
it only prints overall statistics at the end of its operation.

In verbose mode, the program's response to a connection event is
either ``{\em n2} is busy'' or ``At time {\em t} connected call from
{\em n1} to {\em n2}'', where {\em n1} is the calling number, {\em n2}
is the called number and {\em t} is the time at which the event was
processed.  The ``busy'' response occurs if there is already a
connection involving the called number.  Otherwise, in either mode, an
entry is made in the table with {\em n1} as the key and ({\em n2, t})
as the associated data, and another entry with {\em n2} as the key and
({\em n1, t}) as the associated data.

The response to a disconnection event with number {\em n} is ``At time
{\em t} disconnected call between {\em n1} and {\em n2}, duration:
{\em d}'' where {\em n1} or {\em n2} is {\em n}, {\em t} is the time
at which the event was processed, and {\em d} is the difference
between {\em t} and the start time of the connection.  There is no
action taken if there is no call in progress with {\em n1} or {\em n2}
equal to {\em n} (i.e., the event is ignored).  But if there is such a
call in progress, it is removed from the table by deleting both of the
entries involving phone number {\em n}.

\section*{Also a discrete event simulation}

The program you are to write (each version) performs a simulation of
the phone company's operation in handling a specified number of call
events and entering them or deleting them from the table.  The type of
simulation it should do is called a \emph{discrete event simulation}.
(See also Weiss, Section~6.4.3.)  Such simulations are an
important application of priority queues.  A discrete event simulation
consists of processing of events, such as the connection and
disconnection events of the phone company.  Events, including the
times at which they occur, might be generated randomly. These times do
not have to be exact time-of-day times; instead we can just use
multiples of some quantum unit, called a tick.  Similarly, times
between events can also be generated randomly.  A simple form of
simulation would be to start the clock at zero ticks, then advance it
one tick at a time, checking to see if there is an event. This is a
\emph{discrete time-driven simulation}.  The problem with a discrete
time-driven simulation is that it can waste a lot of computing time
just advancing the clock tick by tick to the next event. The problem
is worse when the events occur infrequently compared to the
granularity of the clock.  A better way: after completing each event,
advance the clock instantly to the time for the next event. This is
called a discrete event-driven simulation.  Since we need to find the
nearest event in the future among those waiting to be processed, a
priority queue is the natural data structure to use to hold the
events.

\section*{Program parameters and output}

The program should take four command line parameters.
The first parameter {\em N} indicates the number of events to
generate, the second parameter {\em p} is a probability that
determines the rate at which connection events occur, the third
parameter is a seed for the random number generator, and the fourth is
a flag that controls whether the program operates in verbose or silent
mode.  The value {\tt log} means verbose mode, while {\tt nolog} means
silent mode.

A random connection event is generated by using
a random number generator to generate two 7-digit numbers and
a start time for the call.  This information is packaged in the event
and it is placed in the event queue, which is set up as 
a priority queue in which earliest (smallest) times have the
highest priority.  When the connection event is later removed
from the queue, a disconnection event is constructed for
it with its time set to the current time plus a randomly 
generated duration, and this event is entered in the queue.

After the prescribed number of events has been processed, the 
program should display (on {\tt cout}) a line that tells the 
number of calls completed and the average duration of the completed
calls and the average number of active connections
(calls that are still in progress are ignored in this summary).
Finally, it should display the amount of time taken to process all of
the events.  This (real, not simulated) time can be computed using the 
function {\tt clock} from {\tt time.h}:
\begin{verbatim}
   clock_t start, duration;
   ...
   start = clock();
   ... code to be timed
   duration = (double)(clock() - start)/CLOCKS_PER_SEC;
   cout << "Time used: " << duration << " seconds\n";
\end{verbatim}
The type \texttt{clock\_t} and the factor \texttt{CLOCKS\_PER\_SEC} 
are also defined in {\tt time.h}.

\section*{Two versions required: maps versus hash tables}

Write two versions of the program, one that keeps the connected calls
in an STL {\tt map}, and the other that keeps them in an hash table
that you implement.  Represent the phone numbers as strings (C++
strings, using the \texttt{<string>} header), and the times or time 
differences as {\tt float}s.  Thus the {\tt map} should be declared as
\begin{verbatim}
   map<string, pair<string, float> >
\end{verbatim}
That is, it associates a {\tt pair} consisting of a {\tt string} and a
{\tt float} with each key of type {\tt string}, and the ordering of
the keys is computed using the usual {\tt <} operator on {\tt
strings}s.  In the {\tt map}, each entry is of the form {\tt pair({\em
n1}, pair({\em n2}, {\em t}))}. 

For the hash table version, you must also implement a hash table
class.  In \hlink{Sample hash table} 
{http://www.cs.rpi.edu/courses/spring00/dsa/projdir/project3/hash.h}{}
there is already a hash table class, implemented using separate
chaining.  (It is similar to the hash table class in the Chapter~5
online notes, but the interface is extended and modified some to make
it more similar to the the STL map interface.)
You must implement your hash table with \emph{exactly the same public
interface}, but using \emph{open addressing with quadratic probing}
instead of separate chaining.  (See Section 5.4.2 in the textbook,
where some of the implementation is given.)

As mentioned, the public interface of this hash table
class is designed to be very similar to that of the STL map
class---more precisely, it is very similar to the partial
specialization
\begin{verbatim}
template <class ElementType> map<string, ElementType>
\end{verbatim}
i.e., the keys must be strings.  This similarity means that very few
changes to your phone company simulation program should be required to
change it from the map version to the hash table version.  For
simplicity, the given hash table class omits some features of the map
class; for example, it defines \texttt{iterator} and
\texttt{const\_iterator} types but doesn't provide them with all of
the usual iterator operations.  This isn't a problem for the phone
company simulation, because it doesn't need full-blown iterators.
Note that GNU C++ does provide a set of hashed associative containers
(hash\_set, hash\_multiset, hash\_map, hash\_multimap) that have
full-blown iterators and are as fully compatible with the sorted
associated containers as possible. (These containers were actually
developed by the generic programming project headed by Alexander
Stepanov at Silicon Graphics; they are not part of the C++ standard
library but are part of SGI's version of STL.  For purposes of this
project you are not permitted to use any of SGI's hash tables.  In fact,
they don't meet the project specification, because they are implemented
with separate chaining.)

\section*{Performance comparisons}

Since insertions and deletions in the connections table are 
a large part of the computation in the the phone company simulation,
it is a good benchmark for comparing the performance of hashed
associative containers versus sorted associative containers
(hash tables versus balanced binary search trees).  The hash table
version of your program should be somewhat faster than the
map version.  Is it?  How much faster (or slower)?  Include in
the README file you submit with your project a brief report
on the performance results you obtained.  You should report
the times for the two versions on several (silent) runs with
at least the values 1000, 10000, 20000, 40000, 80000 for
the number of connections.

\section*{Additional notes}

\begin{itemize}
\item To get a better understanding of what the Job Phone Company simulation
is supposed to do, look at some \hlink{Sample Runs}
{http://www.cs.rpi.edu/courses/spring00/dsa/projdir/project3/examples.html}{} 
\item You can also find two compiled versions of the program (both using a
map) in \hlink{Programs}
{http://www.cs.rpi.edu/courses/spring00/dsa/projdir/project3/programs/}{}
called {\tt jog} (SUN version) or {\tt jog.exe} (Windows PC version).
Try them out with various command-line
parameters to give you still more information about what is expected.
\item Some additional notes and advice are available in \hlink{Notes}
{http://www.cs.rpi.edu/courses/spring00/dsa/projdir/project3/notes.html}{}
\end{itemize}

\section*{Submission dates and guidelines}

The submission deadline is Friday, March 31 at 11:59:59~pm.  
Projects submitted later than this will NOT be
accepted for credit.

This project will require a number of files, so submission 
will require some extra steps. Make sure that every file 
that you submit has your name, section meeting time, and 
instructor name in a comment at the top.

\subsection*{Copy your files to RCS}

Copy all of the files that you will submit to RCS.  This
will include a readme file, make file, and 
any file with a {\tt .cpp} suffix or a {\tt .h}
suffix, but {\em not}, I repeat {\em not} files with a 
{\tt .exe}, {\tt .dsp}, {\tt .dsw} {\tt .ncb} or {\tt .opt}
suffix.

Your submission must include a text file called {\tt readme.txt}
This file must include your name, section meeting time, 
and instructor name.  In addition, it {\bf must} include a list of all the
inplementation files and any special instructions
that the TA will need to know in order to run your program.

Example:
\begin{verbatim}
Snoop Doggie Dog
Section 8: Thursday, 6 PM
Eric Breimer 

hash.h
main.cpp

To run the program type
jog.exe N p seed [nolog | log]
where N is the number of events
p is the probability of that event occurs in one unit of time
seed is the random number seed
and the last argument is log if you want a full output log
or nolog if you just want summary results
\end{verbatim}

\subsection*{Create a tar file}

Login to a Unix system and do the following:

Create a {\tt tar} file. Tar is an acronym for Tape Archive.
To create a tar file, use the tar command with two flags {\tt c}
(for create) and {\tt f} (for file), followed by the file name of
the tar file (it should have a {\tt .tar} suffix) followed by
the names of all of the files that you want to send.  For example
the following command:\\
\verb+   tar cf temp.tar myfile.h myfile.cpp readme.txt+\\
creates a tar file called {\tt temp.tar} which contains
three files, {\tt myfile.h, myfile.cpp} and {\tt readme.txt}.

You will need to check to make sure that the tar file is correct.
The best way to do this is to copy the tar file to another 
directory and run the following command:
\verb+ tar xf temp.tar+\\
After running this command, the files {\tt myfile,cpp, myfile.h}
and {\tt myfile.rc} will be in the new directory. 
Here are the unix commands to do this:
\begin{verbatim}
mkdir newdir
cp temp.tar newdir/temp.tar
cd newdir
tar xf temp.tar
\end{verbatim}
The first line creates a new directory called {\tt newdir}, the
second copies the tar file to the new directory, the third line
makes {\tt newdir} the current directory, and the fourth line
extracts the files.


\subsection*{Copy the tar file}

The submission directory is {\tt /dept/cs/cs230/project3/}.
In the submission directory are sub-directories for each section.
Copy your tar file to the proper sub-directory.  For example,
if you are in section 8, copy your tar file to 
{\tt /dept/cs/cs230/project3/section8}.
The name of your tar file must be {\em login}{\tt.tar},
where {\em login} is your RCS login.  For example,
if your login name is {\tt snoopd}, your tar file must 
have the name {\tt snoopd.tar}.  If you want to update your
submission, the name of your tar file must be
{\em login}{\tt-2.tar}.  Keep incrementing the number every time
you want to update your file.  Therefor, the name of your 
third update should be {\em login}{\tt-3.tar} and your fourth
update should be {\em login}{\tt-4.tar}.

Use the unix copy command {\tt cp} to copy your submission.  For
example:

\verb+cp temp.tar /dept/cs/cs230/project3/section8/snoopd.tar+

{\bf Please}, make sure you copy your tar file into the
subdirectory for your section and verify that the name of
your tar file is your RCS login. 
{\bf You will not receive credit for your submission unless
you follow the submission guidelines.}

\subsection*{Check list}
\begin{itemize}
\item  Make sure the name of your tar file matches your RCS user id.
\item  Make sure you submit to the proper section.
\item  Make sure your tar file includes all .h and .cpp files.
\item  Make sure all your files are text format before you use tar.
\end{itemize}

\subsection*{Grading criteria}

\begin{itemize}
\item 25\% {\bf Hash Table Class Design and Effort:}
The hash table must use the given interface and must use quadratic probing.
Although your code may not be completely correct and may not compile,
you can still earn the full 25\%.
You must alter the find, insert, erase, CopyTable, 
ExpandTable, DeleteTable, and AllocateTable so the the Hash Table efficiently
implements quadratic probing.

\item 25\% {\bf Discrete Event Simulation Class Design and Effort:} 
Your code must attempt to implement a realistic discrete event simulation. 
Although your code may not be completely correct and may not compile, you 
can still earn the full 25\%.
Your event simulator must attempt to generate N connection events 
according to probability p (the probability that a connection event 
is generated in one unit of time).
The duration of the calls should be between 0 and 2000 seconds and only
valid disconnection events should be generated.


\item 35\% {\bf Hash and Map Comparison}
Your program must have two versions.  One version maintains the 
connection table using a map and the other version uses a 
hash table with quadratic probing. You should report
the running times for the two versions on several runs with
at least the values 1000, 10000, 20000, 40000, 80000 for
the number of connections.  Your program should verify what you've reported.
If you summary information is realistic and consistent, 
and your hash table implementation outperforms the map implementation, 
then you will get 35\%.  
Otherwise, you should explain in your report why you could not achieve 
these results.  If your hash table does not outperform the map 
and you don't offer a reasonable
explanation, then you will get 30\%.  If you were not able to implement 
the quadratic probing hash table, you should still report the data
for the map and you will still recieve 20\%.

\item 10\% {\bf Compilation}
Even if your program does not produce correct or realistic summary 
information, you can still earn
up to 10\% for successful compilation.
You can not get credit for compilation unless
you've made a valid attempt to implement all the components 
(quadratic probing hash table,
discrete event simulation, and summary calculations).

\item 5\% {\bf Coding Guidelines:}
You must make a valid effort to follow the coding guidelines whenever 
appropriate. Use your best judgement to determine which guidelines 
should be followed. 
If you are in doubt, explain (in your comments) your choice to follow or not
to follow certain guidelines.
You can not get credit for the coding guidelines unless
you've made a valid attempt to implement all the components (quadratic probing hash table,
discrete event simulation, and summary calculations).
\end{itemize}

\subsection*{Academic integrity}

Students may discuss the class and algorithm design.
All code must be implemented by the student alone.
Sharing of code among students is expressly
forbidden and will be detected by code comparison tools.  Projects
from all eight sections will be graded together by the TAs.

\end{document}
