CSCI 2300 -- Data Structures and Algorithms
Project 2 -- File Compression, Inc.
Submission Deadline: Friday,
March 3 at 11:59:59 pm
Your successful completion of the ``Meows and Mutts'' project has resulted
in a mild case of kennel cough and a referral to File Compression, Inc.,
a company specializing in software for compressing the size of files. Mr.
Zvi Lepmel, Chief Technical Officer of FCI, has brought you in as a consultant
to help with the implementation of a new product that performs a type of
compression known as
Huffman coding. (You can find some background
information on this technique in your textbook in Section 10.1.2.)
Huffman coding requires the ability to manipulate binary trees, so the
main deliverable for this project will be the implementation of a binary
tree class, similar to the generic container classes found in the C++ Standard
Library. You have already met with the FCI team and sketched out the following
interface for the class:
template <class T>
class binary_tree
{
public:
binary_tree();
~binary_tree();
binary_tree(const T &);
binary_tree(const binary_tree &);
binary_tree & operator=(const binary_tree &);
friend class iterator;
class iterator
{
public:
friend class binary_tree;
iterator();
~iterator();
T & operator*() const;
bool operator!=(const iterator &) const;
bool operator==(const iterator &) const;
iterator left() const; // go to left child node
iterator right() const; // go to right child node
iterator parent() const; // go to parent node
bool is_root() const; // true if at root
bool is_leaf() const; // true if at a leaf
};
iterator begin() const;
iterator end() const;
T & root();
void combine(binary_tree & t1, binary_tree & t2);
};
In addition to the usual constructors and destructors, the binary tree
class also contains an iterator class for traversing over the nodes in
the tree. Binary tree iterators support some of the normal iterator operations,
plus some special functions for traversing binary trees. Additional member
functions of the binary tree class provide iterators to the ``begin'' (root)
node and a special ``end'' node not in the tree, and fetch the value stored
in the root node of the tree.
Here is an example of how the binary tree class and its iterator class
is intended to be used:
// Print out contents of tree using an ``in-order'' traversal.
void inorder_traversal(const binary_tree<int>::iterator & i)
{
if (i.is_leaf())
{
std::cout << *i << " ";
}
else
{
inorder_traversal(i.left());
std::cout << *i << " ";
inorder_traversal(i.right());
}
}
void main()
{
binary_tree<int> tree;
... // put some stuff into the tree
inorder_traversal(tree.begin());
}
One member function deserves special mention. Since a binary tree container
will be made up of dynamically allocated nodes, it is possible to ``combine''
two binary trees as the children of the root node of a third binary tree
in constant time (this is similar to how STL lists can be spliced together).
A special function is provided in the header and should be implemented
by you to do exactly this. The nodes in t1 become the left subtree,
and the nodes in t2 become the right subtree of the third tree
(represented by *this). The ``root'' tree must consist of only
a single node, and both t1 and t2 are empty afterwards.
This sounds complicated, but remember it will be done in constant time!
With the help of this binary tree class (and some useful standard library
components), a class for performing Huffman encoding and decoding can be
implemented. The interface for this class is quite simple it provides two
member functions, one to encode and one to decode. Both use standard library
streams for their input and output.
class huffman_codec
{
public:
void encode(std::istream &, std::ostream &);
void decode(std::istream &, std::ostream &);
};
Encoding Algorithm
The Huffman encoding algorithm is as follows:
-
1.
-
Construct the encoding tree:
-
(a)
-
Count the frequency of each byte value in the input (note that there are
28 = 256 possible byte values).
-
(b)
-
Construct an initial binary tree for each byte value, weighted by its frequency.
-
(c)
-
Repeat until there is only a single tree remaining: combine the two trees
with the smallest weights into a new tree, with weight equal to the sum
of that of the two combined trees.
-
2.
-
Construct the encoding table: for each byte value in the input, determine
its code by tracing from the leaf node representing that value to the root
of the tree. (Note that this will give you the code backwards, and
you will need to reverse it.)
-
3.
-
Write out the compressed data:
-
(a)
-
Write out the encoding tree (it is necessary to decode the file later).
See below for a technique to do this.
-
(b)
-
Write out the count of characters in the input.
-
(c)
-
Write out the compressed data.
Decoding Algorithm
To decode a file:
-
1.
-
Read in and reconstruct the encoding tree (see below).
-
2.
-
Read in the character count.
-
3.
-
While there are characters left to decode:
-
(a)
-
Start at the root of the encoding tree.
-
(b)
-
While not at a leaf node, read a bit from the input and go to the left
or right child as appropriate.
-
(c)
-
Output the character in the leaf node.
Encoding Tree Output
Writing out the encoding tree is fairly straightforward. First, we can
ignore the weight information, since it is only used in the initial construction
of the tree from the frequency data. Also, we can take advantage of the
fact that every node in the tree is either a leaf or has exactly two children.
The algorithm to write out the tree is recursive:
-
1.
-
If the root node of the tree is itself a leaf:
-
(a)
-
Write out the byte value stored in the leaf, preceded by the character
``.'' (period).
-
(b)
-
Return.
-
2.
-
Otherwise:
-
(a)
-
Recursively write out the left subtree, surrounded by parentheses.
-
(b)
-
Recursively write out the right subtree, surrounded by parentheses.
Encoding Tree Input
Reading in and reconstructing the tree is also recursive:
-
1.
-
Read a character byte.
-
2.
-
If the character is ``.'' (period), then we have a leaf:
-
(a)
-
Read the next byte, and create a new tree with that value in its node.
-
(b)
-
Return the new tree.
-
3.
-
Otherwise, the character must be a left parentheses, indicating we have
two subtrees:
-
(a)
-
Recursively read in the left subtree.
-
(b)
-
Read in and skip the closing right parentheses and the next left parentheses.
-
(c)
-
Recursively read in the right subtree.
-
(d)
-
Read in and skip the closing right parentheses.
-
(e)
-
Create a new tree combining the left and right subtrees, and return it.
Program synopsis and input format
The program will be run twice with the command-lines
huffman encode input-file compressed-file
huffman decode compressed-file output
The decoded output will be compared to (and should be identical with) the
input file. You can use any file for sample input (even binary files).
Notes, hints, and assumptions
-
Engineers at FCI have already implemented a class for reading and writing
individual bits, as well as a main program to test your classes. The code
can be found in:
-
The bit_unpacker class provides a function that returns individual
bits from a stream:
bool read(std::istream & in);
The bit_packer class provides two functions, one that writes a
string of bits to a stream, and a second that should be used after you
have processed the last bit to ensure all bits are actually written (this
is necessary since the system can only write bytes, or groups of 8 bits,
at a time):
void write(std::ostream &, const std::basic_string<bool> &);
void flush(std::ostream &);
It is easiest to deal with strings of bits using the following type:
std::basic_string<bool>
This type works just like the std::string class you have used
before (including working with standard library iterators and algorithms).
You will need some way to keep track of the ``forest'' of binary trees
you are combining to form the final encoding tree. To do this, use the
priority_queue
class in the standard library.
Remember that standard library containers copy the values that they hold.
It will be very expensive to copy an entire binary tree each time it is
added to or removed from a container, so store a pointer to the tree instead.
The standard library component pair will be useful for holding
two items (the weight and the byte value) in a binary tree node.
What happens if the file being compressed is empty? What if it contains
only a single byte (perhaps repeated over and over)? What should the compressed
file look like for these cases? Does your final program handle them correctly?
If you are using Microsoft Visual C++, you may get some warnings like the
following:
c:\program files\devstudio\vc\include\iosfwd(83) : warning
C4804: '<' : unsafe use of type 'bool' in operation
c:\program files\devstudio\vc\include\iosfwd(127) : warning
C4305: 'return' : truncation from 'const int' to 'bool'
These are in the standard library and can be safely ignored.
Implementation environment
Your program must compile and run using the Visual C++ developement environment.
Submission dates and guidelines
The submission deadline is Friday, March 3 at 11:59:59 pm. Projects submitted
later than this will NOT be accepted for credit.
This project will require a number of files, so submission will require
some extra steps. Make sure that every file that you submit has your name,
section meeting time, and instructor name in a comment at the top.
Copy your files to RCS
Copy all of the files that you will submit to RCS. This will include a
readme file, make file, and any file with a .cpp suffix or a .h
suffix, but not, I repeat not files with a .exe,
.dsp,
.dsw
.ncb or .opt suffix.
Your submission must include a text file called readme.txt
This file must include your name, section meeting time, and instructor
name. In addition, it must include a list of all the inplementation
files and any special instructions that the TA will need to know in order
to run your program.
Example:
Snoop Doggie Dog
Section 8: Thursday, 6 PM
Eric Breimer
binary_tree.h
huffman_codec.h
huffman_codec.cpp
main.cpp
To run the program type
huffman encode infile compressed-file
or
huffman decode compressed-file outfile
Create a tar file
Login to a Unix system and do the following:
Create a tar file. Tar is an acronym for Tape Archive. To create
a tar file, use the tar command with two flags c (for create)
and f (for file), followed by the file name of the tar file (it
should have a .tar suffix) followed by the names of all of the
files that you want to send. For example the following command:
tar cf temp.tar myfile.h myfile.cpp readme.txt
creates a tar file called temp.tar which contains three files,
myfile.h,
myfile.cpp and readme.txt.
You will need to check to make sure that the tar file is correct. The
best way to do this is to copy the tar file to another directory and run
the following command:
tar xf temp.tar
After running this command, the files myfile,cpp, myfile.h
and myfile.rc will be in the new directory. Here are the unix
commands to do this:
mkdir newdir
cp temp.tar newdir/temp.tar
cd newdir
tar xf temp.tar
The first line creates a new directory called newdir, the second
copies the tar file to the new directory, the third line makes newdir
the current directory, and the fourth line extracts the files.
Copy the tar file
The submission directory is /dept/cs/cs230/project2/. In the submission
directory are sub-directories for each section. Copy your tar file to the
proper sub-directory. For example, if you are in section 8, copy your tar
file to /dept/cs/cs230/project2/section8. The name of your tar
file must be login.tar, where login is your RCS
login. For example, if your login name is snoopd, your tar file
must have the name snoopd.tar. If you want to update your submission,
the name of your tar file must be
login-2.tar. Keep incrementing
the number every time you want to update your file. Therefor, the name
of your third update should be login-3.tar and your fourth
update should be login-4.tar.
Use the unix copy command cp to copy your submission. For example:
cp temp.tar /dept/cs/cs230/project2/section8/snoopd.tar
Please, make sure you copy your tar file into the subdirectory
for your section and verify that the name of your tar file is your RCS
login. You will not recieve credit for your submission unless you follow
the submission guidelines.
Check list
-
Make sure the name of your tar file matches your RCS user id.
-
Make sure you submit to the proper section.
-
Make sure your tar file includes all .h and .cpp files.
-
Make sure all your files are text format before you use tar.
Grading criteria
-
25% Binary Tree Implementation and Design:
Your binary tree must implement the functions described in the binary_tree.h
file that was given. Although your code may not be completely correct and
may not compile, you can still earn the full 25% (given a valid effort).
-
20% Encoding Implementation and Design:
Your code must attempt to implement the huffman coding algorithm. You
must properly use the binary tree to generate encodings. Although your
code may not be completely correct and may not compile, you can still earn
the full 20% (given a valid effort).
-
10% Decoding Implementation and Design:
Your code must attempt to implement the huffman decoding algorithm.
You must properly use the binary tree to decode the compressed text. Although
your code may not be completely correct and may not compile, you can still
earn the full 10% (given a valid effort).
-
15% Coding Guidelines:
Failure to follow the coding guidelines will result in up to a 15%
penalty. You can not get credit for the coding guidelines unless you've
made a valid attempt to implement all the components (binary tree, encoding
algorithm, and decoding algorithm).
-
10% Compilation:
Even if you program does not encode or decode properly, you can still
earn up to 10% for successful compilation. You can not get credit for compilation
unless you've made a valid attempt to implement all the components (binary
tree, encoding algorithm, and decoding algorithm).
-
20% Correctness:
Using the encode option, your program should generate a compressed
file that is smaller than the original input file (assume that you are
given a reasonably sized input file; some small files will not compress
well). Then, using the decode option, your program must generate the original
input file. Proper encoding can not be verified without proper decoding.
If your program encodes and decodes properly, you earn an automatic 20%
Academic integrity
Students may discuss the class and algorithm design. All code must be implemented
by the student alone. Sharing of code among students is expressly forbidden
and will be detected by code comparison tools. Projects from all six sections
will be graded together by the TAs.
Eric Breimer
2/6/2000