Workshop on Frequent Itemset Mining Implementations (FIMI'03)
19 November 2003, Melbourne, Florida, USA
in conjunction with ICDM'03
Submission Rules |
Frequently Asked Questions
All submissions to the workshop must adhere to the following
- All submitted code should use the ANSI C/C++ programming language.
The target compiler and platform will be g++ and Linux respectively.
The target platform will be a 3.2GHz Pentium 4 with 1GB memory running RedHat Linux 2.4.22 and a
WesternDigital 200GB/7200rpms disk.
For stress tests, we will also try several other (older) platforms
with limited memory, for example, 450Mhz/P3 with 256MB RAM.
We will perturb the IBM data generator along several dimensions for stress tests of the algorithms.
Source code that is accepted will be made publicly available
(via a web link or source code) on the FIM algorithm repository.
Note that we are flexible about the licensing issues.
The base line is that all code must be made freely available for research purposes.
The authors are free to add restrictions for commercial uses.
Thus the code may use GPL/Lesser GPL licensing or an NDA
(non disclosure agreement) for research purposes.
- The code must be accompanied by a paper describing the implementation,
with a detailed performance study on public datasets (links will be provided),
and a detailed explanation on the efficacy/effectiveness/efficiency of the implementation.
- Any existing (including variants) or new algorithm can be
implemented for the FIM tasks (all, closed and maximal set
mining). As mentioned above, the authors must describe why the
techniques used are likely to be efficient/effective.
- The code must follow minimal standards of documentation and must be understandable.
All submissions should contain a single Makefile
which creates an executable 'fim_all' (for all pattern mining),
'fim_closed' (for closed pattern mining) and 'fim_maximal' (for maximal
The program should be executable with 3 parameters:
- The path of the dataset-file
- The minimal support threshold, which should be a non-negative integer.
An itemset is frequent if its absolute support is larger or equal to this threshold.
- The output file.
This parameter should be optional! I.e., if it is not provided,
the frequent sets and their supports will not be sent to any output.
(also see FAQ)
- The dataset input must use the following ascii format only!
Each transaction is stored on a separate line as a list of items separated by
white space and ending with a newline. Each item is a non-negative integer.
The items in the test datasets will be consecutively numbered starting from 0, and each transaction
will be sorted ascending. (Note that this is not yet the case for the provided test datasets.)
For this, the Data class could be used, which is provided here.
These database classes are provided for uniformity and consistency reasons.
However, we understand that some algorithms depend on specialized database I/O classes
which may not be covered by the provided utilities.
If well-motivated, we will allow groups to use their own data classes.
For fairness of comparison, the use of multi-threading, advanced pipelining, low level memory optimizations,
or direct expolitation of the hardware, etc. are disallowed!
After execution of the program, the output file (if given as third parameter) should contain
all frequent itemsets together with their absolute support.
For this, the FSout class could be used, which is provided here.
Each line contains a single frequent itemset as a list of items separated by whitespace.
At the end of the line, its absolute support is printed between parantheses.
1 2 3 (100)
represents the itemset containing items 1,2 and 3 and it has an absolute support of 100.
Note that also the empty set is considered as an itemset and its support should also be printed
on a separate line in the output file.
During or after execution, the program must print
the number of frequent itemsets for each length between
0 and k separately to the standard output stream,
with k the size of the largest (closed/maximal) frequent set.
If there is no maximal/closed frequent itemset for a given length, print 0!
means that there is 1 frequent itemset of length 0, 5 frequent itemsets of length 1,
10 of length 2, 10 of length 3, 5 of length 4 and 1 of length 5.
No other output is printed to the standard output stream.
Frequently Asked Qestions
- If the minimal support threshold is set too low,
the number of frequent sets can become too large to
be send to output. Shouldn't output of the frequent sets be optional?
Although FIM algorithms are meaningless if they do not write the
wanted sets to output, it seems several people are also interested
in the performance results when no frequent itemsets are sent to
the output file. Therefore, the third parameter to the program
can be considered optional, i.e., if it is not given, no itemsets
are written to any output file. (also see submission rules)
- Can we use temporary files?
- Can we assume that items are consecutively numbered?
Yes, the items in the test datasets will be consecutively numbered starting from 0.
(Note that this is not yet the case for the provided test datasets.)
- Can we assume that each transaction is sorted?
Yes, each transaction will be sorted ascending.
(Note that this is not yet the case for some of the provided test datasets.)
- Can we assume that an item is contained at most once in each transaction?
(Note that this iss not the case for some of the provided test datasets.)
- According to the submission rules we should perform
"a detailed performance study on public datasets".
Do we have to present experiments on all datasets that are
and will be provided?
No, the experiments should be sufficiently illustrative.
You do not have to use all provided datasets and you can also use your own datasets.
In the latter case, we request that you send us all used datasets to be published
on the FIMI dataset repository.
- Do our implementations have to be object oriented?
No, they should only be compilable with an ANSI C/C++ compiler, such as g++.
- We have our own implementation of some of the well known FIM algorithms,
but we don't have any specific 'new' contributions, can we submit them?
We welcome such contributions. We will select the best implementations
out of all submissions to be posted on the archive.
We expect that these implementations will get a lot of exposure over time.
Do we have to submit an implementation for all three FIM tasks, i.e. all, closed and maximal sets?
No, you can choose which category or categories you want us to consider your code for.