Workshop on Frequent Itemset Mining Implementations (FIMI'03)

19 November 2003, Melbourne, Florida, USA
in conjunction with ICDM'03


Submission Rules | Utilities | Frequently Asked Questions

Submission Rules

All submissions to the workshop must adhere to the following guidelines.
  • All submitted code should use the ANSI C/C++ programming language.
    The target compiler will be g++ and the target platform a 3.2GHz Pentium 4 with 1GB of memory and a Western Digital 200GB/7200rpm disk, running Red Hat Linux (kernel 2.4.22).
    For stress tests, we will also try several other (older) platforms with limited memory, for example, a 450MHz Pentium 3 with 256MB of RAM.
    We will perturb the IBM data generator along several dimensions for stress tests of the algorithms.

  • Source code that is accepted will be made publicly available (via a web link or hosted source code) on the FIM algorithm repository.
    Note that we are flexible about licensing issues. The baseline is that all code must be made freely available for research purposes. The authors are free to add restrictions for commercial use. Thus the code may use GPL/Lesser GPL licensing or an NDA (non-disclosure agreement) for research purposes.

  • The code must be accompanied by a paper describing the implementation, with a detailed performance study on public datasets (links will be provided), and a detailed explanation of the efficacy/effectiveness/efficiency of the implementation.

  • Any existing algorithm (or variant thereof) or any new algorithm can be implemented for the FIM tasks (all, closed and maximal set mining). As mentioned above, the authors must describe why the techniques used are likely to be efficient/effective.

  • The code must follow minimal standards of documentation and must be understandable.

  • All submissions should contain a single Makefile which creates the executables 'fim_all' (for all pattern mining), 'fim_closed' (for closed pattern mining) and 'fim_maximal' (for maximal pattern mining), as sketched below.
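
    For illustration only, a minimal GNU Makefile along these lines could produce the three required executables. The source file names (fim_all.cpp, etc.) are hypothetical; only the executable names are prescribed, and recipe lines must start with a tab:

    CXX      = g++
    CXXFLAGS = -O3 -ansi

    all: fim_all fim_closed fim_maximal

    # Build each miner from a hypothetical source file of the same name.
    fim_%: fim_%.cpp
    	$(CXX) $(CXXFLAGS) -o $@ $<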

  • The program should be executable with 3 parameters:
    1. The path of the dataset-file
    2. The minimal support threshold, which should be a non-negative integer. An itemset is frequent if its absolute support is greater than or equal to this threshold.
    3. The output file.
      This parameter should be optional! That is, if it is not provided, the frequent sets and their supports are not written to any output. (also see FAQ)
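
    As a rough sketch (not required code), a main function respecting this calling convention could look as follows; run_mining is a hypothetical entry point for the actual algorithm:

    #include <cstdio>
    #include <cstdlib>

    int main(int argc, char *argv[])
    {
        if (argc < 3) {
            fprintf(stderr, "usage: %s datafile minsup [outputfile]\n", argv[0]);
            return 1;
        }
        const char *datafile = argv[1];
        unsigned int minsup = (unsigned int) atoi(argv[2]); /* absolute support threshold */
        const char *outfile = (argc > 3) ? argv[3] : 0;     /* optional: no output if omitted */

        /* run_mining(datafile, minsup, outfile); */        /* hypothetical mining entry point */
        return 0;
    }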

  • The dataset input must use the following ASCII format only!
    Each transaction is stored on a separate line as a list of items separated by white space and ending with a newline. Each item is a non-negative integer.
    The items in the test datasets will be consecutively numbered starting from 0, and each transaction will be sorted in ascending order. (Note that this is not yet the case for the provided test datasets.)
    For this, the Data class could be used, which is provided here.
    These database classes are provided for uniformity and consistency reasons.
    However, we understand that some algorithms depend on specialized database I/O classes which may not be covered by the provided utilities.
    If well-motivated, we will allow groups to use their own data classes.
    For fairness of comparison, the use of multi-threading, advanced pipelining, low-level memory optimizations, direct exploitation of the hardware, etc. is disallowed!
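
    For those who prefer (and are allowed) to write their own reader, the format can be parsed with a loop along the following lines; this is only a sketch, not the provided Data class:

    #include <fstream>
    #include <sstream>
    #include <string>
    #include <vector>

    /* Read the ASCII transaction format: one transaction per line,
       items are non-negative integers separated by whitespace. */
    std::vector< std::vector<unsigned> > read_transactions(const char *path)
    {
        std::vector< std::vector<unsigned> > db;
        std::ifstream in(path);
        std::string line;
        while (std::getline(in, line)) {
            std::istringstream ss(line);
            std::vector<unsigned> transaction;
            unsigned item;
            while (ss >> item)
                transaction.push_back(item);
            if (!transaction.empty())
                db.push_back(transaction);
        }
        return db;
    }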

  • After execution of the program, the output file (if given as third parameter) should contain all frequent itemsets together with their absolute support.

    For this, the FSout class could be used, which is provided here.

    Each line contains a single frequent itemset as a list of items separated by whitespace. At the end of the line, its absolute support is printed between parentheses.
    For example:

    1 2 3 (100)

    represents the itemset containing items 1, 2 and 3, which has an absolute support of 100.
    Note that the empty set is also considered an itemset, and its support should likewise be printed on a separate line in the output file.
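
    If the provided FSout class is not used, a helper routine along these lines (a sketch, with hypothetical naming) would produce output in the required format:

    #include <cstdio>
    #include <vector>

    /* Write one frequent itemset and its absolute support in the
       required format, e.g. "1 2 3 (100)". */
    void print_itemset(FILE *out, const std::vector<unsigned> &itemset, unsigned support)
    {
        for (std::vector<unsigned>::size_type i = 0; i < itemset.size(); ++i)
            fprintf(out, "%u ", itemset[i]);
        fprintf(out, "(%u)\n", support);
    }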

  • During or after execution, the program must print to the standard output stream, for each length from 0 to k, the number of frequent itemsets of that length, where k is the size of the largest (closed/maximal) frequent itemset.
    If there is no maximal/closed frequent itemset for a given length, print 0!
    For example:

    1
    5
    10
    10
    5
    1

    means that there is 1 frequent itemset of length 0, 5 frequent itemsets of length 1, 10 of length 2, 10 of length 3, 5 of length 4 and 1 of length 5.

    No other output is printed to the standard output stream.
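
    Assuming the per-length counts have been collected in a vector during mining (counts[k] holding the number of frequent itemsets of length k), this summary can be printed with a sketch such as:

    #include <cstdio>
    #include <vector>

    /* Print one count per line, from length 0 up to the largest length;
       lengths with no (closed/maximal) frequent itemset must hold 0. */
    void print_length_histogram(const std::vector<unsigned long> &counts)
    {
        for (std::vector<unsigned long>::size_type k = 0; k < counts.size(); ++k)
            printf("%lu\n", counts[k]);
    }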

Utilities

Frequently Asked Questions

  • If the minimal support threshold is set too low, the number of frequent sets can become too large to be written to the output. Shouldn't output of the frequent sets be optional?

    Although FIM algorithms are meaningless if they do not write the requested sets to output, several people are also interested in the performance results when no frequent itemsets are sent to the output file. Therefore, the third parameter to the program can be considered optional, i.e., if it is not given, no itemsets are written to any output file. (also see submission rules)

  • Can we use temporary files?

    Yes.

  • Can we assume that items are consecutively numbered?

    Yes, the items in the test datasets will be consecutively numbered starting from 0.
    (Note that this is not yet the case for the provided test datasets.)

  • Can we assume that each transaction is sorted?

    Yes, each transaction will be sorted in ascending order.
    (Note that this is not yet the case for some of the provided test datasets.)

  • Can we assume that an item is contained at most once in each transaction?

    Yes.
    (Note that this is not the case for some of the provided test datasets.)

  • According to the submission rules we should perform "a detailed performance study on public datasets".
    Do we have to present experiments on all datasets that are and will be provided?


    No, the experiments should be sufficiently illustrative.
    You do not have to use all provided datasets and you can also use your own datasets.
    In the latter case, we request that you send us all used datasets to be published on the FIMI dataset repository.

  • Do our implementations have to be object oriented?

    No, they only need to compile with an ANSI C/C++ compiler, such as g++.

  • We have our own implementations of some of the well-known FIM algorithms,
    but we don't have any specific 'new' contributions. Can we submit them?


    We welcome such contributions. We will select the best implementations out of all submissions to be posted on the archive.
    We expect that these implementations will get a lot of exposure over time.

  • Do we have to submit an implementation for all three FIM tasks, i.e. all, closed and maximal sets?

    No, you can choose which category or categories you want us to consider your code for.