Software
Disclaimer: The software on this page is provided on an as is basis for research purposes. There is no additional support offered, nor are the author(s) or their institutions liable under any circumstances.
Acknowledgments
We gratefully acknowledge the funding from the following agencies/programs that made the research possible:
- NSF Information and Data Management program (Grant: IIS-0092978)
- NSF Next Generation Software program (Grant: EIA-0103708)
- DOE Office of Science (Grant: DE-FG02-02ER25538)
- CIA/NSA/DNI/IARPA Knowledge Discovery and Dissemination Program (Grants: EIA-0225715, ACI-0342411, CNS-0332960, CNS-0422637, CNS-0540232, IIS-0830218)
- NSF Emerging Models and Technologies for Computation program (Grant: EMT-0829835)
- NIH Biomedical Imaging and Bioengineering Institute (Grant: 1R01EB0080161-02)
- HP Innovation Research Program
Table of Contents (hide)
Data Mining Template Library (DMTL)
DMTL is an open-source, high-performance, generic data mining toolkit, written in C++.
It provides a collection of generic algorithms and data structures for mining increasingly
complex and informative patterns types such as: Itemsets, Sequences, Trees and Graphs.
DMTL utilizes a generic data mining approach, where all aspects of mining are controlled via a set of properties. The kind of pattern to be mined, the kind of mining approach to use, and the kind of data types and formats to mine over are all specified as a list of properties. This provides tremendous flexibility to customize the toolkit for various applications.
DMTL implements all algorithms following the vertical data mining approach. For itemsets, the implementation follows the Eclat approach (without diffsets). For sequences it follows the Spade approach. For trees it implements the SLEUTH framework, which allows one to mine embedded/induced and ordered/unordered trees. Finally, the graph mining framework uses a novel vertical approach.
- Download
- Relevant Publications
Itemset Mining and Association Rules
The section contains the code for mining all frequent itemsets, all closed itemsets, and all maximal frequent itemsets. It also includes the code for constructing the concept lattice (or iceberg lattice) and generating nonredundant association rules.
Frequent Itemsets (Eclat)
Eclat uses the original vertical tidset approach for mining all frequent itemsets (1997-eclat), combined with the diffsets improvement (2003-diffsets). The code also contains the maxeclat, clique, and maxclique approaches mentioned in (2000-eclat:tkde).
- Download
- Eclat code
- You also need to download utilities like getconf, exttpose from Utilities section.
- Relevant Publications
- Mohammed J. Zaki and Karam Gouda, Fast Vertical Mining Using Diffsets. In 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Aug 2003. (PDF) (BibTeX)
- Mohammed J. Zaki, Scalable Algorithms for Association Mining. IEEE Transactions on Knowledge and Data Engineering, 12(3):372-390. May/Jun 2000. (PDF) (BibTeX)
Closed Itemsets (Charm, Charm-L) and Non-redundant Rule Generation
Charm mines all the frequent closed itemsets as described in (2002-charm). Charm-L adds the ability to construct the entire frequent concept lattice (also called as the iceberg lattice), that is, it adds the links between all sub/super-concepts (or closed itemsets) (2005-charm:tkde). This ability is used to mine the non-redundant association rules (2004-nonredundant:dmkd).
- Download
- Charm code
- Charm-L and Non-redundant Rules code
- You also need to download utilities like getconf, exttpose from Utilities section.
- Relevant Publications
- Mohammed J. Zaki and Ching-Jui Hsiao, Efficient Algorithms for Mining Closed Itemsets and their Lattice Structure. IEEE Transactions on Knowledge and Data Engineering, 17(4):462-478. Apr 2005. (PDF) (BibTeX)
- Mohammed J. Zaki, Mining Non-Redundant Association Rules. Data Mining and Knowledge Discovery: An International Journal, 9(3):223-248. Nov 2004. (PDF) (BibTeX)
Maximal Itemsets (GenMax)
Genmax mines all maximal frequent itemsets via a backtracking approach with progressive focusing.
- Download
- GenMax code
- You also need to download utilities like getconf, exttpose from Utilities section.
- Relevant Publications
Frequent Boolean Expressions (BLOSOM)
The BLOSOM framework allows one to mine arbitrary frequent boolean expressions include AND clauses (itemsets), OR clauses, and CNF/DNF expressions. It focuses on mining the minimal boolean expressions.
- Download
- BLOSOM code: mug find minimal or-clauses, and-clauses, CNF and DNF expressions, xng finds closed CNF expressions and xug find closed DNF expressions.
- Charm-L: this code can mine all minimal and closed and-clauses, i.e., all minimal and closed frequent itemsets.
- Relevant Publications
Sampling Minimal Boolean Expressions (minDNF)
The minDNF method samples minimal boolean expressions in DNF.
- Download
Sequence Mining
Frequent Sequences (Spade)
Spade uses the vertical format for mining the set of all frequent sequences from a dataset of many sequences. It also mines the frequent sequences of itemsets.
- Download
- Spade code
- You also need to download utilities like getconf, exttpose from Utilities section.
- Relevant Publications
Sequence Constraints (cSpade)
cSpade mines constrained frequent sequences. The constraints can take the form of length or width limitations on the sequences, minimum or maximum gap constraints on consecutive sequence elements, applying a time window on allowable sequences, incorporating item constraints, and finding sequences predictive of one or more classes. The class specific sequences can be used for sequence classification as described in {(:cell PQA(PSS(2000-featuremine:is):)}.
- Download
- cSpade code
- cSpade (win) code: same as above, but for the Windows platform (Courtesy: Daniel Diaz, University of Paris 1 - Pantheon Sorbonne).
- aRulesSequences (R): R package that contains the cSPADE code (Courtesy: Christian Buchta and Michael Hahsler, Vienna University of Economics and Business Administration).
- You also need to download utilities like getconf, exttpose from Utilities section.
- Relevant Publications
Itemset and Sequence Utilities
Provides the utilities needed for Eclat, Charm/Charm-L, GenMax, Spade and cSpade.
- Download
- Utilities code: provides exttpose, getconf, and makebin
- Utilities (win) code: same as above, but for the Windows platform (Courtesy: Daniel Diaz, University of Paris 1 - Pantheon Sorbonne).
Tree Mining
TreeMiner and TreeMiner-D
Treeminer uses the vertical scope-list representation to mine frequent sequences (2005-treeminer:tkde). Treeminer counts all embeddings, whereas Treeminer-D counts only distinct occurrences, which can be more appropriate for some datasets.
- Download
- TreeMiner code: vtreeminer is the TreeMiner code, whereas htreeminer is the PatternMatcher code as mentioned in the paper below.
- TreeMiner-D code
- Relevant Publications
SLEUTH
SLEUTH extends the TreeMiner methodology to mine all frequent embedded or induced as well as ordered or unordered tree patterns.
- Download
- Relevant Publications
XML Classification (XRules)
XRules uses frequent tree mining to mine discriminative patterns from tree-structured XML documents.
- Download
- Relevant Publications
Graph Mining & Indexing
Representative Orthogonal Graph Mining (Origami)
Origami uses random walks over the graph partial order to mine a representative sample of the maximal frequent subgraph patterns. It next selects only the orthogonal representative patterns via a local clique-finding method.
- Download
- Relevant Publications
Graph Pattern Sampling (Output Space Sampling)
Whereas Origami can mine a sample of maximal graph patterns it does not provide any uniformity guarantee. MUSK (2009-musk) proposes a Markov Chain Monte Carlo based approach to guarantee a uniform sample of all maximal patterns. The MCMC approach was further extended in (2009-graphsampling) to mine a sample of all frequent patterns, to mine support-biased patterns and also to mine a sample of discriminative patterns.
- Download
- Graph Sampling code: for all, support-biased and discriminative graph pattern sampling
- Relevant Publications
- Mohammad Al Hasan and Mohammed J. Zaki, MUSK: Uniform Sampling of k Maximal Patterns. In 9th SIAM International Conference on Data Mining. Apr 2009. (PDF) (BibTeX)
- Mohammad Al Hasan and Mohammed J. Zaki, Output Space Sampling for Graph Patterns. Proceedings of the VLDB Endowment (35th International Conference on Very Large Data Bases), 2(1):730-741. 2009. (PDF) (BibTeX)
Scalable Graph Reachability Indexing (GRAIL)
GRAIL uses random multiple interval labelings, and a variety of optimizations to perform rapid reachability testing in very large graphs (with millions of nodes and edges).
- Download
- Relevant Publications
Clustering
This section contains code for mining categorical subspace clusters, shape based clusters, for clustering based on a lower bound on similarity, and a new outlier-based approach for initial cluster seed selection.
Interesting Subspace Mining (SCHISM)
SCHISM finds interesting subspace clusters.
- Download
- Relevant Publications
- Karlton Sequeira and Mohammed J. Zaki, SCHISM: A New Approach for Interesting Subspace Mining. In 4th IEEE International Conference on Data Mining. Nov 2004. (PDF) (BibTeX)
- Karlton Sequeira and Mohammed J. Zaki, SCHISM: A New Approach to Interesting Subspace Mining. International Journal of Business Intelligence and Data Mining, 1(2):137-160. 2005. (PDF) (BibTeX)
Categorical Subspace Clustering (CLICKS)
CLICKS finds subspace clusters in categorical data using a k-partite clique mining approach.
- Download
- Relevant Publications
Arbitrary Shape Clustering (Sparcl)
Sparcl finds shape-based clusters. It uses a two step approach: in the first step we select a relatively large number of candidate centroids (via ROBIN) to find seed clusters via the K-means algorithm and in the second step we use a novel similarity kernel to merge the initial seed clusters to yield the final arbitrary shaped clusters.
- Download
- Relevant Publications
K-means Initialization (Robin)
ROBIN uses a new local outllier factor based initial seed selection to improve k-means style clustering.
- Download
- Relevant Publications
Microarray Gene Expression Clustering (TriCluster and MicroCluster)
Tricluster is the first tri-clustering algorithm for microarray expression clustering. It builds upon the new microCluster bi-clustering approach. Tricluster first mines all the bi-clusters across the gene-sample slices, and then it extends these into tri-clusters across time or space (depending on the third dimension). It can find both scaling and shifting patterns.
- Download
- Relevant Publications
- Lizhuang Zhao and Mohammed J. Zaki, TriCluster: An Effective Algorithm for Mining Coherent Clusters in 3D Microarray Data. In ACM SIGMOD Conference on Management of Data. Jun 2005. (PDF) (BibTeX)
- Lizhuang Zhao and Mohammed J. Zaki, MicroCluster: An Efficient Deterministic Biclustering Algorithm for Microarray Data. IEEE Intelligent Systems, 20(6):40-49. Nov/Dec 2005. (PDF) (BibTeX)
Biological Sequence Analysis
This section contains code for sequence modeling via Hidden Markov Models, code for structured motif extraction and search, and genome-scale disk-based suffix tree indexing.
Hidden Markov Models: Variable Order HMM with Duration (VOGUE)
VOGUE is a variable order and gapped HMM with with duration. It uses sequence mining to extract frequent patterns in the data. It then uses these patterns to build a variable order HMM with explicit duration on the gap states, for sequence modeling and classification. VOGUE was applied to model protein sequences, as well as a number of other sequence datasets including weblogs.
- Download
- Relevant Publications
Genome Scale Indexing: Disk-based Suffix Trees (Trellis and Trellis+)
Trellis is a disk-based suffix tree indexing methods (with suffix links) that is capable of indexing the entire human genome on a commodity PC with limited memory. Trellis+ extends Trellis by further removing some memory limitations by using a novel guide suffix tree in memory.
- Download
- Trellis code: input sequence must be in memory
- Trellis+ code: removes all memory limitations
- Relevant Publications
- Benjarath Phoophakdee and Mohammed J. Zaki, Genome-scale Disk-based Suffix Tree Indexing. In ACM SIGMOD International Conference on Management of Data. Jun 2007. (PDF) (BibTeX)
- Benjarath Phoophakdee and Mohammed J. Zaki, TRELLIS+: An Effective Approach for Indexing Genome-Scale Sequences using Suffix Trees. In 13th Pacific Symposium on Biocomputing. Jan 2008. (PDF) (BibTeX)
Structured Sequence Motifs: Search and Extraction (sMotif and exMotif)
sMotif and exMotif are two complementary for searching and extracting/mining structured sequence motifs DNA sequences. A structured motif consists of simple motifs separated by different gap lengths. The simple motif may be a simple pattern or a position weighted matrix or profile. Given a template structured motif (pattern or profile), sMotif finds all matches in a given set of sequences. On the other hand, exMotif mines novel motifs matching some minimal conditions on the gaps and frequency.
- Download
- exMotif code: for structured motif extraction
- sMotif code: for structured motif search
- Relevant Publications
- Yongqiang Zhang and Mohammed J. Zaki, EXMOTIF: efficient structured motif extraction. Algorithms for molecular biology, 1(21). Nov 2006. ((URL)) (PDF) (BibTeX)
- Yongqiang Zhang and Mohammed J. Zaki, SMOTIF: efficient structured pattern and profile motif search. Algorithms for molecular biology, 1(22). Nov 2006. ((URL)) (PDF) (BibTeX)
Protein Structure
Flexible and Non-sequential Protein Structure Alignment (SNAP/STSA and FlexSnap)
SNAP finds non-sequential 3D protein structure alignments. It was initially called STSA (2008-snap). FlexSnap allows the ability to find both flexible and non-sequential allignments.
- Download
- Relevant Publications
- Saeed Salem, Mohammed J. Zaki and Chris Bystroff, Iterative Non-Sequential Protein Structural Alignment. Journal of Bioinformatics and Computational Biology, 7(3):571-596. Jun 2009. (PDF) (BibTeX)
- Saeed Salem, Mohammed J. Zaki and Chris Bystroff, FlexSnap: Flexible Non-Sequential Protein Structure Alignment. In 9th Workshop on Algorithms in Bioinformatics. Sep 2009. (PDF) (BibTeX)
Protein Docking and Partial Shape Matching (ContextShapes)
ContextShapes does rigid-body protein docking. It uses a novel contextshapes data structure to represent local surface regions/shapes on the protein. All critical points on both the receptor and ligand are represented via context shapes, and the best docking is found via pair-wise matching.
- Download
- Relevant Publications
Protein Indexing (PSIST)
PSIST uses suffix trees to index protein 3D structure. It first converts the 3D structure into a structure-feature sequence over a new structural alphabet, which is then used to index protein structures. The PSIST index makes it very fast to query for a matching structural fragment.
- Download
- Relevant Publications
Protein Folding Pathways (UNFOLD)
UNFOLD uses a recursive min-cut on a weighted secondary structure element graph to predict the sequence of protein (un)folding events.
- Download
- Relevant Publications
Real and Synthetic Datasets
The section contains various synthetic and real datasets used in some of the papers related to itemset, sequence, tree and XML mining.
- Download
- IBM Datagen program: contains the IBM synthetic dataset generator for itemset patterns
- Tree Generator: contains the synthetic tree generator described in (2005-treeminer:tkde)
- Real Datasets: contains various real itemset datasets like chess, connect, mushroom, pumsb and so on, used in the papers on frequent, closed and maximal itemset mining
- CSLOGS data: The CSLOGS data was used for (2005-treeminer:tkde)
- Xrules Log Data: The log data used for XML classification in (2006-xrules:mlj)
- Xrules synthetic datasets: The synthetic classification data used for XML classification in (2006-xrules:mlj)
- Plan dataset: Planning dataset for sequence mining


