Software

Disclaimer: The software on this page is provided on an as is basis for research purposes. There is no additional support offered, nor are the author(s) or their institutions liable under any circumstances.

Acknowledgments

We gratefully acknowledge the funding from the following agencies/programs that made the research possible:

  • NSF Information and Data Management program (Grant: IIS-0092978)
  • NSF Next Generation Software program (Grant: EIA-0103708)
  • DOE Office of Science (Grant: DE-FG02-02ER25538)
  • CIA/NSA/DNI/IARPA Knowledge Discovery and Dissemination Program (Grants: EIA-0225715, ACI-0342411, CNS-0332960, CNS-0422637, CNS-0540232, IIS-0830218)
  • NSF Emerging Models and Technologies for Computation program (Grant: EMT-0829835)
  • NIH Biomedical Imaging and Bioengineering Institute (Grant: 1R01EB0080161-02)
  • HP Innovation Research Award
  • Google Faculty Research Award
  • NVIDIA Academic Partnership Award




Data Mining Template Library (DMTL)


DMTL is an open-source, high-performance, generic data mining toolkit, written in C++. It provides a collection of generic algorithms and data structures for mining increasingly complex and informative patterns types such as: Itemsets, Sequences, Trees and Graphs.

DMTL utilizes a generic data mining approach, where all aspects of mining are controlled via a set of properties. The kind of pattern to be mined, the kind of mining approach to use, and the kind of data types and formats to mine over are all specified as a list of properties. This provides tremendous flexibility to customize the toolkit for various applications.

DMTL implements all algorithms following the vertical data mining approach. For itemsets, the implementation follows the Eclat approach (without diffsets). For sequences it follows the Spade approach. For trees it implements the SLEUTH framework, which allows one to mine embedded/induced and ordered/unordered trees. Finally, the graph mining framework uses a novel vertical approach.

  • Download
  • Relevant Publications
    • Vineet Chaoji, Mohammad Al Hasan, Saeed Salem and Mohammed J. Zaki, An integrated, generic approach to pattern mining: data mining template library. Data Mining and Knowledge Discovery, 17(3):457-495. Dec 2008. (PDF) (BibTeX)

↑ Back to Table of Contents



Itemset Mining and Association Rules

The section contains the code for mining all frequent itemsets, all closed itemsets, and all maximal frequent itemsets. It also includes the code for constructing the concept lattice (or iceberg lattice) and generating nonredundant association rules.

Frequent Itemsets (Eclat)

Eclat uses the original vertical tidset approach for mining all frequent itemsets (1997-eclat), combined with the diffsets improvement (2003-diffsets). The code also contains the maxeclat, clique, and maxclique approaches mentioned in (2000-eclat:tkde).

  • Download
  • Relevant Publications
    • Mohammed J. Zaki and Karam Gouda, Fast Vertical Mining Using Diffsets. In 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Aug 2003. (PDF) (BibTeX)
    • Mohammed J. Zaki, Scalable Algorithms for Association Mining. IEEE Transactions on Knowledge and Data Engineering, 12(3):372-390. May/Jun 2000. (PDF) (BibTeX)
    • Mohammed J. Zaki, Srinivasan Parthasarathy, Mitsunori Ogihara and Wei Li, New algorithms for fast discovery of association rules. In 3rd International Conference on Knowledge Discovery and Data Mining (KDD). Aug 1997. (PDF) (BibTeX)



Closed Itemsets (Charm, Charm-L) and Non-redundant Rule Generation

Charm mines all the frequent closed itemsets as described in (2002-charm). Charm-L adds the ability to construct the entire frequent concept lattice (also called as the iceberg lattice), that is, it adds the links between all sub/super-concepts (or closed itemsets) (2005-charm:tkde). This ability is used to mine the non-redundant association rules (2004-nonredundant:dmkd).

  • Download
  • Relevant Publications
    • Mohammed J. Zaki and Ching-Jui Hsiao, Efficient Algorithms for Mining Closed Itemsets and their Lattice Structure. IEEE Transactions on Knowledge and Data Engineering, 17(4):462-478. Apr 2005. (PDF) (BibTeX)
    • Mohammed J. Zaki, Mining Non-Redundant Association Rules. Data Mining and Knowledge Discovery: An International Journal, 9(3):223-248. Nov 2004. (PDF) (BibTeX)
    • Mohammed J. Zaki and Ching-Jui Hsiao, CHARM: An Efficient Algorithm for Closed Itemset Mining. In 2nd {SIAM} International Conference on Data Mining. Apr 2002. (PDF) (BibTeX)



Maximal Itemsets (GenMax)

Genmax mines all maximal frequent itemsets via a backtracking approach with progressive focusing.

  • Download
  • Relevant Publications
    • Karam Gouda and Mohammed J. Zaki, GenMax: An Efficient Algorithm for Mining Maximal Frequent Itemsets. Data Mining and Knowledge Discovery: An International Journal, 11(3):223-242. Nov 2005. (PDF) (BibTeX)
    • Karam Gouda and Mohammed J. Zaki, Efficiently Mining Maximal Frequent Itemsets. In 1st IEEE International Conference on Data Mining. Nov 2001. (PDF) (BibTeX)



Frequent Boolean Expressions (BLOSOM)

The BLOSOM framework allows one to mine arbitrary frequent boolean expressions include AND clauses (itemsets), OR clauses, and CNF/DNF expressions. It focuses on mining the minimal boolean expressions.

  • Download
    • BLOSOM code: mug find minimal or-clauses, and-clauses, CNF and DNF expressions, xng finds closed CNF expressions and xug find closed DNF expressions.
    • Charm-L: this code can mine all minimal and closed and-clauses, i.e., all minimal and closed frequent itemsets.
  • Relevant Publications
    • Lizhuang Zhao, Mohammed J. Zaki and Naren Ramakrishnan, BLOSOM: A Framework for Mining Arbitrary Boolean Expressions. In 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Aug 2006. (PDF) (BibTeX)
    • Mohammed J. Zaki and Naren Ramakrishnan and Lizhuang Zhao, Mining Frequent Boolean Expressions: Application to Gene Expression and Regulatory Modeling. International Journal of Knowledge Discovery in Bioinformatics, 1(3):68-96. Sep 2010. (PDF) (BibTeX)

Sampling Minimal Boolean Expressions (minDNF)

The minDNF method samples minimal boolean expressions in DNF.

  • Download
  • Relevant Publications
    • Geng Li and Mohammed J. Zaki, Sampling Minimal Frequent Boolean (DNF) Patterns. In 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Aug 2012. (PDF) (BibTeX)

↑ Back to Table of Contents



Sequence Mining


Frequent Sequences (Spade)

Spade uses the vertical format for mining the set of all frequent sequences from a dataset of many sequences. It also mines the frequent sequences of itemsets.

  • Download
  • Relevant Publications
    • Mohammed J. Zaki, SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning Journal, 42(1/2):31-60. Jan/Feb 2001. (PDF) (BibTeX)]
    • Mohammed J. Zaki, Efficient Enumeration of Frequent Sequences. In 7th ACM International Conference on Information and Knowledge Management. Nov 1998. (PDF) (BibTeX)]



Sequence Constraints (cSpade)

cSpade mines constrained frequent sequences. The constraints can take the form of length or width limitations on the sequences, minimum or maximum gap constraints on consecutive sequence elements, applying a time window on allowable sequences, incorporating item constraints, and finding sequences predictive of one or more classes. The class specific sequences can be used for sequence classification as described in (2000-featuremine:is).

  • Download
    • cSpade code
    • cSpade (win) code: same as above, but for the Windows platform (Courtesy: Daniel Diaz, University of Paris 1 - Pantheon Sorbonne).
    • aRulesSequences (R): R package that contains the cSPADE code (Courtesy: Christian Buchta and Michael Hahsler, Vienna University of Economics and Business Administration).
    • You also need to download utilities like getconf, exttpose from Utilities section.
  • Relevant Publications
    • Mohammed J. Zaki, Sequences Mining in Categorical Domains: Incorporating Constraints. In 9th ACM International Conference on Information and Knowledge Management. Nov 2000. (PDF) (BibTeX)
    • Neal Lesh, Mohammed J. Zaki and Mitsunori Ogihara, Scalable Feature Mining for Sequential Data. IEEE Intelligent Systems and their Applications, 15(2):48-56. Mar/Apr 2000. (PDF) (BibTeX)



PRISM Algorithm: Sequence Mining

PRISM mines frequent sequences using the block-encoding approach as described in (2010-prism:jcss).

  • Download
  • Relevant Publications
    • Karam Gouda, Mosab Hassaan and Mohammed J. Zaki, PRISM: An Effective Approach for Frequent Sequence Mining via Prime-Block Encoding. Journal of Computer and Systems Sciences, 76(1):88-102. Feb 2010. (PDF) (BibTeX)
    • Karam Gouda, Mosab Hassaan and Mohammed J. Zaki, PRISM: A Prime-Encoding Approach for Frequent Sequence Mining. In 7th IEEE International Conference on Data Mining. Oct 2007. (PDF) (BibTeX)

↑ Back to Table of Contents



Itemset and Sequence Utilities

Provides the utilities needed for Eclat, Charm/Charm-L, GenMax, Spade and cSpade.

  • Download
    • Utilities code: provides exttpose, getconf, and makebin
    • Utilities (win) code: same as above, but for the Windows platform (Courtesy: Daniel Diaz, University of Paris 1 - Pantheon Sorbonne).

↑ Back to Table of Contents



Tree Mining


TreeMiner and TreeMiner-D

Treeminer uses the vertical scope-list representation to mine frequent sequences (2005-treeminer:tkde). Treeminer counts all embeddings, whereas Treeminer-D counts only distinct occurrences, which can be more appropriate for some datasets.

  • Download
  • Relevant Publications
    • Mohammed J. Zaki, Efficiently Mining Frequent Trees in a Forest: Algorithms and Applications. IEEE Transactions on Knowledge and Data Engineering, 17(8):1021-1035. Aug 2005. (PDF) (BibTeX)
    • Mohammed J. Zaki, Efficiently Mining Frequent Trees in a Forest. In 8th ACM SIGKDD International Conference Knowledge Discovery and Data Mining. Jul 2002. (PDF) (BibTeX)



SLEUTH

SLEUTH extends the TreeMiner methodology to mine all frequent embedded or induced as well as ordered or unordered tree patterns.

  • Download
  • Relevant Publications
    • Mohammed J. Zaki, Efficiently Mining Frequent Embedded Unordered Trees. Fundamenta Informaticae, 66(1-2):33-52. Mar/Apr 2005. (PDF) (BibTeX)



XML Classification (XRules)

XRules uses frequent tree mining to mine discriminative patterns from tree-structured XML documents.

  • Download
  • Relevant Publications
    • Mohammed J. Zaki and Charu C. Aggarwal, Xrules: An effective structural classifier for XML data. Machine Learning Journal, 62(1-2):137-170. Feb 2006. (PDF) (BibTeX)

↑ Back to Table of Contents



Graph Mining & Indexing


Representative Orthogonal Graph Mining (Origami)

Origami uses random walks over the graph partial order to mine a representative sample of the maximal frequent subgraph patterns. It next selects only the orthogonal representative patterns via a local clique-finding method.

  • Download
  • Relevant Publications
    • Vineet Chaoji, Mohammad Al Hasan, Saeed Salem , Jeremy Besson and Mohammed J. Zaki, ORIGAMI: A Novel and Effective Approach for Mining Representative Orthogonal Graph Patterns. Statistical Analysis and Data Mining, 1(2):67-84. Jun 2008. (PDF) (BibTeX)
    • Mohammad Hasan, Vineet Chaoji, Saeed Salem, Jeremy Besson and Mohammed J. Zaki, ORIGAMI: Mining Representative Orthogonal Graph Patterns. In 7th IEEE International Conference on Data Mining. Oct 2007. (PDF) (BibTeX)



Graph Pattern Sampling (Output Space Sampling)

Whereas Origami can mine a sample of maximal graph patterns it does not provide any uniformity guarantee. MUSK (2009-musk) proposes a Markov Chain Monte Carlo based approach to guarantee a uniform sample of all maximal patterns. The MCMC approach was further extended in (2009-graphsampling) to mine a sample of all frequent patterns, to mine support-biased patterns and also to mine a sample of discriminative patterns.

  • Download
  • Relevant Publications
    • Mohammad Al Hasan and Mohammed J. Zaki, MUSK: Uniform Sampling of k Maximal Patterns. In 9th SIAM International Conference on Data Mining. Apr 2009. (PDF) (BibTeX)
    • Mohammad Al Hasan and Mohammed J. Zaki, Output Space Sampling for Graph Patterns. Proceedings of the VLDB Endowment (35th International Conference on Very Large Data Bases), 2(1):730-741. 2009. (PDF) (BibTeX)



]
Scalable Graph Reachability Indexing (GRAIL)

GRAIL uses random multiple interval labelings, and a variety of optimizations to perform rapid reachability testing in very large graphs (with millions of nodes and edges).

  • Download
  • Relevant Publications
    • Hilmi Yildirim, Vineet Chaoji and Mohammed J. Zaki, GRAIL: A Scalable Index for Reachability Queryies in Very Large Graphs. The VLDB Journal, 21(4):509-534. Aug 2012. ((URL)) (PDF) (BibTeX)
    • Hilmi Yildirim, Vineet Chaoji and Mohammed J. Zaki, GRAIL: Scalable Reachability Index for Large Graphs. Proceedings of the VLDB Endowment (36th International Conference on Very Large Data Bases), 3(1):276-284. 2010. (PDF) (BibTeX)



↑ Back to Table of Contents



Clustering

This section contains code for mining categorical subspace clusters, shape based clusters, for clustering based on a lower bound on similarity, and a new outlier-based approach for initial cluster seed selection.

Interesting Subspace Mining (SCHISM)

SCHISM finds interesting subspace clusters.

  • Download
  • Relevant Publications
    • Karlton Sequeira and Mohammed J. Zaki, SCHISM: A New Approach for Interesting Subspace Mining. In 4th IEEE International Conference on Data Mining. Nov 2004. (PDF) (BibTeX)
    • Karlton Sequeira and Mohammed J. Zaki, SCHISM: A New Approach to Interesting Subspace Mining. International Journal of Business Intelligence and Data Mining, 1(2):137-160. 2005. (PDF) (BibTeX)



Categorical Subspace Clustering (CLICKS)

CLICKS finds subspace clusters in categorical data using a k-partite clique mining approach.

  • Download
  • Relevant Publications
    • Mohammed J. Zaki, Markus Peters, Ira Assent and Thomas Seidl, CLICKS: An Effective Algorithm for Mining Subspace Clusters in Categorical Datasets. Data and Knowledge Engineering, 60(1):51-70. Jan 2007. (PDF) (BibTeX)



Arbitrary Shape Clustering (Sparcl)

Sparcl finds shape-based clusters. It uses a two step approach: in the first step we select a relatively large number of candidate centroids (via ROBIN) to find seed clusters via the K-means algorithm and in the second step we use a novel similarity kernel to merge the initial seed clusters to yield the final arbitrary shaped clusters.

  • Download
  • Relevant Publications
    • Vineet Chaoji, Mohammad Al Hasan, Saeed Salem and Mohammed J. Zaki, SPARCL: An Effective and Efficient Algorithm for Mining Arbitrary Shape-based Clusters. Knowledge and Information Systems, 21(2):201-229. Nov 2009. (PDF) (BibTeX)
    • Vineet Chaoji, Mohammad Al Hasan, Saeed Salem and Mohammed J. Zaki, SPARCL: Efficient and Effective Shape-based Clustering. In 8th IEEE International Conference on Data Mining. Dec 2008. (PDF) (BibTeX)



K-means Initialization (Robin)

ROBIN uses a new local outllier factor based initial seed selection to improve k-means style clustering.

  • Download
  • Relevant Publications
    • Mohammad Al Hasan, Vineet Chaoji, Saeed Salem and Mohammed J. Zaki, Robust Partitional Clustering by Outlier and Density Insensitive Seeding. Pattern Recognition Letters, 30(11):994-1002. Aug 2009. (PDF) (BibTeX)



Microarray Gene Expression Clustering (TriCluster and MicroCluster)

Tricluster is the first tri-clustering algorithm for microarray expression clustering. It builds upon the new microCluster bi-clustering approach. Tricluster first mines all the bi-clusters across the gene-sample slices, and then it extends these into tri-clusters across time or space (depending on the third dimension). It can find both scaling and shifting patterns.

  • Download
  • Relevant Publications
    • Lizhuang Zhao and Mohammed J. Zaki, TriCluster: An Effective Algorithm for Mining Coherent Clusters in 3D Microarray Data. In ACM SIGMOD Conference on Management of Data. Jun 2005. (PDF) (BibTeX)
    • Lizhuang Zhao and Mohammed J. Zaki, MicroCluster: An Efficient Deterministic Biclustering Algorithm for Microarray Data. IEEE Intelligent Systems, 20(6):40-49. Nov/Dec 2005. (PDF) (BibTeX)

↑ Back to Table of Contents



Biological Sequence Analysis

This section contains code for sequence modeling via Hidden Markov Models, code for structured motif extraction and search, and genome-scale disk-based suffix tree indexing.

Hidden Markov Models: Variable Order HMM with Duration (VOGUE)

VOGUE is a variable order and gapped HMM with with duration. It uses sequence mining to extract frequent patterns in the data. It then uses these patterns to build a variable order HMM with explicit duration on the gap states, for sequence modeling and classification. VOGUE was applied to model protein sequences, as well as a number of other sequence datasets including weblogs.

  • Download
  • Relevant Publications
    • Mohammed J. Zaki, Christopher D. Carothers and Boleslaw K. Szymanski, VOGUE: A Variable Order Hidden Markov Model with Duration based on Frequent Sequence Mining. ACM Transactions on Knowledge Discovery in Data, 4(1):Article 5. Jan 2010. (PDF) (BibTeX)



Genome Scale Indexing: Disk-based Suffix Trees (Trellis and Trellis+)

Trellis is a disk-based suffix tree indexing methods (with suffix links) that is capable of indexing the entire human genome on a commodity PC with limited memory. Trellis+ extends Trellis by further removing some memory limitations by using a novel guide suffix tree in memory.

  • Download
  • Relevant Publications
    • Benjarath Phoophakdee and Mohammed J. Zaki, Genome-scale Disk-based Suffix Tree Indexing. In ACM SIGMOD International Conference on Management of Data. Jun 2007. (PDF) (BibTeX)
    • Benjarath Phoophakdee and Mohammed J. Zaki, TRELLIS+: An Effective Approach for Indexing Genome-Scale Sequences using Suffix Trees. In 13th Pacific Symposium on Biocomputing. Jan 2008. (PDF) (BibTeX)



Structured Sequence Motifs: Search and Extraction (sMotif and exMotif)

sMotif and exMotif are two complementary for searching and extracting/mining structured sequence motifs DNA sequences. A structured motif consists of simple motifs separated by different gap lengths. The simple motif may be a simple pattern or a position weighted matrix or profile. Given a template structured motif (pattern or profile), sMotif finds all matches in a given set of sequences. On the other hand, exMotif mines novel motifs matching some minimal conditions on the gaps and frequency.

  • Download
  • Relevant Publications
    • Yongqiang Zhang and Mohammed J. Zaki, EXMOTIF: efficient structured motif extraction. Algorithms for molecular biology, 1(21). Nov 2006. ((URL)) (PDF) (BibTeX)
    • Yongqiang Zhang and Mohammed J. Zaki, SMOTIF: efficient structured pattern and profile motif search. Algorithms for molecular biology, 1(22). Nov 2006. ((URL)) (PDF) (BibTeX)

↑ Back to Table of Contents



Protein Structure


Flexible and Non-sequential Protein Structure Alignment (SNAP/STSA and FlexSnap)

SNAP finds non-sequential 3D protein structure alignments. It was initially called STSA (2008-snap). FlexSnap allows the ability to find both flexible and non-sequential allignments.

  • Download
  • Relevant Publications
    • Saeed Salem, Mohammed J. Zaki and Chris Bystroff, Iterative Non-Sequential Protein Structural Alignment. Journal of Bioinformatics and Computational Biology, 7(3):571-596. Jun 2009. (PDF) (BibTeX)
    • Saeed Salem, Mohammed J. Zaki and Chris Bystroff, FlexSnap: Flexible Non-Sequential Protein Structure Alignment. In 9th Workshop on Algorithms in Bioinformatics. Sep 2009. (PDF) (BibTeX)



Protein Docking and Partial Shape Matching (ContextShapes)

ContextShapes does rigid-body protein docking. It uses a novel contextshapes data structure to represent local surface regions/shapes on the protein. All critical points on both the receptor and ligand are represented via context shapes, and the best docking is found via pair-wise matching.

  • Download
  • Relevant Publications
    • Zujun Shentu, Mohammad Al Hasan, Chris Bystroff and Mohammad J. Zaki, Context Shapes: Efficient Complementary Shape Matching for Protein-Protein Docking. Proteins: Structure, Function and Bioinformatics, 70(3):1056-1073. Feb 2008. (PDF) (BibTeX)



Protein Indexing (PSIST)

PSIST uses suffix trees to index protein 3D structure. It first converts the 3D structure into a structure-feature sequence over a new structural alphabet, which is then used to index protein structures. The PSIST index makes it very fast to query for a matching structural fragment.

  • Download
  • Relevant Publications
    • Feng Gao and Mohammed J. Zaki, PSIST: A Scalable Approach to Indexing Protein Structures using Suffix Trees. Journal of Parallel and Distributed Computing, 68(1):55-63. Jan 2008. (PDF) (BibTeX)



Protein Folding Pathways (UNFOLD)

UNFOLD uses a recursive min-cut on a weighted secondary structure element graph to predict the sequence of protein (un)folding events.

  • Download
  • Relevant Publications
    • Mohammed J. Zaki, Vinay Nadimpally, Deb Bardhan and Chris Bystroff, Predicting protein folding pathways. Bioinformatics, 20(1):i386-i393. Aug 2004. (PDF) (BibTeX)

↑ Back to Table of Contents



Real and Synthetic Datasets

The section contains various synthetic and real datasets used in some of the papers related to itemset, sequence, tree and XML mining.

↑ Back to Table of Contents