From Mohammed J. Zaki

Software

Disclaimer: The software on this page is provided on an as-is basis for research purposes. There is no additional support offered, nor are the author(s) or their institutions liable under any circumstances.

Acknowledgments

We gratefully acknowledge the funding from the following agencies/programs that made the research possible:






Arabesque: Distributed Graph Mining Framework


Distributed data processing platforms such as MapReduce and Pregel (Giraph, GraphLab, GraphX) have substantially simplified the design and deployment of certain classes of distributed graph analytics algorithms, such as computing PageRank, shortest paths, or connected components. However, these platforms are not a good match for graph mining problems, such as finding frequent subgraphs (FSM) in a labeled graph or finding cliques. Given an input graph, these problems require exploring a very large number of subgraphs (embeddings) and finding patterns that match some "interestingness" criteria desired by the user. The number of embeddings that the system needs to explore grows exponentially with the size of the embeddings.

Arabesque: Think like an Embedding paradigm. Our system Arabesque is the first distributed data processing platform focused on solving these graph mining problems. Arabesque automates the process of exploring a very large number of embeddings, and defines a high-level computational model (filter-process) that simplifies the development of scalable graph mining algorithms. In a nutshell, Arabesque explores embeddings and passes them to user-defined code, which must simply decide whether an embedding is of interest and should be further extended (the filter function) and whether outputs should be computed for the embeddings that pass the filter (the process function). This allows the end user to think about the problem in a natural way, Think Like an Embedding (TLE), where the embedding is a first-class citizen, rather than forcing the problem into other paradigms. Under the hood, Arabesque runs on the familiar and well-supported Apache Giraph system in a Hadoop cluster; it uses Giraph simply as a generic data processing platform that provides a BSP execution flow.
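The filter-process model can be illustrated with a small sketch. This is not the Arabesque API (which is Java on Giraph); the names `explore`, `filter_fn`, and `process_fn` are hypothetical, and the toy engine below simply expands vertex-set embeddings level by level, consulting the two user callbacks. Clique finding then reduces to a one-line filter:

```python
# Hypothetical sketch of the filter-process model: the engine owns the
# exploration; user code only filters and emits.
from itertools import combinations

def explore(graph, filter_fn, process_fn, max_size):
    """Expand embeddings (vertex sets) breadth-first up to max_size,
    applying the user-defined filter and process callbacks."""
    results = []
    frontier = [frozenset([v]) for v in graph]
    seen = set(frontier)
    for _ in range(max_size):
        next_frontier = []
        for emb in frontier:
            if not filter_fn(graph, emb):      # prune: do not extend
                continue
            results.extend(process_fn(graph, emb))
            for v in emb:                      # extend by one neighbor
                for w in graph[v]:
                    ext = emb | {w}
                    if len(ext) == len(emb) + 1 and ext not in seen:
                        seen.add(ext)
                        next_frontier.append(ext)
        frontier = next_frontier
    return results

# Clique finding expressed as filter/process:
def is_clique(graph, emb):
    return all(v in graph[u] for u, v in combinations(emb, 2))

def emit(graph, emb):
    return [tuple(sorted(emb))] if len(emb) >= 3 else []

g = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3}}
print(sorted(explore(g, is_clique, emit, 3)))  # [(1, 2, 3), (2, 3, 4)]
```

Note how the expensive, generic work (enumeration, duplicate elimination) sits in the engine, while the algorithm-specific code is two tiny functions.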

Elegant but above all Efficient. Arabesque is unique in that, despite its simple and elegant API and its distributed implementation, it is extremely efficient. Indeed, in our SOSP 2015 paper we show that single-threaded Arabesque provides performance comparable to specialized centralized implementations. This is possible because the user-defined functions are not the computationally expensive parts of graph mining algorithms. The expensive steps, mostly graph exploration, isomorphism computations, and canonicality checks, are inherent to graph mining; they are handled by Arabesque itself and can thus be executed as efficiently as in a centralized implementation.




Data Mining Template Library (DMTL)


DMTL is an open-source, high-performance, generic data mining toolkit written in C++. It provides a collection of generic algorithms and data structures for mining increasingly complex and informative pattern types, such as itemsets, sequences, trees, and graphs.

DMTL utilizes a generic data mining approach, where all aspects of mining are controlled via a set of properties. The kind of pattern to be mined, the kind of mining approach to use, and the kind of data types and formats to mine over are all specified as a list of properties. This provides tremendous flexibility to customize the toolkit for various applications.

DMTL implements all algorithms following the vertical data mining approach. For itemsets, the implementation follows the Eclat approach (without diffsets). For sequences it follows the Spade approach. For trees it implements the SLEUTH framework, which allows one to mine embedded/induced and ordered/unordered trees. Finally, the graph mining framework uses a novel vertical approach.
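The vertical approach shared by these algorithms can be sketched in a few lines (illustrative Python, not DMTL's C++ API): each item maps to its tidset, the set of transaction ids containing it, and the support of any itemset is just the size of the intersection of its members' tidsets, with no database rescans.

```python
# Minimal sketch of the vertical (tidset) representation.
transactions = {1: {'a', 'b', 'c'}, 2: {'a', 'c'},
                3: {'b', 'c'}, 4: {'a', 'b', 'c'}}

# Invert the horizontal database into per-item tidsets.
tidsets = {}
for tid, items in transactions.items():
    for item in items:
        tidsets.setdefault(item, set()).add(tid)

def support(itemset):
    """Support = size of the intersection of the members' tidsets."""
    tids = set.intersection(*(tidsets[i] for i in itemset))
    return len(tids)

print(support({'a', 'c'}))  # {a,c} occurs in tids {1, 2, 4} -> 3
```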




Itemset Mining and Association Rules

This section contains the code for mining all frequent itemsets, all closed itemsets, and all maximal frequent itemsets. It also includes the code for constructing the concept lattice (or iceberg lattice) and for generating non-redundant association rules.

Frequent Itemsets (Eclat)

Eclat uses the original vertical tidset approach for mining all frequent itemsets (1997-eclat), combined with the diffsets improvement (2003-diffsets). The code also contains the maxeclat, clique, and maxclique approaches mentioned in (2000-eclat:tkde).
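The diffsets idea can be shown with a toy example (a sketch under assumed tidsets, not the actual Eclat code): instead of storing the full tidset t(PX) for an extension PX of prefix P, one stores only the difference d(PX) = t(P) - t(X), so that support(PX) = support(P) - |d(PX)|. Diffsets shrink rapidly as patterns grow, which is where the memory savings come from.

```python
# Toy tidsets (assumed data for illustration).
t = {'a': {1, 2, 4, 5}, 'b': {1, 2, 3, 4}}

sup_a = len(t['a'])        # support of prefix {a} = 4
d_ab = t['a'] - t['b']     # diffset of ab relative to a = {5}
sup_ab = sup_a - len(d_ab) # 4 - 1 = 3

# Same answer as the plain tidset intersection:
print(sup_ab, len(t['a'] & t['b']))  # 3 3
```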



Closed Itemsets (Charm, Charm-L) and Non-redundant Rule Generation

Charm mines all the frequent closed itemsets as described in (2002-charm). Charm-L adds the ability to construct the entire frequent concept lattice (also called the iceberg lattice); that is, it adds the links between all sub-/super-concepts (closed itemsets) (2005-charm:tkde). This ability is used to mine the non-redundant association rules (2004-nonredundant:dmkd).
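The closure operator underlying closed-itemset mining is easy to state (the sketch below is illustrative and omits Charm's actual tidset-based pruning): the closure of an itemset X is the set of items common to every transaction containing X, and X is closed exactly when it equals its own closure.

```python
# Toy database (assumed data for illustration).
transactions = [{'a', 'b', 'c'}, {'a', 'c'}, {'a', 'b', 'c', 'd'}]

def closure(itemset):
    """Items shared by every transaction that contains `itemset`."""
    covering = [t for t in transactions if itemset <= t]
    return set.intersection(*covering) if covering else set()

# {a, b} always co-occurs with c, so it is not closed:
print(sorted(closure({'a', 'b'})))  # ['a', 'b', 'c']
print(closure({'a', 'c'}) == {'a', 'c'})  # True: {a, c} is closed
```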



Maximal Itemsets (GenMax)

GenMax mines all maximal frequent itemsets via a backtracking approach with progressive focusing.
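A stripped-down backtracking sketch conveys the idea (illustrative only; it omits GenMax's progressive-focusing and tidset machinery): extend frequent itemsets depth-first, and report a set as maximal when it has no frequent extension and no already-found superset.

```python
# Toy database and threshold (assumed data for illustration).
transactions = [{'a', 'b', 'c'}, {'a', 'b'}, {'a', 'c'}, {'b', 'c'}]
items = sorted({i for t in transactions for i in t})
minsup = 2

def sup(itemset):
    return sum(1 for t in transactions if itemset <= t)

maximal = []

def backtrack(prefix, candidates):
    extended = False
    for idx, item in enumerate(candidates):
        new = prefix | {item}
        if sup(new) >= minsup:
            extended = True
            backtrack(new, candidates[idx + 1:])
    # maximal: frequent, no frequent extension, no known superset
    if prefix and not extended and not any(prefix < m for m in maximal):
        maximal.append(prefix)

backtrack(set(), items)
print(sorted(sorted(m) for m in maximal))
# [['a', 'b'], ['a', 'c'], ['b', 'c']]  ({a,b,c} has support 1 only)
```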



Frequent Boolean Expressions (BLOSOM)

The BLOSOM framework allows one to mine arbitrary frequent boolean expressions, including AND clauses (itemsets), OR clauses, and CNF/DNF expressions. It focuses on mining the minimal boolean expressions.

Sampling Minimal Boolean Expressions (minDNF)

The minDNF method samples minimal boolean expressions in DNF.




Sequence Mining


Frequent Sequences (Spade)

Spade uses the vertical format for mining the set of all frequent sequences from a dataset of many sequences. It also mines the frequent sequences of itemsets.



Sequence Constraints (cSpade)

cSpade mines constrained frequent sequences. The constraints include length or width limitations on the sequences, minimum or maximum gaps between consecutive sequence elements, a time window on allowable sequences, item constraints, and finding sequences predictive of one or more classes. The class-specific sequences can be used for sequence classification, as described in (2000-featuremine:is).
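To make one constraint type concrete, here is a sketch of checking a maximum-gap constraint on a timestamped event sequence (the function name and interface are hypothetical, not cSpade's): an occurrence of the pattern is valid only if each consecutive pair of matched elements is at most `max_gap` time units apart.

```python
def occurs_with_max_gap(pattern, events, max_gap):
    """events: list of (timestamp, item); pattern: list of items.
    True iff the pattern occurs with every consecutive gap <= max_gap."""
    def match(p_idx, last_time):
        if p_idx == len(pattern):
            return True
        for t, item in events:
            if item != pattern[p_idx]:
                continue
            if last_time is not None:
                if t <= last_time or t - last_time > max_gap:
                    continue
            if match(p_idx + 1, t):
                return True
        return False
    return match(0, None)

events = [(1, 'a'), (2, 'b'), (5, 'c'), (6, 'b'), (7, 'c')]
print(occurs_with_max_gap(['a', 'b', 'c'], events, 3))  # True  (gaps 1, 3)
print(occurs_with_max_gap(['a', 'b', 'c'], events, 2))  # False (no valid c)
```

Tightening the gap from 3 to 2 eliminates every occurrence, which is exactly how such constraints prune the search space during mining.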



PRISM Algorithm: Sequence Mining

PRISM mines frequent sequences using the block-encoding approach as described in (2010-prism:jcss).




Itemset and Sequence Utilities

Provides the utilities needed for Eclat, Charm/Charm-L, GenMax, Spade and cSpade.




Tree Mining


TreeMiner and TreeMiner-D

TreeMiner uses the vertical scope-list representation to mine frequent subtrees (2005-treeminer:tkde). TreeMiner counts all embeddings, whereas TreeMiner-D counts only distinct occurrences, which can be more appropriate for some datasets.



SLEUTH

SLEUTH extends the TreeMiner methodology to mine all frequent embedded or induced as well as ordered or unordered tree patterns.



XML Classification (XRules)

XRules uses frequent tree mining to mine discriminative patterns from tree-structured XML documents.




Graph Mining & Indexing


Representative Orthogonal Graph Mining (Origami)

Origami uses random walks over the graph partial order to mine a representative sample of the maximal frequent subgraph patterns. It then selects only the orthogonal representative patterns via a local clique-finding method.



Graph Pattern Sampling (Output Space Sampling)

Whereas Origami can mine a sample of maximal graph patterns, it does not provide any uniformity guarantee. MUSK (2009-musk) proposes a Markov chain Monte Carlo (MCMC) based approach to guarantee a uniform sample of all maximal patterns. The MCMC approach was further extended in (2009-graphsampling) to mine a sample of all frequent patterns, to mine support-biased patterns, and to mine a sample of discriminative patterns.



Scalable Graph Reachability Indexing (GRAIL)

GRAIL uses multiple random interval labelings, together with a variety of optimizations, to perform rapid reachability testing in very large graphs (with millions of nodes and edges).
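The interval-labeling idea can be sketched with a single labeling (GRAIL actually uses several random ones, and the code below is illustrative, not GRAIL's implementation): a DFS over the DAG assigns each node the interval from the lowest post-order rank reachable from it up to its own post-order rank. If v's interval is not contained in u's, then u provably cannot reach v; containment alone is only a necessary condition, so real queries fall back to a pruned DFS.

```python
# Toy DAG (assumed data for illustration).
dag = {'a': ['b', 'c'], 'b': ['d'], 'c': ['d'], 'd': []}

label = {}
counter = [0]

def dfs(u):
    """Assign u the interval [min reachable post-order rank, rank(u)]."""
    low = float('inf')
    for w in dag[u]:
        if w not in label:
            dfs(w)
        low = min(low, label[w][0])
    counter[0] += 1
    label[u] = (min(low, counter[0]), counter[0])

for node in dag:
    if node not in label:
        dfs(node)

def may_reach(u, v):
    # Interval containment: necessary (not sufficient) for reachability.
    return label[u][0] <= label[v][0] and label[v][1] <= label[u][1]

print(may_reach('a', 'd'))  # True: d's interval nests inside a's
print(may_reach('d', 'a'))  # False: ruled out with no graph traversal
```

The payoff is that most negative queries are answered from the labels alone, in constant time per labeling.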

Dynamic Graph Reachability Indexing (Dagger)

Dagger builds on top of GRAIL to handle dynamic graphs, i.e., it supports reachability in graphs with edge and node insertions and deletions.






Clustering

This section contains code for mining categorical subspace clusters and shape-based clusters, for clustering based on a lower bound on similarity, and for a new outlier-based approach to initial cluster seed selection.

Interesting Subspace Mining (SCHISM)

SCHISM finds interesting subspace clusters.



Categorical Subspace Clustering (CLICKS)

CLICKS finds subspace clusters in categorical data using a k-partite clique mining approach.



Arbitrary Shape Clustering (Sparcl)

Sparcl finds shape-based clusters using a two-step approach: in the first step, it selects a relatively large number of candidate centroids (via ROBIN) to find seed clusters with the K-means algorithm; in the second step, it uses a novel similarity kernel to merge the initial seed clusters into the final arbitrarily shaped clusters.



K-means Initialization (Robin)

ROBIN uses a new local outlier factor (LOF) based initial seed selection to improve k-means style clustering.



Microarray Gene Expression Clustering (TriCluster and MicroCluster)

TriCluster is the first tri-clustering algorithm for microarray expression clustering. It builds upon the new MicroCluster bi-clustering approach. TriCluster first mines all the bi-clusters across the gene-sample slices, and then extends these into tri-clusters across time or space (depending on the third dimension). It can find both scaling and shifting patterns.




Biological Sequence Analysis

This section contains code for sequence modeling via Hidden Markov Models, code for structured motif extraction and search, and genome-scale disk-based suffix tree indexing.

Hidden Markov Models: Variable Order HMM with Duration (VOGUE)

VOGUE is a variable-order, gapped HMM with duration. It uses sequence mining to extract frequent patterns from the data, and then uses these patterns to build a variable-order HMM with explicit duration on the gap states, for sequence modeling and classification. VOGUE was applied to model protein sequences, as well as a number of other sequence datasets, including weblogs.



Genome Scale Indexing: Disk-based Suffix Trees (Trellis and Trellis+)

Trellis is a disk-based suffix tree indexing method (with suffix links) that is capable of indexing the entire human genome on a commodity PC with limited memory. Trellis+ extends Trellis, removing further memory limitations by using a novel guide suffix tree kept in memory.



Structured Sequence Motifs: Search and Extraction (sMotif and exMotif)

sMotif and exMotif are two complementary tools for searching and extracting/mining structured sequence motifs in DNA sequences. A structured motif consists of simple motifs separated by different gap lengths; each simple motif may be a simple pattern or a position weight matrix (profile). Given a template structured motif (pattern or profile), sMotif finds all matches in a given set of sequences. exMotif, on the other hand, mines novel motifs satisfying minimal conditions on the gaps and frequency.
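A toy version of the search side of this task (illustrative only; sMotif itself handles profiles and is far more efficient than a regex scan) compiles exact-string simple motifs joined by bounded-length gaps into one regular expression over the DNA sequence:

```python
import re

def structured_motif_regex(motifs, gaps):
    """motifs: list of exact simple patterns; gaps: list of
    (min_gap, max_gap) character counts between consecutive motifs."""
    parts = [re.escape(motifs[0])]
    for (lo, hi), m in zip(gaps, motifs[1:]):
        parts.append('.{%d,%d}' % (lo, hi))  # bounded gap
        parts.append(re.escape(m))
    return re.compile(''.join(parts))

seq = 'GGTACGAATTTACGTTG'
pat = structured_motif_regex(['TACG', 'ACG'], [(2, 5)])
m = pat.search(seq)
print(m.group(0) if m else None)  # TACGAATTTACG (gap of 5: AATTT)
```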




Protein Structure


Flexible and Non-sequential Protein Structure Alignment (SNAP/STSA and FlexSnap)

SNAP finds non-sequential 3D protein structure alignments. It was initially called STSA (2008-snap). FlexSnap adds the ability to find alignments that are both flexible and non-sequential.



Protein Docking and Partial Shape Matching (ContextShapes)

ContextShapes does rigid-body protein docking. It uses a novel context-shapes data structure to represent local surface regions/shapes on the protein. All critical points on both the receptor and the ligand are represented via context shapes, and the best docking is found via pairwise matching.



Protein Indexing (PSIST)

PSIST uses suffix trees to index protein 3D structure. It first converts the 3D structure into a structure-feature sequence over a new structural alphabet, which is then used to index protein structures. The PSIST index makes it very fast to query for a matching structural fragment.



Protein Folding Pathways (UNFOLD)

UNFOLD uses a recursive min-cut on a weighted secondary structure element graph to predict the sequence of protein (un)folding events.




Real and Synthetic Datasets

This section contains various synthetic and real datasets used in some of the papers related to itemset, sequence, tree, and XML mining.

Retrieved from http://www.cs.rpi.edu/~zaki/www-new/pmwiki.php/Software/Software
Page last modified on October 05, 2016, at 10:51 PM