Randomized Maximal Graph Mining

Coming soon!

Data Mining Template Library (DMTL)

DMTL is an open-source toolkit of frequent pattern mining algorithms. It is developed in C++ with extensive use of generic programming concepts.
It provides a collection of generic algorithms and data structures for mining increasingly complex and informative pattern types, such as itemsets, sequences, trees, and graphs. DMTL takes a generic approach to data mining in which all aspects of mining are controlled through a set of properties: the type of pattern to mine, the mining strategy to use, and the data types and formats to mine over are all specified as a list of properties. This makes the toolkit highly customizable for different applications.
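
To illustrate the property-driven idea, here is a minimal sketch of how a pattern type and a counting strategy might be supplied as template parameters. Every name in it (`ItemsetPattern`, `ScanCounting`, `Miner`) is hypothetical and chosen for illustration; this is not DMTL's actual interface.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical names for illustration only; this is not DMTL's actual API.

// Pattern property: an itemset is a sorted set of integer item ids.
struct ItemsetPattern {
    using value_type = std::set<int>;

    // A pattern is contained in a transaction if every item appears in it.
    static bool contained_in(const value_type& p, const value_type& t) {
        return std::includes(t.begin(), t.end(), p.begin(), p.end());
    }
};

// Strategy property: count support by scanning the whole database.
struct ScanCounting {
    template <class Pattern>
    static std::size_t support(
        const typename Pattern::value_type& p,
        const std::vector<typename Pattern::value_type>& db) {
        std::size_t n = 0;
        for (const auto& t : db)
            if (Pattern::contained_in(p, t)) ++n;
        return n;
    }
};

// The miner is assembled from the properties given as template parameters.
template <class Pattern, class Counting>
struct Miner {
    using value_type = typename Pattern::value_type;

    // Keep only the candidates whose support reaches min_support.
    static std::vector<value_type> frequent(
        const std::vector<value_type>& candidates,
        const std::vector<value_type>& db,
        std::size_t min_support) {
        assert(min_support > 0);
        std::vector<value_type> out;
        for (const auto& c : candidates)
            if (Counting::template support<Pattern>(c, db) >= min_support)
                out.push_back(c);
        return out;
    }
};
```

Under these assumptions, `Miner<ItemsetPattern, ScanCounting>::frequent(candidates, db, 2)` filters itemset candidates, and swapping in a different pattern or counting property changes the behavior without touching `Miner` itself.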

Version 1.0 of DMTL can be downloaded from SourceForge. DMTL has been downloaded 1650 times (as of 11/01/2006).

Working on DMTL has exposed me to some tricky issues with template-based programming. Along the way, I also designed and developed a custom allocator (something that's not done very often, in my opinion). Thanks to the lack of good support for debugging templated code, I have come to decipher verbose g++ messages fairly well :-)
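
For readers who have not written one, the following is a minimal sketch of what a custom allocator can look like in C++11: an arena that hands out memory sequentially, wrapped so standard containers can use it. The names `Arena` and `ArenaAllocator` are hypothetical, and this is not the allocator that ships with DMTL.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical illustration; not the allocator used in DMTL.
// A fixed-size arena: memory is handed out sequentially and released all at once.
class Arena {
public:
    explicit Arena(std::size_t bytes) : buf_(bytes), used_(0) {}

    // Offsets are aligned relative to the buffer start, which is sufficient
    // for alignments up to the default alignment of operator new.
    void* allocate(std::size_t bytes, std::size_t align) {
        assert(align != 0 && (align & (align - 1)) == 0);  // power of two
        std::size_t start = (used_ + align - 1) & ~(align - 1);
        if (start + bytes > buf_.size()) throw std::bad_alloc();
        used_ = start + bytes;
        return buf_.data() + start;
    }

    std::size_t used() const { return used_; }

private:
    std::vector<unsigned char> buf_;
    std::size_t used_;
};

// Minimal allocator interface so standard containers can draw from an Arena.
template <class T>
class ArenaAllocator {
public:
    using value_type = T;

    explicit ArenaAllocator(Arena& a) noexcept : arena_(&a) {}

    template <class U>
    ArenaAllocator(const ArenaAllocator<U>& other) noexcept
        : arena_(other.arena()) {}

    T* allocate(std::size_t n) {
        return static_cast<T*>(arena_->allocate(n * sizeof(T), alignof(T)));
    }

    // Individual blocks are never returned; memory is freed with the arena.
    void deallocate(T*, std::size_t) noexcept {}

    Arena* arena() const noexcept { return arena_; }

private:
    Arena* arena_;
};

// Two allocators are interchangeable exactly when they share an arena.
template <class T, class U>
bool operator==(const ArenaAllocator<T>& a, const ArenaAllocator<U>& b) noexcept {
    return a.arena() == b.arena();
}

template <class T, class U>
bool operator!=(const ArenaAllocator<T>& a, const ArenaAllocator<U>& b) noexcept {
    return !(a == b);
}
```

A container opts in through its allocator parameter, for example `std::vector<int, ArenaAllocator<int>> v(alloc);` where `alloc` is an `ArenaAllocator<int>` built from an `Arena` that outlives the vector.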

Mohammad Hasan, Vineet Chaoji, Saeed Salem, Nagender Parimi, and Mohammed Zaki. DMTL: A Generic Data Mining Template Library. In Workshop on Library-Centric Software Design (LCSD'05), held with the Object-Oriented Programming, Systems, Languages and Applications (OOPSLA'05) conference, San Diego, California, October 2005.

Disk Based Suffix-tree Construction

This is a fairly simple C++ implementation of a file-based suffix tree. The algorithm is based on the paper: Sandeep Tata, Richard A. Hankins, Jignesh M. Patel. Practical Suffix Tree Construction. VLDB 2004.

Although its performance does not match that reported in the paper, the implementation is quite simple and modular. Persistence is achieved using memory-mapped I/O. The code compiles on Linux with g++ 3.3.
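
As a rough illustration of persistence through memory-mapped I/O, the sketch below maps a file of fixed-size records into memory with POSIX `mmap`, so that writes to the array end up in the file. The `Node` layout and the `MappedNodes` class are assumptions made for this example, not the layout or classes of the actual implementation, and the code is written in C++11 rather than for g++ 3.3.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical on-disk node layout; the real layout is not described here.
struct Node {
    int parent;
    int first_child;
    int next_sibling;
    int edge_start;
    int edge_len;
};

// Maps a file holding `count` nodes into memory (POSIX only).
class MappedNodes {
public:
    MappedNodes(const std::string& path, std::size_t count)
        : count_(count), bytes_(count * sizeof(Node)), fd_(-1), nodes_(nullptr) {
        assert(count > 0);
        fd_ = ::open(path.c_str(), O_RDWR | O_CREAT, 0644);
        if (fd_ < 0) throw std::runtime_error("open failed");
        // Size the file to hold all nodes; newly added bytes read as zero.
        if (::ftruncate(fd_, static_cast<off_t>(bytes_)) != 0) {
            ::close(fd_);
            throw std::runtime_error("ftruncate failed");
        }
        // MAP_SHARED makes writes to the mapping visible in the file.
        void* p = ::mmap(nullptr, bytes_, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd_, 0);
        if (p == MAP_FAILED) {
            ::close(fd_);
            throw std::runtime_error("mmap failed");
        }
        nodes_ = static_cast<Node*>(p);
    }

    ~MappedNodes() {
        ::msync(nodes_, bytes_, MS_SYNC);  // flush dirty pages to disk
        ::munmap(nodes_, bytes_);
        ::close(fd_);
    }

    MappedNodes(const MappedNodes&) = delete;
    MappedNodes& operator=(const MappedNodes&) = delete;

    Node& operator[](std::size_t i) {
        assert(i < count_);
        return nodes_[i];
    }

    std::size_t size() const { return count_; }

private:
    std::size_t count_;
    std::size_t bytes_;
    int fd_;
    Node* nodes_;
};
```

With this wrapper, a node written in one run (for example `m[2].edge_len = 7;`) can be read back by constructing a `MappedNodes` over the same file later, with no explicit serialization step.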

The code can be downloaded from here.

Text Classification Framework

As part of my master's thesis, I developed a text classification framework. The framework can tokenize and stem the word stream. Keywords can be eliminated using a stop-word list or a TF-IDF-based threshold, and further feature selection can then be performed. The code includes a Naive Bayes classifier in a co-training setting.
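
The keyword-elimination step can be sketched as follows. The example is in C++ (the framework itself is in Java) and rests on assumptions of mine: whitespace tokenization stands in for the tokenizer and stemmer, TF is the term count divided by document length, IDF is the natural log of (number of documents / document frequency), and a term is kept if it reaches the threshold in at least one document. The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

using Doc = std::vector<std::string>;

// Split on whitespace (a stand-in for a real tokenizer and stemmer).
Doc tokenize(const std::string& text) {
    std::istringstream in(text);
    Doc out;
    for (std::string w; in >> w;) out.push_back(w);
    return out;
}

// Return the terms that are not stop words and whose TF-IDF score
// reaches min_tfidf in at least one document (an assumed criterion).
std::set<std::string> select_terms(const std::vector<Doc>& docs,
                                   const std::set<std::string>& stop_words,
                                   double min_tfidf) {
    assert(!docs.empty());

    // Document frequency: the number of documents containing each term.
    std::map<std::string, std::size_t> df;
    for (const auto& d : docs) {
        std::set<std::string> seen(d.begin(), d.end());
        for (const auto& w : seen) ++df[w];
    }

    std::set<std::string> kept;
    for (const auto& d : docs) {
        if (d.empty()) continue;

        // Raw term counts within this document.
        std::map<std::string, std::size_t> tf;
        for (const auto& w : d) ++tf[w];

        for (const auto& kv : tf) {
            if (stop_words.count(kv.first)) continue;
            double tf_norm = static_cast<double>(kv.second) / d.size();
            double idf = std::log(static_cast<double>(docs.size()) / df[kv.first]);
            if (tf_norm * idf >= min_tfidf) kept.insert(kv.first);
        }
    }
    return kept;
}
```

Terms that appear in every document get an IDF of zero under this definition, so with any positive threshold they are dropped even when they are missing from the stop-word list.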

The code is written in Java, but I have not looked at it for quite some time now. It can be downloaded from here.

The co-training implementation is based on the paper: Avrim Blum, Tom Mitchell. Combining Labeled and Unlabeled Data with Co-Training. COLT 1998.

Other things that I have developed (though I have either lost the code or it's in a shabby state) include:
  • Modified SVMlight to be trained incrementally. The implementation was based on the paper "G. Cauwenberghs and T. Poggio. Incremental and Decremental Support Vector Machine Learning. Advances in Neural Information Processing Systems, 2000".
  • Matlab implementation of spectral clustering based on the paper by Shi and Malik.