My work centers on data
mining, the science of discovering latent and useful knowledge in large
databases (also known as KDD - Knowledge Discovery and Data Mining). As part of
the Data Mining Template Library (DMTL) group under Prof. Zaki (see Projects
below), I am currently working on frequent pattern mining (FPM). FPM is a category
of data mining problems that deals with finding frequent patterns (viz.
itemsets, sequences, trees or graphs) in massive relational databases.
Unfortunately, FPM (and data mining in general, for that matter) is not well
documented, largely because data mining is a relatively new field that is
developing rapidly. Some useful pointers on data mining:
Frequent patterns aren't always
"interesting", but deciding which ones are is a separate problem. Our
group is working towards developing a unified and generic
framework for FPM. We currently have efficient and functional modules for
itemsets, sequences, trees and graphs as part of our library. My work in the
group has largely revolved around tree and graph mining (see references below
for relevant papers). Graph mining is particularly challenging because of the
inherent complexity of graphs, raising issues like graph isomorphism (a problem
known to be in NP, but not known to be either NP-complete or solvable in
polynomial time). We aim to have an open source
release of our library soon - watch this space!
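To make the idea of a frequent pattern concrete, here is a minimal sketch of level-wise (Apriori-style) frequent itemset mining. It is only an illustration: it is written in Python rather than C++, it is not DMTL's code or API, and the function name `frequent_itemsets`, the `min_support` parameter and the toy `baskets` data are all made up for this example.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support count} for every itemset contained in at
    least `min_support` transactions, found level by level."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count how many transactions contain each single item.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(current)

    k = 2
    while current:
        # Join frequent (k-1)-itemsets into k-item candidates and keep only
        # those whose (k-1)-subsets are all frequent (the Apriori property).
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in current for sub in combinations(union, k - 1)
            ):
                candidates.add(union)

        # Count the support of each surviving candidate against the database.
        current = {}
        for cand in candidates:
            support = sum(1 for t in transactions if cand <= t)
            if support >= min_support:
                current[cand] = support
        result.update(current)
        k += 1
    return result


# Toy market-basket database (hypothetical data, for illustration only).
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
found = frequent_itemsets(baskets, min_support=3)
```

With a support threshold of 3, the four single items bread, milk, diapers and beer are frequent, as are four pairs such as {bread, milk}; no three-item set reaches the threshold. Sequences, trees and graphs follow the same count-and-prune idea, but candidate generation and support counting become much harder, which is where issues like isomorphism checking come in.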
References - (please also check the
publications section)
Projects
- Data Mining Template Library: In
progress
Developing a generic and unified framework
for efficient frequent pattern mining, comprising itemset, sequence, tree and
graph mining components.
Resources used: C++ (using STL)
DMTL's public release
- DNA Sequence Inhomogeneity using log-odds
(course project in Algorithms in Computational Molecular Biology):
April 2004
Design and implementation of a scheme to detect inhomogeneity in long DNA
sequences using log-odds analysis.
Resources used: C++
- Network Service Design (course project in
Network Programming): April 2004
Conceptual design of a scalable, reliable and secure Instant Messaging
System (à la MSN). Read my design document here.
- Simulator of Memory Management Unit
(course project in Operating Systems):
October 2003
Development of a complete emulator of the Memory Management Unit (MMU) of an
operating system.
Resources used: C
- Characterization of Video Sequences
(Major BS Thesis): December 2002 - May 2003
The project applies artificial intelligence
concepts for information retrieval from video sequences. The video
clips are first segmented, following which information pertaining to the clips
is extracted. This information is analyzed using an intelligent system to
categorize the sequences into distinct classes.
Resources used: Matlab
- Study of Classification using Neural
Networks and Data Mining Concepts: June - July 2002
Study of classification concepts, including
prominent classification tools such as neural networks and decision trees, with
a comparative analysis of the performance of the two techniques. SLIQ, a
popular data mining algorithm, was used for decision tree classification, and
the backpropagation algorithm was used to train the neural networks.
Resources used: Matlab, C++
- Association Rule Mining: Extension of
Direct Hashing & Pruning Concepts to Quantitative Databases: March
2002
The project applied the Apriori algorithm for association rule mining and
extended it to quantitative databases using Direct Hashing and Pruning (DHP)
concepts.
Resources used: Java, Oracle
- Handwritten Character Recognition using
Neuro-Fuzzy Techniques: July - September 2001
Devising a Neuro-Fuzzy technique for
character recognition that draws on the strengths of Fuzzy Logic to overcome the
drawbacks of traditional character recognition systems that employ Artificial
Neural Networks alone.
Resources used: Matlab