3rd Workshop on High Performance Data Mining
Friday, May 5th, 2000, Cancun, Mexico
in conjunction with
International Parallel and Distributed Processing Symposium 2000 (IPDPS'00)

The explosive growth of data collection in business and scientific fields has created a pressing need to analyze the data and mine useful knowledge from it. Data mining refers to the entire process of extracting useful and novel patterns/models from large datasets. Due to the huge size of the data and the amount of computation involved in data mining, high-performance computing is an essential component of any successful large-scale data mining application. This workshop will provide a forum for presenting recent results in high-performance computing for data mining, including applications, algorithms, software, and systems. "High-performance" should be interpreted broadly to include scalable sequential as well as parallel and distributed algorithms and systems.
Relevant topics include (but are not limited to):
  1. scalable and/or parallel/distributed algorithms for various mining tasks like classification, clustering, sequences, associations, trend and deviation detection, etc.
  2. methods for pre/post-processing like feature extraction and selection, discretization, rule pruning, model scoring, etc.
  3. frameworks for KDD systems, and parallel or distributed mining.
  4. integration issues with databases and data warehouses.
Workshop proceedings have been published in LNCS Vol. 1800, Springer-Verlag, 2000.

Online papers from this workshop are also available at http://ipdps.cc.gatech.edu/2000/datamine/index.html.


Attendees are required to register for the main symposium; no separate registration is needed for the workshop. Please see http://www.ippsxx.org for IPDPS'00 registration information.

Preliminary Program:

8:15-8:30am, Opening Remarks

8:30-9:15am, Invited Talk, Robert Hollebeek, University of Pennsylvania
Professor Robert Hollebeek will present three examples of large-scale data-intensive computing applications that combine large-scale storage, parallel mining, and distributed networking. These include a radiology storage infrastructure for the National Library of Medicine Next Generation Internet program, a digital government database for Census and demographic data in the City of Philadelphia, and a parallel networking project using high-speed optical networks to enable distributed parallel data computing. Professor Hollebeek is a Professor of Physics at the University of Pennsylvania and co-founder of the National Scalable Cluster Project (NSCP). The talk will end with lessons on large-scale data mining learned from the experience of NSCP.

9:15-9:40am, Implementation issues in the design of I/O intensive data mining applications on clusters of workstations,
R. Baraglia, D. Laforenza, S. Orlando, O. Palmerini, R. Perego, CNR and Universita Ca' Foscari di Venezia, Italy

9:40-10:05am, The parallelization of a knowledge discovery system with hypergraph representation,
J. Seitzer, J. P. Buckley, Y. Pan, L. A. Adams, U. of Dayton, USA

10:05-10:30am, Coffee Break

10:30-10:55am, Scalable parallel clustering for data mining on multicomputers,
D. Foti, D. Lipari, C. Pizzuti, D. Talia, ISI-CNR, UNICAL, Italy

10:55-11:20am, A requirement analysis for parallel KDD systems,
W. A. Maniatty, M. J. Zaki, SUNY Albany and RPI, USA

11:20-11:50am, Parallel data mining on ATM-connected PC cluster and optimization of its execution environment,
M. Oguchi and M. Kitsuregawa, U. Tokyo, Japan

11:50am-1:00pm, Lunch Break

1:00-1:45pm, Invited Talk: Seeking Parallelism in Data Mining Techniques, Domenico Talia, ISI-CNR, Italy
Abstract: Data mining is the automated analysis of large volumes of data, looking for relationships and knowledge that are implicit in the data and are 'interesting' in the sense of impacting an organization's practice. Data mining and knowledge discovery on large amounts of data can benefit from the use of parallel computers to improve both performance and the quality of data selection. When data mining tools are implemented on high-performance parallel computers, they can analyze massive databases in a reasonable time. Faster processing also means that users can experiment with more models to understand complex data. High performance makes it practical for users to analyze greater quantities of data.

This talk analyzes and presents different forms of parallelism that can be exploited in data mining techniques and algorithms. The main goal of the talk is to discuss data mining techniques on parallel architectures and to show how large-scale data mining and knowledge discovery applications can be made scalable by using the systems, tools, and performance offered by parallel processing systems. For each data mining technique (such as rule induction, clustering algorithms, decision trees, genetic algorithms, neural networks, etc.), the possible ways to exploit parallelism are presented and discussed in detail. Finally, the talk outlines current research issues in high-performance data mining and discusses perspectives in this area.

1:45-2:10pm, Parallelisation of C4.5 as a particular divide and conquer computation,
P. Becuzzi, M. Coppola, S. Ruggieri and M. Vanneschi, Italy

2:10-2:35pm, Exploiting dataset similarity for distributed mining,
S. Parthasarathy and M. Ogihara, U. of Rochester, USA

2:35-3:00pm, Parallel data mining of Bayesian networks from telecommunications network data,
R. Sterritt, K. Adamson, C. M. Shapcott, E. P. Curran, University of Ulster, UK

3:00-3:10pm, Closing Remarks

Cancelled Talk, Scalable model for extensional and intensional descriptions of unclassified data,
H. A. Prado, S. C. Hirtle, P. M. Engel, Catholic U. of Brazil, U. Pittsburgh and Federal U. Rio Grande, Brazil

Paper Submission:

The workshop will feature contributed papers and invited papers in an informal setting. To submit a paper for consideration, send 4 copies of the manuscript to Mohammed Zaki (zaki.AT.cs.rpi.edu). Electronic submissions (postscript versions printable on 8.5 x 11 paper only) are strongly encouraged. To guarantee consideration, manuscripts must be received by Dec. 1, 1999, and must be no more than 5 pages (single spaced, at least 10pt font) excluding figures, tables, and references. In the spirit of the workshop, submission of works in progress is encouraged as well.

Important Dates:

Papers Due: December 1st, 1999
Acceptance Notification: January 17th, 2000
Camera Ready Papers Due: January 28th, 2000

Workshop Chairs:

Mohammed J. Zaki
Rensselaer Polytechnic Institute

Vipin Kumar
University of Minnesota

David Skillicorn
Queen's University, Canada

Program Committee:

Philip K. Chan, Florida Institute of Technology
Alok Choudhary, Northwestern University
Umeshwar Dayal, Hewlett-Packard Labs.
Alex A. Freitas, PUC-PR (Pontifical Catholic University of Parana), Brazil
Ananth Grama, Purdue University
Robert Grossman, University of Illinois-Chicago
Yike Guo, Imperial College, UK
Jiawei Han, Simon Fraser University, Canada
Howard Ho, IBM Almaden
Chandrika Kamath, Lawrence Livermore National Lab
Masaru Kitsuregawa, University of Tokyo, Japan
Sanjay Ranka, University of Florida
Vineet Singh, Hewlett-Packard Labs.
Domenico Talia, ISI-CNR, Rende, Italy
Kathryn Thornton, University of Plymouth, UK
