"Mining petabytes of data using cloud computing and a massively parallel cyberinstrument” P. Drineas (PI), C. D. Carothers (Co-PI), A. Garcia (Co-PI), M. J. Zaki (Co-PI)

The amount of data in our world has exploded, and data now sit at the heart of modern economic activity, innovation, and growth. Big data play an increasingly important economic role, and their business and economic implications are critical issues that policy makers and world leaders must take into account. Understanding and making effective use of truly massive data, e.g., at the petabyte (PB) scale, is becoming essential in science, engineering, and business, and is a national priority for the United States research agenda. The main objective of this proposal is the design, analysis, and implementation of a number of fundamental matrix-mining and graph-mining operations that are scalable to petabyte-sized inputs. Such efforts are intended to sustain the phenomenal growth in analyzing, visualizing, and extracting information from massive matrices and graphs.

To achieve these goals, the PIs will implement algorithms on a massively parallel machine with access to approximately 1.2 PB of storage. The project centers on the design and analysis of approximation algorithms for matrix and graph mining tasks that follow an iterative, two-step approach: given PB-scale data, first sketch the data using computationally cheap methods, reducing their size from the petabyte scale to the terabyte scale; then process the terabyte-scale sketch using computationally expensive methods. This process will be iterated, using the approximate solutions to improve the quality of the sketches and the approximation guarantees. The project will release software and libraries for matrix and graph mining algorithms that implement such two-phase approaches for petabyte-scale matrices and graphs. Additionally, the developed tools will be applied to the analysis of petabyte-scale data emerging from computer simulations of the dynamics of biomolecular systems. The developed algorithms are therefore expected to impact the areas of linear algebra, randomized algorithms, information retrieval, and data mining, as well as bioinformatics. To disseminate the research, the PIs intend to organize workshops and working group meetings and to publish articles intended for broader audiences.
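The two-step approach described above can be illustrated at toy scale. The following is a minimal sketch under stated assumptions, not the project's software: it assumes the task is overdetermined least squares, uses a CountSketch-style random projection as the cheap sketching step, solves the small sketched problem exactly as the expensive step, and then refines the solution over a few passes. All function names (count_sketch, sketch_and_solve, and the helpers) are hypothetical. The refinement loop reuses one fixed sketch to correct the solution, which is a simpler stand-in for the proposal's idea of using approximate solutions to improve the sketches themselves.

```python
import random

def count_sketch(rows, rhs, sketch_size, rng):
    """Phase 1 (cheap, one pass): add each row, with a random sign,
    into one of sketch_size buckets chosen uniformly at random."""
    d = len(rows[0])
    SA = [[0.0] * d for _ in range(sketch_size)]
    Sb = [0.0] * sketch_size
    for row, b in zip(rows, rhs):
        bucket = rng.randrange(sketch_size)
        sign = rng.choice((-1.0, 1.0))
        for j in range(d):
            SA[bucket][j] += sign * row[j]
        Sb[bucket] += sign * b
    return SA, Sb

def gram(M):
    """Return M^T M for a matrix stored as a list of rows."""
    d = len(M[0])
    return [[sum(r[i] * r[j] for r in M) for j in range(d)] for i in range(d)]

def mat_t_vec(M, v):
    """Return M^T v for a matrix stored as a list of rows."""
    d = len(M[0])
    return [sum(r[j] * vi for r, vi in zip(M, v)) for j in range(d)]

def solve(G, g):
    """Solve the small dense system G x = g by Gaussian elimination
    with partial pivoting."""
    n = len(g)
    aug = [row[:] + [val] for row, val in zip(G, g)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    x = [0.0] * n
    for c in reversed(range(n)):
        tail = sum(aug[c][k] * x[k] for k in range(c + 1, n))
        x[c] = (aug[c][n] - tail) / aug[c][c]
    return x

def sketch_and_solve(rows, rhs, sketch_size, refinements, seed=0):
    """Sketch once, solve on the sketch, then refine against the full data."""
    rng = random.Random(seed)
    SA, Sb = count_sketch(rows, rhs, sketch_size, rng)
    G = gram(SA)                     # small d x d matrix built from the sketch
    x = solve(G, mat_t_vec(SA, Sb))  # phase 2: solve the sketched problem
    for _ in range(refinements):
        # One pass over the full data computes the current residual ...
        resid = [b - sum(a * xi for a, xi in zip(row, x))
                 for row, b in zip(rows, rhs)]
        # ... and the sketched Gram matrix serves as a cheap preconditioner.
        step = solve(G, mat_t_vec(rows, resid))
        x = [xi + si for xi, si in zip(x, step)]
    return x

if __name__ == "__main__":
    rng = random.Random(1)
    x_true = [1.0, -2.0, 0.5]  # illustrative ground truth, not from the source
    rows = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(5000)]
    rhs = [sum(a * t for a, t in zip(r, x_true)) + rng.gauss(0, 0.01)
           for r in rows]
    print(sketch_and_solve(rows, rhs, sketch_size=400, refinements=20))
```

In this toy setting the sketch shrinks 5000 rows to 400, mirroring in miniature the petabyte-to-terabyte reduction. The refinement converges only when the sketch preserves the geometry of the data reasonably well, which is why the sketch size is kept well above the number of columns.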