* Faculty       * Staff       * Students & Alumni       * Committees       * Contact       * Institute Directory
* Undergraduate Program       * Graduate Program       * Courses       * Institute Catalog      
* Undergraduate       * Graduate       * Institute Admissions: Undergraduate | Graduate      
* Colloquia       * Seminars       * News       * Events       * Institute Events      
* Overview       * Lab Manual       * Institute Computing      
No Menu Selected

* News


Sublinear Algorithms on Massive Data Sets

Tugkan Batu
Simon Fraser University

Thursday, April 7, 2005
JEC 3117 - 4:00 p.m. to 5:00 p.m.
Refreshments at 3:30 p.m.

Technological advances of the last decades such as cheaply available computing power, high-volume data storage, and the omnipresence of networking made it possible to collect and store enormous amounts of data. Many problems that arise in data mining, machine learning applications, and analysis of experimental data in scientific fields require that we understand global statistical properties of data. For this class of problems, one critical parameter that affects the efficiency of the algorithm is the size of the domain over which the data is distributed. Standard statistical techniques perform poorly when the data is over a large domain. We study the sample complexity of various fundamental statistical inference tasks as a function of the domain size of the underlying discrete probability distributions. All these problems take samples from the distributions as input. The tasks are (1) to distinguish identical pairs of distributions from pairs of distributions that have large statistical distance, (2) to test whether the input distribution is statistically close to an explicitly specified distribution, (3) to test whether the marginal distributions induced by a joint distribution are independent, (4) to estimate the entropy of a distribution. We give algorithms with sublinear sample complexity for each of these problems. We also show that that our results for the first three problems are tight up to polylogarithmic factors. We also investigate special classes of distributions, in which we can solve these problems much more efficiently.

In addition to statistical problems mentioned above, many problems from various domains are given sublinear-time approximation algorithms in recent years. To illustrate the surprising power of sublinear-time computation, we show how to determine whether the edit distance between two given strings is small in sublinear time. Specifically, we present a test which, given two n-character strings, runs in sublinear time and, with high probability, returns ``CLOSE'' if their edit distance is $O(n^c)$, where $c<1$, and ``FAR'' if their edit distance is linear. Our algorithm thus provides a trade-off between accuracy and efficiency that is particularly useful when the input data is very large.

Last updated: March 28, 2005