
News
Colloquia
Sublinear Algorithms on Massive Data Sets
Tugkan Batu Simon Fraser University
Thursday, April 7, 2005
JEC 3117  4:00 p.m. to 5:00 p.m.
Refreshments at 3:30 p.m.
Technological advances of the last decades such as cheaply available
computing power, highvolume data storage, and the omnipresence of
networking made it possible to collect and store enormous amounts of
data. Many problems that arise in data mining, machine learning
applications, and analysis of experimental data in scientific fields
require that we understand global statistical properties of data. For
this class of problems, one critical parameter that affects the
efficiency of the algorithm is the size of the domain over which the
data is distributed. Standard statistical techniques perform poorly
when the data is over a large domain. We study the sample complexity
of various fundamental statistical inference tasks as a function of the
domain size of the underlying discrete probability distributions. All
these problems take samples from the distributions as input. The tasks
are (1) to distinguish identical pairs of distributions from pairs of
distributions that have large statistical distance, (2) to test
whether the input distribution is statistically close to an explicitly
specified distribution, (3) to test whether the marginal distributions
induced by a joint distribution are independent, (4) to estimate the
entropy of a distribution. We give algorithms with sublinear sample
complexity for each of these problems. We also show that that our
results for the first three problems are tight up to polylogarithmic
factors. We also investigate special classes of distributions, in
which we can solve these problems much more efficiently.
In addition to statistical problems mentioned above, many problems
from various domains are given sublineartime approximation algorithms
in recent years. To illustrate the surprising power of sublineartime
computation, we show how to determine whether the edit distance
between two given strings is small in sublinear time. Specifically, we
present a test which, given two ncharacter strings, runs in sublinear
time and, with high probability, returns ``CLOSE'' if their edit
distance is $O(n^c)$, where $c<1$, and ``FAR'' if their edit distance is
linear. Our algorithm thus provides a tradeoff between accuracy and
efficiency that is particularly useful when the input data is very large.
Last updated: March 28, 2005

