Citation | In Proc. of the 17th Symposium on Discrete Algorithms |
Authors | Kevin Chang, Ravi Kannan |
We present multipass streaming algorithms for a basic clustering problem
for massive data sets. If our algorithm is allotted $2\ell$ passes, it
will produce an approximation with error at most $\epsilon$ using
$O^*(k^3/\epsilon^{2/\ell})$ bits of memory, the most critical resource
for streaming computation. We demonstrate that this tradeoff between
passes and memory is intrinsic to the problem and model of computation
by proving lower bounds on the memory requirements of any $\ell$-pass
{\em randomized} algorithm that nearly match our upper bounds. To the
best of our knowledge, this is the first time matching bounds have been
proved for such an exponential tradeoff for randomized computation.
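To make the exponential tradeoff concrete (an illustrative calculation under the stated bound, not an additional claim of the paper): with $\ell = 1$, i.e. two passes, the memory bound is $O^*(k^3/\epsilon^2)$, whereas taking $\ell = \log_2(1/\epsilon)$ collapses the $\epsilon$-dependence to a constant, since
$$\epsilon^{-2/\log_2(1/\epsilon)} = \left(2^{-\log_2(1/\epsilon)}\right)^{-2/\log_2(1/\epsilon)} = 2^2 = 4,$$
so $O(\log(1/\epsilon))$ passes already suffice for $O^*(k^3)$ bits.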
In this problem, we are given a set of $n$ points drawn randomly according
to a mixture of $k$ uniform distributions (or, more generally, linear
distributions), and we wish to approximate the density function of the
mixture. The points are placed in a data stream (possibly in adversarial
order), which the algorithm may read only sequentially. We argue that this
realistically models, among other scenarios, the data stream produced by a
national census of the incomes of all citizens.
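As a minimal sketch of this setup (assuming one-dimensional points in $[0,1]$ and a naive single-pass histogram baseline; the mixture parameters and names below are illustrative assumptions, and this is not the paper's algorithm):

    # Sketch of the streaming setup, not the paper's algorithm: points drawn
    # from a mixture of k uniform distributions arrive as a read-once stream,
    # and a naive single-pass histogram serves as a density-estimate baseline.
    import numpy as np

    rng = np.random.default_rng(0)

    k, n, bins = 3, 100_000, 64
    # Hypothetical mixture: component weights and uniform supports [a_i, b_i].
    weights = np.array([0.5, 0.3, 0.2])
    supports = [(0.0, 0.2), (0.3, 0.6), (0.8, 1.0)]

    # Draw the stream; an adversary could reorder it, but a histogram is
    # order-insensitive anyway.
    comps = rng.choice(k, size=n, p=weights)
    stream = np.array([rng.uniform(*supports[c]) for c in comps])

    # One sequential pass over the stream; the bin counts are the only state
    # kept, so memory is O(bins) independent of n.
    counts = np.zeros(bins)
    for x in stream:
        counts[min(int(x * bins), bins - 1)] += 1

    # Piecewise-constant density estimate vs. the true mixture density.
    density = counts * bins / n
    grid = (np.arange(bins) + 0.5) / bins          # bin centers
    true_density = np.zeros(bins)
    for w, (a, b) in zip(weights, supports):
        true_density += w * ((grid >= a) & (grid <= b)) / (b - a)

    # Mean absolute difference over equal-width bins approximates the L1 error.
    print("L1 error:", np.abs(density - true_density).mean())

The histogram keeps only $O(\text{bins})$ counters regardless of $n$, which illustrates why memory, rather than time, is the binding resource in the streaming model.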