The Space Complexity of Pass-Efficient Algorithms for Clustering

Full textClick to download.
CitationIn Proc. of the 17th Symposium on Discrete Algorithms
AuthorsKevin Chang
Ravi Kannan


We present multipass streaming algorithms for a basic clustering problem for massive data sets. If our algorithm is allotted $2\ell$ passes, it will produce an approximation with error at most $\epsilon$ using $O^*(k^3/\epsilon^{2/\ell})$ bits of memory, the most critical resource for streaming computation We demonstrate that this tradeoff between passes and memory allotted is intrinsic to the problem and model of computation by proving lower bounds on the memory requirements of any $\ell$ pass {\em randomized} algorithm that are nearly matched by our upper bounds. To the best of our knowledge, this is the first time matching bounds have been proved for such an exponential tradeoff for randomized computation.

In this problem, we are given a set of $n$ points drawn randomly according to a mixture of $k$ uniform distributions (or more generally linear distributions), and wish to approximate the density function of the mixture. The points are placed in a datastream (possibly in adversarial order), which may only be read sequentially by the algorithm. We argue that this realistically models, among others, the datastream produced by a national census of the incomes of all citizens.

Back to publications
Back to previous page