The Space Complexity of Pass-Efficient Algorithms for Clustering

Chang, Kevin L.; Kannan, Ravi

Item

ITEM ACTIONSEXPORT

Add to Basket

Local TagsRelease HistoryDetailsSummary

Released

Conference Paper

The Space Complexity of Pass-Efficient Algorithms for Clustering

MPS-Authors

/persons/resource/persons44227

Chang, Kevin L.
Algorithms and Complexity, MPI for Informatics, Max Planck Society;

External Resource

No external resources are shared

Fulltext (restricted access)

There are currently no full texts shared for your IP range.

Fulltext (public)

There are no public fulltexts stored in PuRe

Supplementary Material (public)

There is no public supplementary material available

Citation

Chang, K. L., & Kannan, R. (2006). The Space Complexity of Pass-Efficient Algorithms for Clustering. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA'06 (pp. 1157-1166). New York, USA: ACM / Siam.

Cite as: https://hdl.handle.net/11858/00-001M-0000-000F-242C-A

Abstract

We present multiple pass streaming algorithms for a basic clustering problem for massive data sets. If our algorithm is allotted 2l passes, it will produce an approximation with error at most ε using Õ(k3/ε2/l) bits of memory, the most critical resource for streaming computation. We demonstrate that this tradeoff between passes and memory allotted is intrinsic to the problem and model of computation by proving lower bounds on the memory requirements of any l pass randomized algorithm that are nearly matched by our upper bounds. To the best of our knowledge, this is the first time nearly matching bounds have been proved for such an exponential tradeoff for randomized computation.In this problem, we are given a set of n points drawn randomly according to a mixture of k uniform distributions and wish to approximate the density function of the mixture. The points are placed in a datastream (possibly in adversarial order), which may only be read sequentially by the algorithm. We argue that this models, among others, the datastream produced by a national census of the incomes of all citizens.