Citation: in Proceedings of the 2007 ACM/SIAM Symposium on
This paper considers the well-studied problem of clustering a set of objects (where each object is an $n$-vector) under a probabilistic model of data in which each object is one of $k$ types. In general, earlier papers (many dealing with the so-called ``planted'' problems on graphs) assume that all coordinates of the objects are independent random variables; they then appeal to random matrix theory, which bounds the largest eigenvalue of such matrices. In most practical applications, however, this full independence is not realistic. Instead, we assume only that the objects are independent; the coordinates of each object may not be. We first generalize the required random matrix bounds to this case of limited independence, using some new techniques developed in Functional Analysis. Surprisingly, we are able to prove results quite similar to those for the fully independent case, losing only extra logarithmic factors. Using these bounds, we develop clustering algorithms with a substantially simpler ``clean-up'' phase than known algorithms. Our algorithms also differ in that objects and their coordinates (features) no longer play symmetric roles. Like the earlier papers, we make certain assumptions on the probabilistic model. We show that our models subsume not only the planted models, but also the Gaussian mixture models, under which there is a substantial body of clustering algorithms.
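The general spectral approach the abstract alludes to can be illustrated with a small sketch: project the data matrix onto its top-$k$ singular subspace, then cluster the projected points. This is not the paper's algorithm (which has its own clean-up phase and model assumptions); it is a minimal, hypothetical example on a 2-component Gaussian mixture, with made-up separation and noise parameters.

```python
# Illustrative sketch only: spectral projection + k-means on a toy
# 2-component Gaussian mixture. All parameters here are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 50, 2

# Two Gaussian centers in R^d, separated by 10 along the first coordinate.
centers = np.zeros((k, d))
centers[0, 0], centers[1, 0] = +5.0, -5.0
labels_true = rng.integers(0, k, size=n)
A = centers[labels_true] + rng.normal(size=(n, d))   # rows = objects

# Project the rows onto the span of the top-k right singular vectors.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
P = A @ Vt[:k].T                                     # n x k projection

# Simple k-means on the projected points, initialized at the extremes of
# the leading projected coordinate so each cluster seeds one center.
C = np.stack([P[P[:, 0].argmin()], P[P[:, 0].argmax()]])
for _ in range(20):
    assign = np.argmin(((P[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
    C = np.array([P[assign == j].mean(0) for j in range(k)])

# Agreement with the true labels, up to permutation of the two labels.
agree = max((assign == labels_true).mean(), (assign != labels_true).mean())
print(agree)
```

With this much separation relative to unit noise, the projection onto the top-$k$ subspace concentrates the inter-cluster signal, so plain k-means recovers the partition; the paper's contribution is proving the needed spectral norm bounds when coordinates within an object are dependent.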