Citation: In Proceedings of the 2007 ACM/SIAM Symposium on Discrete Algorithms (SODA)

Authors: Anirban Dasgupta, John Hopcroft, Ravi Kannan, Pradipta Mitra
This paper considers the well-studied problem of clustering a set of objects (where each object is an $n$-vector) under a probabilistic model of data in which each object is one of $k$ types. In general, earlier papers (many dealing with the so-called ``planted'' problems on graphs) assume that all coordinates of all objects are independent random variables. They then appeal to the theory of random matrices, which bounds the largest eigenvalue of such matrices. In most practical applications, however, this full independence is not realistic. Instead, we only assume that the objects are independent; the coordinates of each object may not be. We first generalize the required results from random matrix theory to this case of limited independence, using some new techniques developed in functional analysis. Surprisingly, we are able to prove results quite similar to those of the fully independent case, losing only logarithmic factors. Using these bounds, we develop clustering algorithms. Our clustering algorithms have a substantially simpler ``cleanup'' phase than known algorithms. They also differ in that objects and their coordinates (features) are no longer symmetric. Like the earlier papers, we make certain assumptions on the probabilistic model. We show that our models subsume not only the planted models, but also another set of models for which there is a body of clustering algorithms, namely Gaussian mixture models.
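To make the overall pipeline concrete, below is a minimal sketch of the generic spectral-clustering approach the abstract refers to: project the objects onto the span of the top-$k$ singular vectors of the data matrix, then cluster in that low-dimensional space. This is not the paper's algorithm (in particular, its cleanup phase is not reproduced here); the toy Gaussian-mixture instance, the separation of the centers, and the plain k-means cleanup are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy instance: k = 2 types (centers) in n = 50 dimensions,
# 100 objects per type, with independent per-object Gaussian noise.
# The separation factor 3.0 is chosen only so the structure is recoverable.
n, k, per = 50, 2, 100
centers = 3.0 * rng.normal(size=(k, n))
labels = np.repeat(np.arange(k), per)
A = centers[labels] + rng.normal(size=(k * per, n))  # objects x coordinates

# Spectral step: project objects onto the top-k right singular vectors,
# i.e. onto the best rank-k approximation of the data matrix.
_, _, Vt = np.linalg.svd(A, full_matrices=False)
P = A @ Vt[:k].T  # shape (k * per, k): projected coordinates


def two_means(X, iters=50):
    """Plain 2-means in the projected space (stand-in cleanup phase).

    Initializes the second center at the point farthest from the first,
    which keeps both clusters nonempty on well-separated data.
    """
    c0 = X[0]
    c1 = X[np.argmax(((X - c0) ** 2).sum(axis=1))]
    C = np.stack([c0, c1])
    for _ in range(iters):
        assign = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
        C = np.stack([X[assign == j].mean(axis=0) for j in range(2)])
    return assign


assign = two_means(P)
# Agreement with the planted labels, up to permutation of cluster names.
acc = max(np.mean(assign == labels), np.mean(assign != labels))
```

On a well-separated instance like this one, the projection collapses most of the noise and the simple cleanup recovers the planted types; the paper's contribution is making such guarantees go through when coordinates within an object are dependent.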