Full text  Click to download. 
Citation 
PhD Thesis, Yale University, 2006.

Author  Kevin Chang

The proliferation of computational problems involving massive data sets has necessitated the design of computational paradigms that model the extra constraints placed on systems processing very large inputs. Among these algorithmic paradigms, the passefficient model captures the constraints that the input may be much too large to fit in main memory for processing, and that system performance is optimized by sequential access to the data in storage. Thus, in the passefficient model of computation, an algorithm may make a constant number of sequential passes over readonly input while using a small amount of random access memory. The resources to be optimized are memory, number of passes, and per element processing time. We give passefficient algorithms for clustering and finding structure in large amounts of data. Our algorithms have the property that the number of passes allotted is an input parameter to the algorithm. We answer questions regarding the intrinsic tradeoffs between the number of passes used by a passefficient algorithm and the amount of random access memory required. Our algorithms use adaptive sampling techniques that are quite general and can be used to solve many massive data set problems. The first family of clustering problems that we consider is learning mixtures of distributions. In these problems, we are given samples drawn according to a probability distribution known as a mixture of distributions, and must reconstruct the density function of the original mixture. Our algorithms show tradeoffs between the number of passes and the amount of memory required: if the algorithm makes a few extra passes, the amount of memory required drops off sharply. We also prove lower bounds on the amount of memory needed by any Lpass randomized algorithm, thus showing that our tradeoff is nearly tight. The second family of clustering problems that we consider is the combinatorial optimization problem of facility location and related problems. Our passefficient algorithms for this problem exhibit the same sharp tradeoffs as our algorithm for learning mixtures of distributions. We also give clustering algorithms that are not in the streaming model for partitioning a graph to approximately minimize certain natural objective functions.