Citation: In Proceedings of the 2006 ACM Conference on Information and Knowledge Management

Authors: Jian Zhang, Joan Feigenbaum
We consider the problem of finding highly correlated pairs in a large data set. That is, given a threshold that is not too small, we wish to report all pairs of items (or binary attributes) whose (Pearson) correlation coefficients exceed the threshold. Correlation analysis is an important step in many statistical and knowledge-discovery tasks. Normally, the number of highly correlated pairs is quite small compared to the total number of pairs, so identifying them naively, by computing the correlation coefficients of all pairs, is wasteful. With massive data sets, where the total number of pairs may exceed the main-memory capacity, the computational cost of the naive method is prohibitive.

In their KDD'04 paper [15], Hui Xiong et al. address this problem with the TAPER algorithm, which goes through the data set in two passes. The first pass generates a set of candidate pairs whose correlation coefficients are then computed directly in the second pass. The efficiency of the algorithm depends greatly on the selectivity (pruning power) of its candidate-generating stage.

In this work, we adopt the general framework of the TAPER algorithm but propose a different candidate-generation method. For a pair of items, TAPER's candidate-generation method considers only the frequencies (supports) of the individual items. Our method also takes into account the frequency (support) of the pair, but without explicitly counting it: we give a simple randomized algorithm whose false-negative probability is negligible. The space and time complexities of generating the candidate set in our algorithm are asymptotically the same as TAPER's. We conduct experiments on synthetic and real data. The results show that our algorithm produces a greatly reduced candidate set, one that can be several orders of magnitude smaller than that generated by TAPER. Because of this, our algorithm uses much less memory and can be faster; the former is critical for dealing with massive data.
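To make the pruning idea concrete, the sketch below computes the Pearson (phi) correlation of binary attributes from counts and applies a support-only upper bound of the kind TAPER uses for candidate generation: since the pair count can be at most the smaller of the two item counts, phi is bounded by an expression involving only the individual supports, so low-potential pairs can be skipped without ever counting their co-occurrences. This is a minimal illustration on toy data, not the paper's implementation; the data, threshold, and function names are invented for the example.

```python
import math
from itertools import combinations

def phi(n, na, nb, nab):
    """Pearson (phi) correlation of two binary attributes, from the
    number of rows n, item counts na and nb, and pair count nab."""
    denom = math.sqrt(na * nb * (n - na) * (n - nb))
    return (n * nab - na * nb) / denom if denom else 0.0

def support_upper_bound(n, na, nb):
    """Upper bound on phi that uses only the individual supports
    (obtained by setting nab to its maximum, min(na, nb))."""
    fa, fb = na / n, nb / n
    if fa < fb:
        fa, fb = fb, fa  # ensure fa >= fb
    return math.sqrt((fb / fa) * ((1 - fa) / (1 - fb)))

# Toy transaction data: each row is the set of items it contains.
rows = [{"a", "b"}, {"a", "b"}, {"a"}, {"b", "c"}, {"c"}, {"a", "b", "c"}]
items = sorted({i for r in rows for i in r})
n = len(rows)
count = {i: sum(i in r for r in rows) for i in items}

theta = 0.2  # correlation threshold (illustrative)
for x, y in combinations(items, 2):
    if support_upper_bound(n, count[x], count[y]) < theta:
        continue  # pruned without counting the pair's co-occurrences
    nab = sum(x in r and y in r for r in rows)  # second pass: exact count
    if phi(n, count[x], count[y], nab) >= theta:
        print(x, y)  # prints: a b
```

The bound follows from substituting nab = min(na, nb) into the phi formula; the paper's contribution is a tighter randomized pruning stage that also exploits information about the pair's support, which this sketch does not show.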