Efficient Top-K Query Calculation in Distributed Networks

Pei Cao and Zhe Wang

Abstract

This paper presents a new algorithm to answer top-k queries (e.g. ``find the k objects with the highest aggregate values'') in a distributed network. Existing algorithms such as the Threshold Algorithm consume an excessive amount of bandwidth when the number of nodes, m, is high. We propose a new algorithm called ``Three-Phase Uniform Threshold'' (TPUT). TPUT reduces network bandwidth consumption by pruning away ineligible objects, and terminates in three round-trips regardless of data input.

The paper presents two sets of results about TPUT. First, trace-driven simulations show that, depending on the size of the network, TPUT reduces network traffic by one to two orders of magnitude compared to existing algorithms. Second, TPUT is proven to be instance-optimal on data series that satisfy a lower bound on the slope of decreases in values. In particular, analysis shows that by using a pruning parameter less than 1, TPUT achieves a qualitative reduction in network traffic, for example, lowering the optimality ratio from $O(m*m)$ to $O(m*\sqrt{m})$ for data series following Zipf distribution.

The paper in PDF is here.