Abstract
We consider the problem of approximate sorting of a data stream (in one pass) with limited internal storage where the goal is not to rearrange data but to output a permutation that reflects the ordering of the elements of the data stream as closely as possible. Our main objective is to study the relationship between the quality of the sorting and the amount of available storage. To measure quality, we use permutation distortion metrics, namely the Kendall tau, Chebyshev, and weighted Kendall metrics, as well as mutual information, between the output permutation and the true ordering of data elements. We provide bounds on the performance of algorithms with limited storage and present a simple algorithm that asymptotically requires a constant factor as much storage as an optimal algorithm in terms of mutual information and average Kendall tau distortion. We also study the case in which only information about the most recent elements of the stream is available. This setting has applications to learning user preference rankings in services such as Netflix, where items are presented to the user one at a time.
Similar content being viewed by others
References
Apostol TM (1976) Introduction to analytic number theory. Springer, New York
Babcock B, Babu S, Datar M, Motwani R, Widom J (2002) Models and issues in data stream systems. In: Proceedings of 21st ACM symposium on principles of database systems (PODS), New York
Carterette B (2009) On rank correlation and the distance between rankings. In: Proceedings of 32nd international SIGIR conference on research and development in information retrieval, ACM Press, New York, pp 436–443
Chakrabarti A, Jayram TS, Pǎtraşcu M (2008) Tight lower bounds for selection in randomly ordered streams. In: ACM-SIAM symposium on discrete algorithms (SODA), Society for Industrial and Applied Mathematics, Philadelphia, pp 720–729
Chen CP, Qi F (2008) The best lower and upper bounds of harmonic sequence. Glob J Appl Math Math Sci 1(1):41–49
Corless RM, Gonnet GH, Hare DEG, Jeffrey DJ, Knuth DE (1996) On the Lambert W function. Adv Comput Math 5(1):329–359. doi:10.1007/BF02124750
Cover TM, Thomas JA (2006) Elements of information theory. Wiley, New York
Diaconis P (1988) Group representations in probability and statistics, vol 11. Institute of Mathematical Statistics, Hayward
Farnoud F, Milenkovic O (2013) Aggregating rankings with positional constraints. In: Proceedings of IEEE information theory workshop (ITW), Seville
Farnoud F, Schwartz M, Bruck J (2014a) Rate-distortion for ranking with incomplete information. arXiv preprint: http://arxiv.org/abs/1401.3093
Farnoud F, Schwartz M, Bruck J (2014b) Bounds for permutation rate-distortion. In: Proceedings of IEEE international symposium on information theory (ISIT), Honolulu
Greenwald M, Khanna S (2001) Space-efficient online computation of quantile summaries. In: Proceedings of ACM SIGMOD international conference on management of data, ACM, New York, pp 58–66. doi:10.1145/375663.375670
Hassanzadeh F (2013) Distances on rankings: from social choice to flash memories. Ph.D. thesis, University of Illinois at Urbana–Champaign. http://hdl.handle.net/2142/44268
Holst L (1980) On the lengths of the pieces of a stick broken at random. J Appl Probab 17(3):623–634
Kemeny JG (1959) Mathematics without numbers. Daedalus 88(4):577–591
Kumar R, Vassilvitskii S (2010) Generalized distances between rankings. In: Proceedings of 19th international world wide web conference, Raleigh, pp. 571–580
Manku GS, Rajagopalan S, Lindsay BG (1998) Approximate medians and other quantiles in one pass and with limited memory. In: Proceedings of ACM SIGMOD international conference on management of data, ACM, New York, pp 426–435. doi:10.1145/276304.276342
McGregor A, Valiant P (2012) The shifting sands algorithm. In: ACM-SIAM symposium on discrete algorithms (SODA), SIAM, pp 453–458. http://www.dl.acm.org/citation.cfm?id=2095116.2095155
Munro J, Paterson M (1980) Selection and sorting with limited storage. Theor Comput Sci 12(3):315–323. http://www.sciencedirect.com/science/article/pii/0304397580900614
Sedgewick R, Wayne K (2011) Algorithms, 4th edn. Addison-Wesley Professional, Reading
Shieh GS (1998) A weighted Kendall’s tau statistic. Stat Probab Lett 39(1):17–24
Yilmaz E, Aslam JA, Robertson S (2008) A new rank correlation coefficient for information retrieval. In: Proceedings of 31st annual international SIGIR conference research and development in information retrieval, ACM, New York, pp 587–594
Acknowledgments
The authors would like to thank Ryan Gabrys and Yue Li for useful discussions and comments. Furthermore, the authors thank anonymous reviewers whose comments greatly improved this paper.
Author information
Authors and Affiliations
Corresponding author
Appendix
Appendix
1.1 Proof of (10)
Lemma 6
We have
Proof
To prove the upper bound on \(\sum _{k=2}^{n-m+1}\left( {\begin{array}{c}n-k-1\\ m-2\end{array}}\right) k\lg k\), we use Abel’s identity Apostol (1976, Theorem 4.2), which states that for an arithmetic function a, a real number x, and a function f with a continuous derivative on \(\left[ 1,x\right] \), we have
where \(A\left( y\right) =\sum _{1\le k\le y}a\left( k\right) \) for \(y\in {\mathbb {R}}\). To use Abel’s identity, we let \(a\left( k\right) =k\left( {\begin{array}{c}n-k-1\\ m-2\end{array}}\right) \) and \(f\left( k\right) =\ln k\). Hence,
for \(y\ge 1\) and \(A\left( y\right) =0\) for \(y<1\). By Abel’s identity, we have
The first term on the right side equals \(\left( {\begin{array}{c}n\\ m\end{array}}\right) \ln \left( n-m+1\right) \) and for the second term, we find
Hence,
We proceed as follows:
where we have used the fact that for nonnegative integers \(\ell ,j\), we have
proved below, to obtain the third equality.
To prove
let us write it as
The proof is by induction. The equality (19) holds for \(j=0\) as both sides reduce to \(\sum _{i=1}^{\ell }\frac{1}{i}\). As induction hypothesis, suppose (19) holds for a certain value of j. We show that it also holds for \(j+1\). We have
where for \(\mathsf {(a)}\) we have used the induction hypothesis. \(\square \)
1.2 Proof of (14)
In this subsection, we prove (14) as follows
where \(\mathsf {(a)}\) follows from (13) and where we have used \(\sum _{a=0}^{k+O\left( 1\right) }\left( {\begin{array}{c}a+O\left( 1\right) \\ 2\end{array}}\right) =\frac{k^{3}}{6}+O\left( k^{2}\right) \) and \(\frac{1}{a}\left( {\begin{array}{c}a+2\\ 3\end{array}}\right) =\frac{1}{3}\left( {\begin{array}{c}a+2\\ 2\end{array}}\right) \).
Rights and permissions
About this article
Cite this article
Farnoud, F., Yaakobi, E. & Bruck, J. Approximate sorting of data streams with limited storage. J Comb Optim 32, 1133–1164 (2016). https://doi.org/10.1007/s10878-015-9930-6
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10878-015-9930-6