Abstract
We propose a new randomized count-based algorithm to solve the Heavy Hitters and Top-k problems in data streams. This algorithm, called LotterySampling, uses the intuitive concept of “lottery tickets”, to decide which elements to sample. We prove that LotterySampling is inside the \((\delta ,\epsilon )\)-deficient framework for the Heavy Hitters problem and that it has a similar performance to the well known StickySampling algorithm, although they are very different in nature. More importantly, we define a similar \((\delta ,\epsilon )\)-deficient framework for the harder Top-k problem and we prove that LotterySampling is inside it. Hence, LotterySampling can be used, without any previous assumption of the data distribution, as a probabilistic approximation scheme to find the k most frequent elements and to approximate the frequencies of the reported elements by a factor of \(1 - \epsilon \). To the best of our knowledge, this is the first algorithm that gives theoretical guarantees for the Top-k problem for unknown and arbitrary streams, which is the most important contribution of this paper. Its memory usage is adaptive to the distribution of the stream, and it will increase or decrease depending on whether the stream becomes less or more skewed. More precisely, the sample size depends, at any given moment, on the unknown \(k^{\text {th}}\) highest frequency but it is independent of the length of the stream and of the number of distinct elements. The user just needs to provide two parameters that determine the quality of the answers, independently of the stream. We compare LotterySampling with other existing probabilistic and deterministic algorithms showing its strengths and weaknesses.
This work has been supported by funds from the MOTION Project (Project PID2020-112581GB-C21) of the Spanish Ministry of Science & Innovation MCIN/AEI/10.13039/501100011033, while the second author was a postgraduate student at Universitat Politècnica de Catalunya.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Bandi, N., Metwally, A., Agrawal, D., El Abbadi, A.: Fast data stream algorithms using associative memories. In: Proceedings of the ACM International Conference on Management of Data (SIGMOD 2007), pp. 247–256 (2007)
Charikar, M., Chen, K., Farach-Colton, M.: Finding frequent items in data streams. Theoret. Comput. Sci. 312(1), 3–15 (2004)
Cormode, G., Hadjieleftheriou, M.: Finding frequent items in data streams. Proc. VLDB Endow. 1(2), 1530–1541 (2008)
Gibbons, P.B., Matias, Y.: New sampling-based summary statistics for improving approximate query answers. In: Proceedings of the ACM International Conference on Management of Data (SIGMOD 1998), pp. 331–342 (1998)
Lam, H., Calders, T.: Mining top-k frequent items in a data stream with flexible sliding windows. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2010), pp. 283–292 (2010)
Manku, G.S., Motwani, R.: Approximate frequency counts over data streams. In: Proceedings of the 28th International Conference on Very Large Data Bases (VLDB 2002), pp. 346–357. Morgan Kaufmann (2002)
Martínez, C., Solera-Pardo, G.: LotterySampling: a randomized algorithm for the heavy hitters and top-\(k\) problems in data streams (2022). https://www.researchgate.net/publication/362910274_LotterySampling_A_Randomized_Algorithm_for_the_Heavy_Hitters_and_Top-k_Problems_in_Data_Streams. Full version of the conference paper. Available on-line in ResearchGate
Metwally, A., Agrawal, D., El Abbadi, A.: Efficient computation of frequent and top-k elements in data streams. In: Eiter, T., Libkin, L. (eds.) ICDT 2005. LNCS, vol. 3363, pp. 398–412. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-30570-5_27
Muthukrishnan, S.: Data streams: algorithms and applications. Found. Trends Theor. Comput. Sci. 1(2), 117–236 (2005)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Martínez, C., Solera-Pardo, G. (2022). LotterySampling: A Randomized Algorithm for the Heavy Hitters and Top-k Problems in Data Streams. In: Zhang, Y., Miao, D., Möhring, R. (eds) Computing and Combinatorics. COCOON 2022. Lecture Notes in Computer Science, vol 13595. Springer, Cham. https://doi.org/10.1007/978-3-031-22105-7_3
Download citation
DOI: https://doi.org/10.1007/978-3-031-22105-7_3
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-22104-0
Online ISBN: 978-3-031-22105-7
eBook Packages: Computer ScienceComputer Science (R0)