[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to main content

LotterySampling: A Randomized Algorithm for the Heavy Hitters and Top-k Problems in Data Streams

  • Conference paper
  • First Online:
Computing and Combinatorics (COCOON 2022)

Abstract

We propose a new randomized count-based algorithm to solve the Heavy Hitters and Top-k problems in data streams. This algorithm, called LotterySampling, uses the intuitive concept of “lottery tickets”, to decide which elements to sample. We prove that LotterySampling is inside the \((\delta ,\epsilon )\)-deficient framework for the Heavy Hitters problem and that it has a similar performance to the well known StickySampling algorithm, although they are very different in nature. More importantly, we define a similar \((\delta ,\epsilon )\)-deficient framework for the harder Top-k problem and we prove that LotterySampling is inside it. Hence, LotterySampling can be used, without any previous assumption of the data distribution, as a probabilistic approximation scheme to find the k most frequent elements and to approximate the frequencies of the reported elements by a factor of \(1 - \epsilon \). To the best of our knowledge, this is the first algorithm that gives theoretical guarantees for the Top-k problem for unknown and arbitrary streams, which is the most important contribution of this paper. Its memory usage is adaptive to the distribution of the stream, and it will increase or decrease depending on whether the stream becomes less or more skewed. More precisely, the sample size depends, at any given moment, on the unknown \(k^{\text {th}}\) highest frequency but it is independent of the length of the stream and of the number of distinct elements. The user just needs to provide two parameters that determine the quality of the answers, independently of the stream. We compare LotterySampling with other existing probabilistic and deterministic algorithms showing its strengths and weaknesses.

This work has been supported by funds from the MOTION Project (Project PID2020-112581GB-C21) of the Spanish Ministry of Science & Innovation MCIN/AEI/10.13039/501100011033, while the second author was a postgraduate student at Universitat Politècnica de Catalunya.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
£29.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Chapter
GBP 19.95
Price includes VAT (United Kingdom)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
GBP 55.99
Price includes VAT (United Kingdom)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
GBP 69.99
Price includes VAT (United Kingdom)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Similar content being viewed by others

References

  1. Bandi, N., Metwally, A., Agrawal, D., El Abbadi, A.: Fast data stream algorithms using associative memories. In: Proceedings of the ACM International Conference on Management of Data (SIGMOD 2007), pp. 247–256 (2007)

    Google Scholar 

  2. Charikar, M., Chen, K., Farach-Colton, M.: Finding frequent items in data streams. Theoret. Comput. Sci. 312(1), 3–15 (2004)

    Article  MathSciNet  MATH  Google Scholar 

  3. Cormode, G., Hadjieleftheriou, M.: Finding frequent items in data streams. Proc. VLDB Endow. 1(2), 1530–1541 (2008)

    Article  Google Scholar 

  4. Gibbons, P.B., Matias, Y.: New sampling-based summary statistics for improving approximate query answers. In: Proceedings of the ACM International Conference on Management of Data (SIGMOD 1998), pp. 331–342 (1998)

    Google Scholar 

  5. Lam, H., Calders, T.: Mining top-k frequent items in a data stream with flexible sliding windows. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2010), pp. 283–292 (2010)

    Google Scholar 

  6. Manku, G.S., Motwani, R.: Approximate frequency counts over data streams. In: Proceedings of the 28th International Conference on Very Large Data Bases (VLDB 2002), pp. 346–357. Morgan Kaufmann (2002)

    Google Scholar 

  7. Martínez, C., Solera-Pardo, G.: LotterySampling: a randomized algorithm for the heavy hitters and top-\(k\) problems in data streams (2022). https://www.researchgate.net/publication/362910274_LotterySampling_A_Randomized_Algorithm_for_the_Heavy_Hitters_and_Top-k_Problems_in_Data_Streams. Full version of the conference paper. Available on-line in ResearchGate

  8. Metwally, A., Agrawal, D., El Abbadi, A.: Efficient computation of frequent and top-k elements in data streams. In: Eiter, T., Libkin, L. (eds.) ICDT 2005. LNCS, vol. 3363, pp. 398–412. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-30570-5_27

    Chapter  Google Scholar 

  9. Muthukrishnan, S.: Data streams: algorithms and applications. Found. Trends Theor. Comput. Sci. 1(2), 117–236 (2005)

    Article  MathSciNet  MATH  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Conrado Martínez .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Martínez, C., Solera-Pardo, G. (2022). LotterySampling: A Randomized Algorithm for the Heavy Hitters and Top-k Problems in Data Streams. In: Zhang, Y., Miao, D., Möhring, R. (eds) Computing and Combinatorics. COCOON 2022. Lecture Notes in Computer Science, vol 13595. Springer, Cham. https://doi.org/10.1007/978-3-031-22105-7_3

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-22105-7_3

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-22104-0

  • Online ISBN: 978-3-031-22105-7

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics