[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
article

Order statistics and estimating cardinalities of massive data sets

Published: 20 January 2009 Publication History

Abstract

A new class of algorithms to estimate the cardinality of very large multisets using constant memory and doing only one pass on the data is introduced here. It is based on order statistics rather than on bit patterns in binary representations of numbers. Three families of estimators are analyzed. They attain a standard error of 1M using M units of storage, which places them in the same class as the best known algorithms so far. The algorithms have a very simple internal loop, which gives them an advantage in terms of processing speed. For instance, a memory of only 12 kB and only few seconds are sufficient to process a multiset with several million elements and to build an estimate with accuracy of order 2 percent. The algorithms are validated both by mathematical analysis and by experimentations on real internet traffic.

References

[1]
Bar-Yossef, Z., Jayram, T.S., Kumar, R., Sivakumar, D. and Trevisan, Luca, Counting distinct elements in a data stream. In: RANDOM '02: Proceedings of the 6th International Workshop on Randomization and Approximation Techniques, Springer-Verlag. pp. 1-10.
[2]
Broder, A., On the resemblance and containment of documents. In: SEQUENCES '97: Proceedings of the Compression and Complexity of Sequences 1997, IEEE Computer Society, Washington, DC, USA. pp. 21
[3]
Broder, Andrei Z., Identifying and filtering near-duplicate documents. In: COM '00: Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching, Springer-Verlag, London, UK. pp. 1-10.
[4]
P. Chassaing, L. Gerin, Efficient estimation of the cardinality of large data sets, in: math.ST/0701347, 2007. Extended abstract in the proceedings of the 4th Colloquium on Mathematics and Computer Science, 2007, pp.¿419-422
[5]
Considine, J., Li, F., Kollios, G. and Byers, J., Approximate aggregation techniques for sensor databases. In: ICDE '04: Proceedings of the 20th International Conference on Data Engineering, IEEE Computer Society, Washington, DC, USA. pp. 449
[6]
Durand, M. and Flajolet, P., Loglog counting of large cardinalities. In: Di Battista, G., Zwick, U. (Eds.), Lecture Notes in Computer Science, vol.~2832. pp. 605-617.
[7]
Estan, C., Varghese, G. and Fisk, M., Bitmap algorithms for counting active flows on high-speed links. IEEE/ACM Trans. Netw. v14 i5. 925-937.
[8]
Flajolet, P., Adaptive sampling. In: Hazewinkel, M. (Ed.), Encyclopaedia of Mathematics, vol.~Suppl.~I. Kluwer Academic Publishers, Dordrecht. pp. 28
[9]
P. Flajolet, E. Fusy, O. Gandouet, F. Meunier, Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm, in: Philippe Jacquet (Ed.) Analysis of Algorithm 2007, AofA07, Discrete Mathematics and Theoretical Computer Science Proceedings, 2007 (in press)
[10]
Flajolet, P. and Martin, P.N., Probabilistic counting. In: Proceedings of the 24th Annual Symposium on Foundations of Computer Science, IEEE Computer Society Press. pp. 76-82.
[11]
C. Fraleigh, C. Diot, B. Lyles, S. Moon, P. Owezarski, K. Papagiannaki, F. Tobagi, Design and deployment of a passive monitoring infrastructure, in: Passive and Active Measurement Workshop, Amsterdam, April 2001
[12]
Fusy, í. and Giroire, F., Estimating the number of active flows in a data stream over a sliding window. In: Appelgate, David (Ed.), Proceedings of the Ninth Workshop on Algorithm Engineering and Experiments and the Fourth Workshop on Analytic Algorithmics and Combinatorics, SIAM Press. pp. 223-231.
[13]
L. Getoor, B. Taskar, D. Koller, Selectivity estimation using probabilistic models, in: SIGMOD Conference, 2001
[14]
P.B. Gibbons, Distinct sampling for highly-accurate answers to distinct values queries and event reports, VLDB J. (2001) 541-550
[15]
Giroire, F., Order statistics and estimating cardinalities of massive data sets. In: Martnez, Conrado (Ed.), International Conference on Analysis of Algorithms, volume AD of DMTCS Proceedings, pp. 157-166.
[16]
F. Giroire, Directions to use probabilistic algorithms for cardinality for DNA analysis, in: Journées Ouvertes Biologie Informatique Mathématiques, JOBIM 2006, pp.¿3-5, July 2006
[17]
F. Giroire, Réseaux, algorithmique et analyse combinatoire de grands ensembles, Ph.D. Thesis, Université Paris VI, November 2006
[18]
G. Iannaccone, C. Diot, I. Graham, N. McKeown, Monitoring very high speed links, in: ACM SIGCOMM Internet Measurement Workshop, San Francisco, CA, November 2001
[19]
Knuth, D.E., The Art of Computer Programming, Volume~3: Sorting and Searching. 1973. Addison-Wesley.
[20]
Whang, K.-Y., Zanden, B.T.V. and Taylor, H.M., A linear-time probabilistic counting algorithm for database applications. TODS. v15 i2. 208-229.

Cited By

View all
  • (2024)QSketch: An Efficient Sketch for Weighted Cardinality Estimation in StreamsProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining10.1145/3637528.3671695(2432-2443)Online publication date: 25-Aug-2024
  • (2024)TTLs Matter: Efficient Cache Sizing with TTL-Aware Miss Ratio Curves and Working Set SizesProceedings of the Nineteenth European Conference on Computer Systems10.1145/3627703.3650066(387-404)Online publication date: 22-Apr-2024
  • (2023)Sketch-flip-mergeProceedings of the 40th International Conference on Machine Learning10.5555/3618408.3618930(12846-12865)Online publication date: 23-Jul-2023
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Discrete Applied Mathematics
Discrete Applied Mathematics  Volume 157, Issue 2
January, 2009
240 pages

Publisher

Elsevier Science Publishers B. V.

Netherlands

Publication History

Published: 20 January 2009

Author Tags

  1. Algorithm analysis
  2. Cardinality estimates
  3. Traffic analysis
  4. Very large multisets

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 11 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2024)QSketch: An Efficient Sketch for Weighted Cardinality Estimation in StreamsProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining10.1145/3637528.3671695(2432-2443)Online publication date: 25-Aug-2024
  • (2024)TTLs Matter: Efficient Cache Sizing with TTL-Aware Miss Ratio Curves and Working Set SizesProceedings of the Nineteenth European Conference on Computer Systems10.1145/3627703.3650066(387-404)Online publication date: 22-Apr-2024
  • (2023)Sketch-flip-mergeProceedings of the 40th International Conference on Machine Learning10.5555/3618408.3618930(12846-12865)Online publication date: 23-Jul-2023
  • (2023)Better Cardinality Estimators for HyperLogLog, PCSA, and BeyondProceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3584372.3588680(317-327)Online publication date: 18-Jun-2023
  • (2023)Compressing Distributed Network Sketches With Traffic-Aware SummariesIEEE Transactions on Network and Service Management10.1109/TNSM.2022.317229920:2(1962-1975)Online publication date: 1-Jun-2023
  • (2023)Cardinality estimation with smoothing autoregressive modelsWorld Wide Web10.1007/s11280-023-01195-726:5(3441-3461)Online publication date: 28-Jul-2023
  • (2022)Order-invariant cardinality estimators are differentially privateProceedings of the 36th International Conference on Neural Information Processing Systems10.5555/3600270.3601376(15204-15216)Online publication date: 28-Nov-2022
  • (2022)A learned query rewrite system using Monte Carlo tree searchProceedings of the VLDB Endowment10.14778/3485450.348545615:1(46-58)Online publication date: 14-Jan-2022
  • (2022)SHE: A Generic Framework for Data Stream Mining over Sliding WindowsProceedings of the 51st International Conference on Parallel Processing10.1145/3545008.3545009(1-12)Online publication date: 29-Aug-2022
  • (2022)HyperLogLogLogProceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining10.1145/3534678.3539246(753-761)Online publication date: 14-Aug-2022
  • Show More Cited By

View Options

View options

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media