[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

Stable learned bloom filters for data streams

Published: 01 July 2020 Publication History

Abstract

Bloom filter and its variants are elegant space-efficient probabilistic data structures for approximate set membership queries. It has been recently shown that the space cost of Bloom filters can be significantly reduced via a combination with pre-trained machine learning models, named Learned Bloom filters (LBF). LBF eases the space requirement of a Bloom filter by undertaking part of the queries using a classifier. However, current LBF structures generally target a static member set. Their performances would inevitably decay when there is a member update on the set, while this update requirement is not uncommon for real-world data streaming applications such as duplicate item detection, malicious URL checking, and web caching. To adapt LBF to data streams, we propose the Stable Learned Bloom Filters (SLBF) which addresses the performance decay issue on intensive insertion workloads by combining classifier with updatable backup filters. Specifically, we propose two SLBF structures, Single SLBF (s-SLBF) and Grouping SLBF (g-SLBF). The theoretical analysis on these two structures shows that the expected false positive rate (FPR) of SLBF is asymptotically a constant over the insertion of new members. Extensive experiments on real-world datasets show that SLBF introduces a similar level of false negative rate (FNR) but yields a better FPR/storage trade-off compared with the state-of-the-art (non-learned) Bloom filters optimized on data streams.

References

[1]
xxhash. http://cyan4973.github.io/xxHash/. [Online; accessed 28-Feb-2020].
[2]
Amazon.com - Employee Access Challenge. https://www.kaggle.com/c/amazon-employee-access-challenge/data, 2020. [Online; accessed 2-July-2020].
[3]
CatBoost - open-source gradient boosting library. https://catboost.ai/, 2020. [Online; accessed 2-July-2020].
[4]
UCI Machine Learning Repository: HIGGS Data Set. https://archive.ics.uci.edu/ml/datasets/HIGGS, 2020. [Online; accessed 2-July-2020].
[5]
UCI Machine Learning Repository: Kitsune Network Attack Dataset Data Set. https://archive.ics.uci.edu/ml/datasets/Kitsune+Network+Attack+Dataset, 2020. [Online; accessed 2-July-2020].
[6]
P. S. Almeida, C. Baquero, N. M. Preguiça, and D. Hutchison. Scalable bloom filters. Inf. Process. Lett., 101(6):255--261, 2007.
[7]
B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422--426, 1970.
[8]
K. Bratbergsengen. Hashing methods and relational algebra operations. In VLDB, pages 323--333. Morgan Kaufmann, 1984.
[9]
A. Z. Broder and M. Mitzenmacher. Survey: Network applications of bloom filters: A survey. Internet Mathematics, 1(4):485--509, 2003.
[10]
B. Chazelle, J. Kilian, R. Rubinfeld, and A. Tal. The bloomier filter: an efficient data structure for static support lookup tables. In SODA, pages 30--39. SIAM, 2004.
[11]
S. Cohen and Y. Matias. Spectral bloom filters. In SIGMOD Conference, pages 241--252. ACM, 2003.
[12]
Z. Dai and A. Shrivastava. Adaptive learned bloom filter (Ada-BF): Efficient utilization of the classifier. CoRR, abs/1910.09131, 2019.
[13]
F. Deng and D. Rafiei. Approximately detecting duplicates for streaming data using stable bloom filters. In SIGMOD Conference, pages 25--36. ACM, 2006.
[14]
J. Ding, U. F. Minhas, H. Zhang, Y. Li, C. Wang, B. Chandramouli, J. Gehrke, D. Kossmann, and D. B. Lomet. ALEX: an updatable adaptive learned index. CoRR, abs/1905.08898, 2019.
[15]
S. Dutta, A. Narang, and S. K. Bera. Streaming quotient filter: A near optimal approximate duplicate detection approach for data streams. PVLDB, 6(8):589--600, 2013.
[16]
B. Fan, D. G. Andersen, M. Kaminsky, and M. Mitzenmacher. Cuckoo filter: Practically better than bloom. In CoNEXT, pages 75--88. ACM, 2014.
[17]
L. Fan, P. Cao, J. M. Almeida, and A. Z. Broder. Summary cache: a scalable wide-area web cache sharing protocol. IEEE/ACM Trans. Netw., 8(3):281--293, 2000.
[18]
D. Guo, J. Wu, H. Chen, and X. Luo. Theory and network applications of dynamic bloom filters. In INFOCOM. IEEE, 2006.
[19]
D. Guo, J. Wu, H. Chen, Y. Yuan, and X. Luo. The dynamic bloom filters. IEEE Trans. Knowl. Data Eng., 22(1):120--133, 2010.
[20]
G. H. Hardy, J. E. Littlewood, G. Pólya, D. Littlewood, G. Pólya, et al. Inequalities. Cambridge university press, 1952.
[21]
S. Idreos, N. Dayan, W. Qin, M. Akmanalp, S. Hilgard, A. Ross, J. Lennon, V. Jain, H. Gupta, D. Li, and Z. Zhu. Design continuums and the path toward self-designing key-value stores that know and learn. In CIDR. www.cidrdb.org, 2019.
[22]
N. Jain, M. Dahlin, and R. Tewari. Using bloom filters to refine web search results. In WebDB, pages 25--30, 2005.
[23]
A. Kirsch and M. Mitzenmacher. Less hashing, same performance: Building a better bloom filter. Random Struct. Algorithms, 33(2):187--218, 2008.
[24]
T. Kraska, M. Alizadeh, A. Beutel, E. H. Chi, A. Kristo, G. Leclerc, S. Madden, H. Mao, and V. Nathan. Sagedb: A learned database system. In CIDR. www.cidrdb.org, 2019.
[25]
T. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis. The case for learned index structures. In SIGMOD Conference, pages 489--504. ACM, 2018.
[26]
A. Kumar, J. Xu, and J. Wang. Space-code bloom filter for efficient per-flow traffic measurement. IEEE J. Sel. Areas Commun., 24(12):2327--2339, 2006.
[27]
R. C. Marcus, P. Negi, H. Mao, C. Zhang, M. Alizadeh, T. Kraska, O. Papaemmanouil, and N. Tatbul. Neo: A learned query optimizer. PVLDB, 12(11):1705--1718, 2019.
[28]
R. C. Marcus and O. Papaemmanouil. Plan-structured deep neural network models for query performance prediction. PVLDB, 12(11):1733--1746, 2019.
[29]
A. Metwally, D. Agrawal, and A. El Abbadi. Duplicate detection in click streams. In WWW, pages 12--21. ACM, 2005.
[30]
M. Mitzenmacher. A model for learned bloom filters and optimizing by sandwiching. In NeurIPS, pages 462--471, 2018.
[31]
J. K. Mullin. Optimal semijoins for distributed database systems. IEEE Trans. Software Eng., 16(5):558--560, 1990.
[32]
V. Nathan, J. Ding, M. Alizadeh, and T. Kraska. Learning multi-dimensional indexes. CoRR, abs/1912.01668, 2019.
[33]
J. W. Rae, S. Bartunov, and T. P. Lillicrap. Meta-learning neural bloom filters. In ICML, volume 97, pages 5271--5280. PMLR, 2019.
[34]
J. Sun and G. Li. An end-to-end learning-based cost estimator. PVLDB, 13(3):307--319, 2019.
[35]
C. Wu, A. Jindal, S. Amizadeh, H. Patel, W. Le, S. Qiao, and S. Rao. Towards a learning optimizer for shared clouds. PVLDB, 12(3):210--222, 2018.
[36]
Z. Yang, E. Liang, A. Kamsetty, C. Wu, Y. Duan, P. Chen, P. Abbeel, J. M. Hellerstein, S. Krishnan, and I. Stoica. Deep unsupervised cardinality estimation. PVLDB, 13(3):279--292, 2019.
[37]
P. Zhao and S. C. H. Hoi. Cost-sensitive online active learning with application to malicious URL detection. In KDD, pages 919--927. ACM, 2013.

Cited By

View all
  • (2024)Optimizing Collections of Bloom Filters within a Space BudgetProceedings of the VLDB Endowment10.14778/3681954.368202017:11(3551-3564)Online publication date: 1-Jul-2024
  • (2024)Enabling space-time efficient range queries with REncoderThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-024-00873-w33:6(1837-1859)Online publication date: 1-Nov-2024
  • (2023)Learned Index: A Comprehensive Experimental EvaluationProceedings of the VLDB Endowment10.14778/3594512.359452816:8(1992-2004)Online publication date: 22-Jun-2023
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment
Proceedings of the VLDB Endowment  Volume 13, Issue 12
August 2020
1710 pages
ISSN:2150-8097
Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 July 2020
Published in PVLDB Volume 13, Issue 12

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)121
  • Downloads (Last 6 weeks)11
Reflects downloads up to 03 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2024)Optimizing Collections of Bloom Filters within a Space BudgetProceedings of the VLDB Endowment10.14778/3681954.368202017:11(3551-3564)Online publication date: 1-Jul-2024
  • (2024)Enabling space-time efficient range queries with REncoderThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-024-00873-w33:6(1837-1859)Online publication date: 1-Nov-2024
  • (2023)Learned Index: A Comprehensive Experimental EvaluationProceedings of the VLDB Endowment10.14778/3594512.359452816:8(1992-2004)Online publication date: 22-Jun-2023
  • (2023)PA-LBFInternational Journal of Intelligent Systems10.1155/2023/49707762023Online publication date: 1-Jan-2023
  • (2023)A Learned Cuckoo Filter for Approximate Membership Queries over Variable-sized Sliding Windows on Data StreamsProceedings of the ACM on Management of Data10.1145/36267581:4(1-26)Online publication date: 12-Dec-2023
  • (2023)ChainedFilter: Combining Membership Filters by Chain RuleProceedings of the ACM on Management of Data10.1145/36267211:4(1-27)Online publication date: 12-Dec-2023
  • (2023)TreeSensing: Linearly Compressing Sketches with FlexibilityProceedings of the ACM on Management of Data10.1145/35889101:1(1-28)Online publication date: 30-May-2023
  • (2023)Learned Bloom Filter for Multi-key Membership TestingDatabase Systems for Advanced Applications10.1007/978-3-031-30637-2_5(62-79)Online publication date: 17-Apr-2023
  • (2022)New wine in an old bottleProceedings of the VLDB Endowment10.14778/3538598.353861315:9(1924-1936)Online publication date: 27-Jul-2022
  • (2022)HAP: An Efficient Hamming Space Index Based on Augmented Pigeonhole PrincipleProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3517880(917-930)Online publication date: 10-Jun-2022
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media