
Memento Filter: A Fast, Dynamic, and Robust Range Filter

Published: 20 December 2024

Abstract

Range filters are probabilistic data structures that answer approximate range emptiness queries. They help systems avoid processing empty range queries and have use cases in many application domains, such as key-value stores and social web analytics. However, current range filters do not support dynamically changing and growing datasets. Moreover, several of these designs exhibit impractically high false positive rates under correlated workloads, which are common in practice. These impediments restrict the applicability of range filters across a wide range of use cases.
We introduce Memento filter, the first range filter to simultaneously offer dynamicity, fast operations, and a robust false positive rate for any workload. Memento filter partitions the key universe and clusters its keys according to this partitioning. For each cluster, it stores a fingerprint and a list of key suffixes contiguously. The encoding of these lists makes them amenable to existing dynamic filter structures. Due to the one-to-one mapping from keys to suffixes, Memento filter supports inserts and deletes and can even expand to accommodate a growing dataset.
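The partitioning idea can be illustrated with a small Python sketch. This is a toy under stated assumptions, not the paper's implementation: keys are non-negative integers below 2^128, a partition is identified by a key's high bits, the suffix is its low bits, and a plain dictionary stands in for the underlying dynamic filter. The class name `MementoSketch` and the parameters `suffix_bits` and `fingerprint_bits` are hypothetical names chosen for illustration.

```python
import bisect
import hashlib


class MementoSketch:
    """Toy model: one fingerprint plus a sorted suffix list per partition.

    Assumption: a dict replaces the dynamic filter used in the paper.
    """

    def __init__(self, suffix_bits=8, fingerprint_bits=16):
        self.suffix_bits = suffix_bits
        self.mask = (1 << suffix_bits) - 1
        self.fingerprint_bits = fingerprint_bits
        self.clusters = {}  # fingerprint -> sorted list of key suffixes

    def _fingerprint(self, prefix):
        # Hash the partition id and truncate it; collisions between
        # partitions are what produce false positives in this toy.
        digest = hashlib.blake2b(prefix.to_bytes(16, "big"), digest_size=8).digest()
        return int.from_bytes(digest, "big") % (1 << self.fingerprint_bits)

    def insert(self, key):
        fp = self._fingerprint(key >> self.suffix_bits)
        bisect.insort(self.clusters.setdefault(fp, []), key & self.mask)

    def delete(self, key):
        # Like other deletable filters, only delete keys that were inserted.
        fp = self._fingerprint(key >> self.suffix_bits)
        suffixes = self.clusters.get(fp, [])
        suffix = key & self.mask
        i = bisect.bisect_left(suffixes, suffix)
        if i < len(suffixes) and suffixes[i] == suffix:
            suffixes.pop(i)
            if not suffixes:
                del self.clusters[fp]

    def may_contain_range(self, lo, hi):
        """Return False only if no inserted key can lie in [lo, hi]."""
        first, last = lo >> self.suffix_bits, hi >> self.suffix_bits
        for prefix in range(first, last + 1):
            suffixes = self.clusters.get(self._fingerprint(prefix))
            if not suffixes:
                continue
            # Clip the query to the part that falls inside this partition.
            lo_s = lo & self.mask if prefix == first else 0
            hi_s = hi & self.mask if prefix == last else self.mask
            i = bisect.bisect_left(suffixes, lo_s)
            if i < len(suffixes) and suffixes[i] <= hi_s:
                return True
        return False


f = MementoSketch()
f.insert(1003)
print(f.may_contain_range(1000, 1005))  # the inserted key lies in the range
f.delete(1003)
print(f.may_contain_range(1000, 1005))  # the sketch is empty again
```

Because each key maps to exactly one (partition, suffix) pair, an insert or delete touches a single suffix list, which is the property the abstract credits for dynamicity. The sketch probes every partition a query overlaps, so it is only sensible for short ranges.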
We implement Memento filter on top of a Rank-and-Select Quotient filter and InfiniFilter and demonstrate that it achieves a false positive rate and performance competitive with the state of the art while also providing dynamicity. Due to its dynamicity, Memento filter is the first range filter applicable to B-Trees. We showcase this by integrating Memento filter into WiredTiger, a B-Tree-based key-value store, significantly boosting its performance for mixed workloads.


Information & Contributors

Information

Published In

Proceedings of the ACM on Management of Data, Volume 2, Issue 6
SIGMOD
December 2024
792 pages
EISSN: 2836-6573
DOI: 10.1145/3709598
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 December 2024
Published in PACMMOD Volume 2, Issue 6

Author Tags

  1. data growth
  2. dynamic data structure
  3. range filter
  4. scalability

Qualifiers

  • Research-article
