research-article

BtrBlocks: Efficient Columnar Compression for Data Lakes

Published: 20 June 2023

Abstract

Analytics is moving to the cloud and data is moving into data lakes. These reside on object storage services like S3 and enable seamless data sharing and system interoperability. To support this, many systems build on open storage formats like Apache Parquet. However, these formats are not optimized for remotely accessed data lakes and today's high-throughput networks. Inefficient decompression makes scans CPU-bound and thus increases query time and cost. In this work, we present BtrBlocks, an open columnar storage format designed for data lakes. BtrBlocks uses a set of lightweight encoding schemes, achieving fast and efficient decompression and high compression ratios.
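
To make the phrase "lightweight encoding schemes" concrete, here is a minimal sketch of run-length encoding (RLE), a scheme commonly used in columnar systems. The abstract does not name the schemes BtrBlocks uses, so RLE is an illustrative choice here, and the function names `rle_encode` and `rle_decode` are hypothetical and not taken from the BtrBlocks code base.

```python
from typing import List, Tuple


def rle_encode(values: List[int]) -> List[Tuple[int, int]]:
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs: List[Tuple[int, int]] = []
    for v in values:
        if runs and runs[-1][0] == v:
            # Same value as the current run: extend it by one.
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            # A different value starts a new run.
            runs.append((v, 1))
    return runs


def rle_decode(runs: List[Tuple[int, int]]) -> List[int]:
    """Expand (value, run_length) pairs back into the original column."""
    out: List[int] = []
    for value, count in runs:
        out.extend([value] * count)
    return out


# Illustrative column with long runs, where RLE tends to pay off.
column = [7, 7, 7, 7, 3, 3, 9]
runs = rle_encode(column)
assert rle_decode(runs) == column
```

Decoding is a single pass that copies each value a known number of times, which is the kind of cheap work that keeps decompression fast. On columns with few repeats the encoded form can be larger than the input, which is why a format would pick a scheme per column or block rather than apply one everywhere.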

Supplemental Material

MP4 File
BtrBlocks: Efficient Columnar Compression for Data Lakes - Presentation video for SIGMOD 2023

Published In

Proceedings of the ACM on Management of Data, Volume 1, Issue 2
PACMMOD
June 2023
2310 pages
EISSN: 2836-6573
DOI: 10.1145/3605748
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 June 2023
Published in PACMMOD Volume 1, Issue 2

Author Tags

  1. columnar storage
  2. compression
  3. data lake
  4. query processing

Qualifiers

  • Research-article

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 370
  • Downloads (Last 6 weeks): 36
Reflects downloads up to 19 Dec 2024

Cited By

  • (2024) Two Birds With One Stone: Designing a Hybrid Cloud Storage Engine for HTAP. Proceedings of the VLDB Endowment 17:11 (3290-3303). DOI: 10.14778/3681954.3682001. Online publication date: 30-Aug-2024
  • (2024) Blitzcrank: Fast Semantic Compression for In-Memory Online Transaction Processing. Proceedings of the VLDB Endowment 17:10 (2528-2540). DOI: 10.14778/3675034.3675044. Online publication date: 6-Aug-2024
  • (2024) Everything You Always Wanted to Know About Storage Compressibility of Pre-Trained ML Models but Were Afraid to Ask. Proceedings of the VLDB Endowment 17:8 (2036-2049). DOI: 10.14778/3659437.3659456. Online publication date: 31-May-2024
  • (2024) High-Performance Query Processing with NVMe Arrays: Spilling without Killing Performance. Proceedings of the ACM on Management of Data 2:6 (1-27). DOI: 10.1145/3698813. Online publication date: 20-Dec-2024
  • (2024) NULLS!: Revisiting Null Representation in Modern Columnar Formats. Proceedings of the 20th International Workshop on Data Management on New Hardware (1-10). DOI: 10.1145/3662010.3663452. Online publication date: 10-Jun-2024
  • (2024) Accelerating GPU Data Processing using FastLanes Compression. Proceedings of the 20th International Workshop on Data Management on New Hardware (1-11). DOI: 10.1145/3662010.3663450. Online publication date: 10-Jun-2024
  • (2024) On Tuning Raft for IoT Workload in Apache IoTDB. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (5307-5319). DOI: 10.1109/ICDE60146.2024.00399. Online publication date: 13-May-2024
  • (2024) AdaEdge: A Dynamic Compression Selection Framework for Resource Constrained Devices. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (1506-1519). DOI: 10.1109/ICDE60146.2024.00124. Online publication date: 13-May-2024
  • (2024) Towards Sustainability of AI – Identifying Design Patterns for Sustainable Machine Learning Development. Information Systems Frontiers. DOI: 10.1007/s10796-024-10526-6. Online publication date: 16-Sep-2024
  • (2023) An Empirical Evaluation of Columnar Storage Formats. Proceedings of the VLDB Endowment 17:2 (148-161). DOI: 10.14778/3626292.3626298. Online publication date: 1-Oct-2023
