[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article
Public Access

High-Ratio Compression for Machine-Generated Data

Published: 12 December 2023 Publication History

Abstract

Machine-generated data is rapidly growing and poses challenges for data-intensive systems, especially as the growth of data outpaces the growth of storage space. To cope with the storage issue, compression plays a critical role in storage engines, particularly for data-intensive applications, where a high compression ratio and efficient random access are essential. However, existing compression techniques tend to focus on general-purpose and data block approaches, but overlook the inherent structure of machine-generated data and hence result in low compression ratios or limited lookup efficiency. To address these limitations, we introduce the Pattern-Based Compression (PBC) algorithm, which specifically targets patterns in machine-generated data to achieve Pareto-optimality in most cases. Unlike traditional data block-based methods, PBC compresses data on a per-record basis, facilitating rapid random access. Our experimental evaluation demonstrates that PBC, on average, achieves a compression ratio twice as high as the state-of-the-art techniques while maintaining competitive compression and decompression speeds. We also integrate PBC to a production database system and achieve improvements on both comparison ratio and throughput.

References

[1]
Amazon. 2016. Amazon Ion. https://amazon-ion.github.io/
[2]
Apache. 2018. Apache ORC, High-Performance Columnar Storage for Hadoop. https://orc.apache.org
[3]
Shivnath Babu, Minos N. Garofalakis, and Rajeev Rastogi. 2001. SPARTAN: A Model-Based Semantic Compression System for Massive Data Tables. In Proceedings of the 2001 ACM SIGMOD international conference on Management of data, Santa Barbara, CA, USA, May 21--24, 2001. ACM, 283--294. https://doi.org/10.1145/375663.375693
[4]
Robert Binna, Eva Zangerle, Martin Pichl, Günther Specht, and Viktor Leis. 2018. HOT: A height optimized trie index for main-memory database systems. In Proceedings of the 2018 International Conference on Management of Data. 521--534.
[5]
Carsten Binnig, Stefan Hildenbrand, and Franz F"arber. 2009. Dictionary-based order-preserving string compression for main memory column stores. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of data. 283--296.
[6]
Peter Boncz, Thomas Neumann, and Viktor Leis. 2020. FSST: fast random access string compression. Proceedings of the VLDB Endowment, Vol. 13, 12 (2020), 2649--2661.
[7]
Carsten Bormann and Paul E. Hoffman. 2020. Concise Binary Object Representation (CBOR). RFC 8949. https://doi.org/10.17487/RFC8949
[8]
BSON. 2009. Binary JSON. https://bsonspec.org/.
[9]
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Michael Burrows, Tushar Chandra, Andrew Fikes, and Robert Gruber. 2006. Bigtable: A Distributed Storage System for Structured Data. In 7th Symposium on Operating Systems Design and Implementation (OSDI '06), November 6--8, Seattle, WA, USA. USENIX Association, 205--218.
[10]
Robert Christensen and Feifei Li. 2013. Adaptive log compression for massive log data. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA, June 22--27. ACM, 1283--1284.
[11]
Yann Collet and Murray Kucherawy. 2018. Zstandard Compression and the application/zstd Media Type. Technical Report. Amherst, MA, USA.
[12]
Kennon J Conrad and Paul R Wilson. 2016. Grammatical Ziv-Lempel Compression: Achieving PPM-Class Text Compression Ratios with LZ-Class Decompression Speed. In DCC. 586.
[13]
Sanjoy Dasgupta. 2016. A cost function for similarity-based hierarchical clustering. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing. 118--127.
[14]
Scott Davies and Andrew W. Moore. 1999. Bayesian Networks for Lossless Dataset Compression. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, August 15--18, 1999. ACM, 387--391. https://doi.org/10.1145/312129.312289
[15]
Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified Data Processing on Large Clusters. In 6th Symposium on Operating System Design and Implementation (OSDI 2004), San Francisco, California, USA, December 6--8, 2004. USENIX Association, 137--150.
[16]
Jarek Duda, Khalid Tahboub, Neeraj J. Gadgil, and Edward J. Delp. 2015. The use of asymmetric numeral systems as an accurate replacement for Huffman coding. In 2015 Picture Coding Symposium, PCS 2015, Cairns, Australia, May 31 - June 3. IEEE, 65--69.
[17]
Dominik Durner, Viktor Leis, and Thomas Neumann. 2021. JSON Tiles: Fast Analytics on Semi-Structured Data. In SIGMOD '21: International Conference on Management of Data, Virtual Event, China, June 20--25, 2021, Guoliang Li, Zhanhuai Li, Stratos Idreos, and Divesh Srivastava (Eds.). ACM, 445--458.
[18]
Facebook. 2012. RocksDB, A persistent key-value store. http://rocksdb.org/
[19]
Apache Foundation. 2018. Apache CarbonData. https://carbondata.apache.org/
[20]
Sadayuki Furuhashi. 2012. MessagePack. https://msgpack.org/.
[21]
Yihan Gao and Aditya G. Parameswaran. 2016. Squish: Near-Optimal Compression for Archival of Relational Datasets. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13--17, 2016. ACM, 1575--1584. https://doi.org/10.1145/2939672.2939867
[22]
Bogdan Ghita, Diego G Tomé, and Peter A Boncz. 2020. White-box Compression: Learning and Exploiting Compact Table Representations. In CIDR.
[23]
Google. 2011a. LevelDB. https://github.com/google/leveldb
[24]
Google. 2011b. Snappy compression library. https://github.com/google/snappy
[25]
Shilin He, Jieming Zhu, Pinjia He, and Michael R Lyu. 2020. Loghub: a large collection of system log datasets towards automated log analytics. arXiv preprint arXiv:2008.06448 (2020).
[26]
David A. Huffman. 1952. A Method for the Construction of Minimum-Redundancy Codes. Proceedings of the IRE, Vol. 40, 9 (1952), 1098--1101.
[27]
Amir Ilkhechi, Andrew Crotty, Alex Galakatos, Yicong Mao, Grace Fan, Xiran Shi, and Ugur cC etintemel. 2020. DeepSqueeze: Deep Semantic Compression for Tabular Data. In Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14--19, 2020. ACM, 1733--1746.
[28]
Insanity Industries. 2021. Pareto-optimal compression. https://insanity.industries/post/pareto-optimal-compression/.
[29]
H. V. Jagadish, J. Madar, and Raymond T. Ng. 1999. Semantic Compression and Pattern Extraction with Fascicles. In VLDB'99, Proceedings of 25th International Conference on Very Large Data Bases, September 7--10, 1999, Edinburgh, Scotland, UK. 186--198.
[30]
H. V. Jagadish, Raymond T. Ng, Beng Chin Ooi, and Anthony K. H. Tung. 2004. ItCompress: An Iterative Semantic Compression Algorithm. In Proceedings of the 20th International Conference on Data Engineering, ICDE 2004, 30 March - 2 April 2004, Boston, MA, USA. IEEE Computer Society, 646--657.
[31]
Mark Adler Jean-loup Gailly. 1992. GNU GZIP. https://www.gnu.org/software/gzip/
[32]
Hao Jiang, Chunwei Liu, Qi Jin, John Paparrizos, and Aaron J Elmore. 2020. Pids: attribute decomposition for improved compression and query performance in columnar storage. Proceedings of the VLDB Endowment, Vol. 13, 6 (2020), 925--938.
[33]
Raghu Krishnapuram, Krishna Prasad Chitrapura, and Sachindra Joshi. 2003. Classification of Text Documents Based on Minimum System Entropy. In Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), August 21--24, Washington, DC, USA. 384--391.
[34]
N. Jesper Larsson and Alistair Moffat. 2000. Off-line dictionary-based compression. Proc. IEEE, Vol. 88, 11 (2000), 1722--1732.
[35]
Viktor Leis, Alfons Kemper, and Thomas Neumann. 2013. The adaptive radix tree: ARTful indexing for main-memory databases. In 2013 IEEE 29th International Conference on Data Engineering (ICDE). IEEE, 38--49.
[36]
Hao Lin, Jingyu Zhou, Bin Yao, Minyi Guo, and Jie Li. 2015. Cowic: A Column-Wise Independent Compression for Log Stream Analysis. In 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2015, Shenzhen, China, May 4--7. 21--30.
[37]
Jinyang Liu, Jieming Zhu, Shilin He, Pinjia He, Zibin Zheng, and Michael R Lyu. 2019. Logzip: extracting hidden structures via iterative clustering for log compression. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). 863--873.
[38]
MongoDB Inc. 2009. MongoDB. https://www.mongodb.com/.
[39]
Ingo Mü ller, Cornelius Ratsch, and Franz F"a rber. 2014. Adaptive String Dictionary Compression in In-Memory Column-Store Database Systems. In Proceedings of the 17th International Conference on Extending Database Technology, EDBT 2014, Athens, Greece, March 24--28, Sihem Amer-Yahia, Vassilis Christophides, Anastasios Kementsietsidis, Minos N. Garofalakis, Stratos Idreos, and Vincent Leroy (Eds.). 283--294.
[40]
Craig G. Nevill-Manning and Ian H. Witten. 1997. Linear-time, Incremental Hierarchy Inference for Compression. In Proceedings of the 7th Data Compression Conference (DCC '97), Snowbird, Utah, USA, March 25--27. IEEE Computer Society, 3--11.
[41]
John Paparrizos, Chunwei Liu, Bruno Barbarioli, Johnny Hwang, Ikraduya Edian, Aaron J Elmore, Michael J Franklin, and Sanjay Krishnan. 2021. VergeDB: A Database for IoT Analytics on Edge Devices. In CIDR.
[42]
R. Pasco. 1977. Source coding algorithms for fast data compression (Ph.D. Thesis abstr.). IEEE Transactions on Information Theory, Vol. 23, 4 (1977), 548--548.
[43]
Mark Raasveldt and Hannes Muehleisen. 2019. DuckDB. https://github.com/duckdb/duckdb.
[44]
Vijayshankar Raman and Garret Swart. 2006. How to Wring a Table Dry: Entropy Compression of Relations and Querying of Compressed Relations. In Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, September 12--15, 2006. ACM, 858--869.
[45]
Redis. 2009. Redis. https://redis.io/.
[46]
Roman Dementiev et al. Robert Lasch, Ismail Oukid. 2019. Fast & Strong: The Case of Compressed String Dictionaries on Modern CPUs. In Proceedings of the 15th International Workshop on Data Management on New Hardware, DaMoN 2019, Amsterdam, The Netherlands, 1 July. ACM, 4:1--4:10.
[47]
David Reinsel-John Gantz-John Rydning, J Reinsel, and J Gantz. 2018. The digitization of the world from edge to core. Framingham: International Data Corporation, Vol. 16 (2018).
[48]
Amazon Web Services. 2012. Redshift Database. https://aws.amazon.com/redshift/
[49]
Claude E. Shannon. 1948. A mathematical theory of communication. Bell Syst. Tech. J., Vol. 27, 3 (1948), 379--423.
[50]
Richard Shurtz. 2019. JSON COMPRESSION: ALTERNATIVE BINARY FORMATS AND COMPRESSION METHODS. https://www.lucidchart.com/techblog/2019/12/06/json-compression-alternative-binary-formats-and-compression-methods/
[51]
Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. 2010. The Hadoop Distributed File System. In IEEE 26th Symposium on Mass Storage Systems and Technologies, MSST, Lake Tahoe, Nevada, USA, May 3--7. 1--10.
[52]
P. Skibi'ski. 2023. In-memory benchmark of open-source LZ77/LZSS/LZMA compressors. https://github.com/inikep/lzbench.
[53]
Juan Cruz Viotti and Mital Kinderkhedia. 2022a. A Benchmark of JSON-compatible Binary Serialization Specifications. arxiv: 2201.03051 [cs.SE]
[54]
Juan Cruz Viotti and Mital Kinderkhedia. 2022b. Benchmarking JSON BinPack. arxiv: 2211.12799 [cs.SE]
[55]
Juan Cruz Viotti and Mital Kinderkhedia. 2022c. A Survey of JSON-compatible Binary Serialization Specifications. arxiv: 2201.02089 [cs.DB]
[56]
Deepak Vohra. 2016. Apache parquet. In Practical Hadoop Ecosystem. Springer, 325--335.
[57]
Xiang Wang, Yang Hong, Harry Chang, KyoungSoo Park, Geoff Langdale, Jiayu Hu, and Heqing Zhu. 2019. Hyperscan: A Fast Multi-pattern Regex Matcher for Modern $$CPUs$$. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19). 631--648.
[58]
Junyu Wei, Guangyan Zhang, Yang Wang, Zhiwei Liu, Zhanyang Zhu, Junchao Chen, Tingtao Sun, and Qi Zhou. 2021. On the Feasibility of Parser-based Log Compression in $$Large-Scale$$ Cloud Systems. In 19th USENIX Conference on File and Storage Technologies (FAST 2021). 249--262.
[59]
Till Westmann, Donald Kossmann, Sven Helmer, and Guido Moerkotte. 2000. The Implementation and Performance of Compressed Databases. SIGMOD Rec., Vol. 29, 3 (2000), 55--67.
[60]
Wikipedia. 2023 a. Deflate Algorithm. https://en.wikipedia.org/w/index.php?title=Deflate&oldid=1148886022
[61]
Wikipedia. 2023 b. Lempel--Ziv--Markov chain algorithm -- Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Markov_chain_algorithm
[62]
Wikipedia. 2023 c. Machine-generated Data. https://en.wikipedia.org/wiki/Machine-generated_data
[63]
Wikipedia. 2023. Zstandard. https://en.wikipedia.org/wiki/Zstd.
[64]
Collet Yann. 2011. LZ4. https://lz4.github.io/lz4/
[65]
Huanchen Zhang, Hyeontaek Lim, Viktor Leis, David G Andersen, Michael Kaminsky, Kimberly Keeton, and Andrew Pavlo. 2018. Surf: Practical range query filtering with fast succinct tries. In Proceedings of the 2018 International Conference on Management of Data. 323--336.
[66]
Huanchen Zhang, Xiaoxuan Liu, David G Andersen, Michael Kaminsky, Kimberly Keeton, and Andrew Pavlo. 2020. Order-preserving key compression for in-memory search trees. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 1601--1615.
[67]
Jacob Ziv and Abraham Lempel. 1977. A universal algorithm for sequential data compression. IEEE Transactions on information theory, Vol. 23, 3 (1977), 337--343.
[68]
Jacob Ziv and Abraham Lempel. 1978. Compression of individual sequences via variable-rate coding. IEEE transactions on Information Theory, Vol. 24, 5 (1978), 530--536.

Cited By

View all
  • (2024)ETC: Efficient Training of Temporal Graph Neural Networks over Large-Scale Dynamic GraphsProceedings of the VLDB Endowment10.14778/3641204.364121517:5(1060-1072)Online publication date: 2-May-2024
  • (2024)CTuner: Automatic NoSQL Database Tuning with Causal Reinforcement LearningProceedings of the 15th Asia-Pacific Symposium on Internetware10.1145/3671016.3674809(269-278)Online publication date: 24-Jul-2024
  • (2024)SIMPLE: Efficient Temporal Graph Neural Network Training at Scale with Dynamic Data PlacementProceedings of the ACM on Management of Data10.1145/36549772:3(1-25)Online publication date: 30-May-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the ACM on Management of Data
Proceedings of the ACM on Management of Data  Volume 1, Issue 4
PACMMOD
December 2023
1317 pages
EISSN:2836-6573
DOI:10.1145/3637468
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 December 2023
Published in PACMMOD Volume 1, Issue 4

Permissions

Request permissions for this article.

Author Tags

  1. data compression
  2. machine-generated data

Qualifiers

  • Research-article

Funding Sources

  • JSPS Kakenhi
  • National Key R&D Program of China
  • ARC Future Fellowship
  • NSFC
  • the scholarship of China Scholarship Council
  • CCF-AFSG Research Fund
  • ARC Discovery Project
  • CREST
  • GuangDong Basic and Applied Basic Research Foundation

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)297
  • Downloads (Last 6 weeks)37
Reflects downloads up to 15 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2024)ETC: Efficient Training of Temporal Graph Neural Networks over Large-Scale Dynamic GraphsProceedings of the VLDB Endowment10.14778/3641204.364121517:5(1060-1072)Online publication date: 2-May-2024
  • (2024)CTuner: Automatic NoSQL Database Tuning with Causal Reinforcement LearningProceedings of the 15th Asia-Pacific Symposium on Internetware10.1145/3671016.3674809(269-278)Online publication date: 24-Jul-2024
  • (2024)SIMPLE: Efficient Temporal Graph Neural Network Training at Scale with Dynamic Data PlacementProceedings of the ACM on Management of Data10.1145/36549772:3(1-25)Online publication date: 30-May-2024
  • (2023)A Factor Marginal Effect Analysis Approach and Its Application in E-Commerce Search SystemInternational Journal of Intelligent Systems10.1155/2023/69688542023Online publication date: 11-Oct-2023
  • (2023)F3KM: Federated, Fair, and Fast k-meansProceedings of the ACM on Management of Data10.1145/36267281:4(1-25)Online publication date: 12-Dec-2023

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Full Access

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media