More Web Proxy on the site http://driver.im/

research-article

Public Access

High-Ratio Compression for Machine-Generated Data

Authors:

Xuemin LinAuthors Info & Claims

Proceedings of the ACM on Management of Data, Volume 1, Issue 4

Article No.: 245, Pages 1 - 27

https://doi.org/10.1145/3626732

Published: 12 December 2023 Publication History

Abstract

Machine-generated data is rapidly growing and poses challenges for data-intensive systems, especially as the growth of data outpaces the growth of storage space. To cope with the storage issue, compression plays a critical role in storage engines, particularly for data-intensive applications, where a high compression ratio and efficient random access are essential. However, existing compression techniques tend to focus on general-purpose and data block approaches, but overlook the inherent structure of machine-generated data and hence result in low compression ratios or limited lookup efficiency. To address these limitations, we introduce the Pattern-Based Compression (PBC) algorithm, which specifically targets patterns in machine-generated data to achieve Pareto-optimality in most cases. Unlike traditional data block-based methods, PBC compresses data on a per-record basis, facilitating rapid random access. Our experimental evaluation demonstrates that PBC, on average, achieves a compression ratio twice as high as the state-of-the-art techniques while maintaining competitive compression and decompression speeds. We also integrate PBC to a production database system and achieve improvements on both comparison ratio and throughput.

References

[1]

Amazon. 2016. Amazon Ion. https://amazon-ion.github.io/

[2]

Apache. 2018. Apache ORC, High-Performance Columnar Storage for Hadoop. https://orc.apache.org

[3]

Shivnath Babu, Minos N. Garofalakis, and Rajeev Rastogi. 2001. SPARTAN: A Model-Based Semantic Compression System for Massive Data Tables. In Proceedings of the 2001 ACM SIGMOD international conference on Management of data, Santa Barbara, CA, USA, May 21--24, 2001. ACM, 283--294. https://doi.org/10.1145/375663.375693

Digital Library

[4]

Robert Binna, Eva Zangerle, Martin Pichl, Günther Specht, and Viktor Leis. 2018. HOT: A height optimized trie index for main-memory database systems. In Proceedings of the 2018 International Conference on Management of Data. 521--534.

Digital Library

[5]

Carsten Binnig, Stefan Hildenbrand, and Franz F"arber. 2009. Dictionary-based order-preserving string compression for main memory column stores. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of data. 283--296.

Digital Library

[6]

Peter Boncz, Thomas Neumann, and Viktor Leis. 2020. FSST: fast random access string compression. Proceedings of the VLDB Endowment, Vol. 13, 12 (2020), 2649--2661.

Digital Library

[7]

Carsten Bormann and Paul E. Hoffman. 2020. Concise Binary Object Representation (CBOR). RFC 8949. https://doi.org/10.17487/RFC8949

Digital Library

[8]

BSON. 2009. Binary JSON. https://bsonspec.org/.

[9]

Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Michael Burrows, Tushar Chandra, Andrew Fikes, and Robert Gruber. 2006. Bigtable: A Distributed Storage System for Structured Data. In 7th Symposium on Operating Systems Design and Implementation (OSDI '06), November 6--8, Seattle, WA, USA. USENIX Association, 205--218.

[10]

Robert Christensen and Feifei Li. 2013. Adaptive log compression for massive log data. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA, June 22--27. ACM, 1283--1284.

Digital Library

[11]

Yann Collet and Murray Kucherawy. 2018. Zstandard Compression and the application/zstd Media Type. Technical Report. Amherst, MA, USA.

[12]

Kennon J Conrad and Paul R Wilson. 2016. Grammatical Ziv-Lempel Compression: Achieving PPM-Class Text Compression Ratios with LZ-Class Decompression Speed. In DCC. 586.

[13]

Sanjoy Dasgupta. 2016. A cost function for similarity-based hierarchical clustering. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing. 118--127.

Digital Library

[14]

Scott Davies and Andrew W. Moore. 1999. Bayesian Networks for Lossless Dataset Compression. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, August 15--18, 1999. ACM, 387--391. https://doi.org/10.1145/312129.312289

Digital Library

[15]

Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified Data Processing on Large Clusters. In 6th Symposium on Operating System Design and Implementation (OSDI 2004), San Francisco, California, USA, December 6--8, 2004. USENIX Association, 137--150.

[16]

Jarek Duda, Khalid Tahboub, Neeraj J. Gadgil, and Edward J. Delp. 2015. The use of asymmetric numeral systems as an accurate replacement for Huffman coding. In 2015 Picture Coding Symposium, PCS 2015, Cairns, Australia, May 31 - June 3. IEEE, 65--69.

[17]

Dominik Durner, Viktor Leis, and Thomas Neumann. 2021. JSON Tiles: Fast Analytics on Semi-Structured Data. In SIGMOD '21: International Conference on Management of Data, Virtual Event, China, June 20--25, 2021, Guoliang Li, Zhanhuai Li, Stratos Idreos, and Divesh Srivastava (Eds.). ACM, 445--458.

[18]

Facebook. 2012. RocksDB, A persistent key-value store. http://rocksdb.org/

[19]

Apache Foundation. 2018. Apache CarbonData. https://carbondata.apache.org/

[20]

Sadayuki Furuhashi. 2012. MessagePack. https://msgpack.org/.

[21]

Yihan Gao and Aditya G. Parameswaran. 2016. Squish: Near-Optimal Compression for Archival of Relational Datasets. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13--17, 2016. ACM, 1575--1584. https://doi.org/10.1145/2939672.2939867

Digital Library

[22]

Bogdan Ghita, Diego G Tomé, and Peter A Boncz. 2020. White-box Compression: Learning and Exploiting Compact Table Representations. In CIDR.

[23]

Google. 2011a. LevelDB. https://github.com/google/leveldb

[24]

Google. 2011b. Snappy compression library. https://github.com/google/snappy

[25]

Shilin He, Jieming Zhu, Pinjia He, and Michael R Lyu. 2020. Loghub: a large collection of system log datasets towards automated log analytics. arXiv preprint arXiv:2008.06448 (2020).

[26]

David A. Huffman. 1952. A Method for the Construction of Minimum-Redundancy Codes. Proceedings of the IRE, Vol. 40, 9 (1952), 1098--1101.

[27]

Amir Ilkhechi, Andrew Crotty, Alex Galakatos, Yicong Mao, Grace Fan, Xiran Shi, and Ugur cC etintemel. 2020. DeepSqueeze: Deep Semantic Compression for Tabular Data. In Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14--19, 2020. ACM, 1733--1746.

Digital Library

[28]

Insanity Industries. 2021. Pareto-optimal compression. https://insanity.industries/post/pareto-optimal-compression/.

[29]

H. V. Jagadish, J. Madar, and Raymond T. Ng. 1999. Semantic Compression and Pattern Extraction with Fascicles. In VLDB'99, Proceedings of 25th International Conference on Very Large Data Bases, September 7--10, 1999, Edinburgh, Scotland, UK. 186--198.

[30]

H. V. Jagadish, Raymond T. Ng, Beng Chin Ooi, and Anthony K. H. Tung. 2004. ItCompress: An Iterative Semantic Compression Algorithm. In Proceedings of the 20th International Conference on Data Engineering, ICDE 2004, 30 March - 2 April 2004, Boston, MA, USA. IEEE Computer Society, 646--657.

[31]

Mark Adler Jean-loup Gailly. 1992. GNU GZIP. https://www.gnu.org/software/gzip/

[32]

Hao Jiang, Chunwei Liu, Qi Jin, John Paparrizos, and Aaron J Elmore. 2020. Pids: attribute decomposition for improved compression and query performance in columnar storage. Proceedings of the VLDB Endowment, Vol. 13, 6 (2020), 925--938.

Digital Library

[33]

Raghu Krishnapuram, Krishna Prasad Chitrapura, and Sachindra Joshi. 2003. Classification of Text Documents Based on Minimum System Entropy. In Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), August 21--24, Washington, DC, USA. 384--391.

[34]

N. Jesper Larsson and Alistair Moffat. 2000. Off-line dictionary-based compression. Proc. IEEE, Vol. 88, 11 (2000), 1722--1732.

Digital Library

[35]

Viktor Leis, Alfons Kemper, and Thomas Neumann. 2013. The adaptive radix tree: ARTful indexing for main-memory databases. In 2013 IEEE 29th International Conference on Data Engineering (ICDE). IEEE, 38--49.

Digital Library

[36]

Hao Lin, Jingyu Zhou, Bin Yao, Minyi Guo, and Jie Li. 2015. Cowic: A Column-Wise Independent Compression for Log Stream Analysis. In 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2015, Shenzhen, China, May 4--7. 21--30.

[37]

Jinyang Liu, Jieming Zhu, Shilin He, Pinjia He, Zibin Zheng, and Michael R Lyu. 2019. Logzip: extracting hidden structures via iterative clustering for log compression. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). 863--873.

Digital Library

[38]

MongoDB Inc. 2009. MongoDB. https://www.mongodb.com/.

[39]

Ingo Mü ller, Cornelius Ratsch, and Franz F"a rber. 2014. Adaptive String Dictionary Compression in In-Memory Column-Store Database Systems. In Proceedings of the 17th International Conference on Extending Database Technology, EDBT 2014, Athens, Greece, March 24--28, Sihem Amer-Yahia, Vassilis Christophides, Anastasios Kementsietsidis, Minos N. Garofalakis, Stratos Idreos, and Vincent Leroy (Eds.). 283--294.

[40]

Craig G. Nevill-Manning and Ian H. Witten. 1997. Linear-time, Incremental Hierarchy Inference for Compression. In Proceedings of the 7th Data Compression Conference (DCC '97), Snowbird, Utah, USA, March 25--27. IEEE Computer Society, 3--11.

[41]

John Paparrizos, Chunwei Liu, Bruno Barbarioli, Johnny Hwang, Ikraduya Edian, Aaron J Elmore, Michael J Franklin, and Sanjay Krishnan. 2021. VergeDB: A Database for IoT Analytics on Edge Devices. In CIDR.

[42]

R. Pasco. 1977. Source coding algorithms for fast data compression (Ph.D. Thesis abstr.). IEEE Transactions on Information Theory, Vol. 23, 4 (1977), 548--548.

Digital Library

[43]

Mark Raasveldt and Hannes Muehleisen. 2019. DuckDB. https://github.com/duckdb/duckdb.

[44]

Vijayshankar Raman and Garret Swart. 2006. How to Wring a Table Dry: Entropy Compression of Relations and Querying of Compressed Relations. In Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, September 12--15, 2006. ACM, 858--869.

[45]

Redis. 2009. Redis. https://redis.io/.

[46]

Roman Dementiev et al. Robert Lasch, Ismail Oukid. 2019. Fast & Strong: The Case of Compressed String Dictionaries on Modern CPUs. In Proceedings of the 15th International Workshop on Data Management on New Hardware, DaMoN 2019, Amsterdam, The Netherlands, 1 July. ACM, 4:1--4:10.

[47]

David Reinsel-John Gantz-John Rydning, J Reinsel, and J Gantz. 2018. The digitization of the world from edge to core. Framingham: International Data Corporation, Vol. 16 (2018).

[48]

Amazon Web Services. 2012. Redshift Database. https://aws.amazon.com/redshift/

[49]

Claude E. Shannon. 1948. A mathematical theory of communication. Bell Syst. Tech. J., Vol. 27, 3 (1948), 379--423.

[50]

Richard Shurtz. 2019. JSON COMPRESSION: ALTERNATIVE BINARY FORMATS AND COMPRESSION METHODS. https://www.lucidchart.com/techblog/2019/12/06/json-compression-alternative-binary-formats-and-compression-methods/

[51]

Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. 2010. The Hadoop Distributed File System. In IEEE 26th Symposium on Mass Storage Systems and Technologies, MSST, Lake Tahoe, Nevada, USA, May 3--7. 1--10.

[52]

P. Skibi'ski. 2023. In-memory benchmark of open-source LZ77/LZSS/LZMA compressors. https://github.com/inikep/lzbench.

[53]

Juan Cruz Viotti and Mital Kinderkhedia. 2022a. A Benchmark of JSON-compatible Binary Serialization Specifications. arxiv: 2201.03051 [cs.SE]

[54]

Juan Cruz Viotti and Mital Kinderkhedia. 2022b. Benchmarking JSON BinPack. arxiv: 2211.12799 [cs.SE]

[55]

Juan Cruz Viotti and Mital Kinderkhedia. 2022c. A Survey of JSON-compatible Binary Serialization Specifications. arxiv: 2201.02089 [cs.DB]

[56]

Deepak Vohra. 2016. Apache parquet. In Practical Hadoop Ecosystem. Springer, 325--335.

[57]

Xiang Wang, Yang Hong, Harry Chang, KyoungSoo Park, Geoff Langdale, Jiayu Hu, and Heqing Zhu. 2019. Hyperscan: A Fast Multi-pattern Regex Matcher for Modern $$CPUs$$. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19). 631--648.

[58]

Junyu Wei, Guangyan Zhang, Yang Wang, Zhiwei Liu, Zhanyang Zhu, Junchao Chen, Tingtao Sun, and Qi Zhou. 2021. On the Feasibility of Parser-based Log Compression in $$Large-Scale$$ Cloud Systems. In 19th USENIX Conference on File and Storage Technologies (FAST 2021). 249--262.

[59]

Till Westmann, Donald Kossmann, Sven Helmer, and Guido Moerkotte. 2000. The Implementation and Performance of Compressed Databases. SIGMOD Rec., Vol. 29, 3 (2000), 55--67.

Digital Library

[60]

Wikipedia. 2023 a. Deflate Algorithm. https://en.wikipedia.org/w/index.php?title=Deflate&oldid=1148886022

[61]

Wikipedia. 2023 b. Lempel--Ziv--Markov chain algorithm -- Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Markov_chain_algorithm

[62]

Wikipedia. 2023 c. Machine-generated Data. https://en.wikipedia.org/wiki/Machine-generated_data

[63]

Wikipedia. 2023. Zstandard. https://en.wikipedia.org/wiki/Zstd.

[64]

Collet Yann. 2011. LZ4. https://lz4.github.io/lz4/

[65]

Huanchen Zhang, Hyeontaek Lim, Viktor Leis, David G Andersen, Michael Kaminsky, Kimberly Keeton, and Andrew Pavlo. 2018. Surf: Practical range query filtering with fast succinct tries. In Proceedings of the 2018 International Conference on Management of Data. 323--336.

Digital Library

[66]

Huanchen Zhang, Xiaoxuan Liu, David G Andersen, Michael Kaminsky, Kimberly Keeton, and Andrew Pavlo. 2020. Order-preserving key compression for in-memory search trees. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 1601--1615.

Digital Library

[67]

Jacob Ziv and Abraham Lempel. 1977. A universal algorithm for sequential data compression. IEEE Transactions on information theory, Vol. 23, 3 (1977), 337--343.

Digital Library

[68]

Jacob Ziv and Abraham Lempel. 1978. Compression of individual sequences via variable-rate coding. IEEE transactions on Information Theory, Vol. 24, 5 (1978), 530--536.

Digital Library

Cited By

Gao SLi YShen YShao YChen L(2024)ETC: Efficient Training of Temporal Graph Neural Networks over Large-Scale Dynamic GraphsProceedings of the VLDB Endowment10.14778/3641204.364121517:5(1060-1072)Online publication date: 2-May-2024
https://dl.acm.org/doi/10.14778/3641204.3641215
Mai GHe ZYu GChen ZChen P(2024)CTuner: Automatic NoSQL Database Tuning with Causal Reinforcement LearningProceedings of the 15th Asia-Pacific Symposium on Internetware10.1145/3671016.3674809(269-278)Online publication date: 24-Jul-2024
https://dl.acm.org/doi/10.1145/3671016.3674809
Gao SLi YZhang XShen YShao YChen L(2024)SIMPLE: Efficient Temporal Graph Neural Network Training at Scale with Dynamic Data PlacementProceedings of the ACM on Management of Data10.1145/36549772:3(1-25)Online publication date: 30-May-2024
https://dl.acm.org/doi/10.1145/3654977
Show More Cited By

Recommendations

Second compression for pixelated images under edge-based compression algorithms: JPEG-LS as an example

This paper details the examination of a particular case of data compression, where the compression algorithm removes the redundancy from data, which occurs when edge-based compression algorithms compress (previously compressed) pixelated images. The newly ...
Vertex Data Compression through Vector Quantization

Rendering geometrically detailed 3D models requires the transfer and processing of large amounts of triangle and vertex geometry data. Compressing the geometry bitstream can reduce bandwidth requirements and alleviate transmission bottlenecks. In this ...
Lossless compression of already compressed textures
HPG '11: Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics

Texture compression helps rendering by reducing the footprint in graphics memory, thus allowing for more textures, and by lowering the number of memory accesses between the graphics processor and memory, increasing performance and lowering power ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the ACM on Management of Data

Proceedings of the ACM on Management of Data Volume 1, Issue 4

PACMMOD

December 2023

1317 pages

EISSN:2836-6573

DOI:10.1145/3637468

Editor:
Divyakant Agrawal
UC Santa Barbara, United States

Issue’s Table of Contents

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 December 2023

Published in PACMMOD Volume 1, Issue 4

Permissions

Request permissions for this article.

Request Permissions

Author Tags

Qualifiers

Research-article

Funding Sources

JSPS Kakenhi
National Key R&D Program of China
ARC Future Fellowship
NSFC
the scholarship of China Scholarship Council
CCF-AFSG Research Fund
ARC Discovery Project
CREST
GuangDong Basic and Applied Basic Research Foundation

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

5
Total Citations
View Citations
310
Total Downloads

Downloads (Last 12 months)297
Downloads (Last 6 weeks)37

Reflects downloads up to 15 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

Gao SLi YShen YShao YChen L(2024)ETC: Efficient Training of Temporal Graph Neural Networks over Large-Scale Dynamic GraphsProceedings of the VLDB Endowment10.14778/3641204.364121517:5(1060-1072)Online publication date: 2-May-2024
https://dl.acm.org/doi/10.14778/3641204.3641215
Mai GHe ZYu GChen ZChen P(2024)CTuner: Automatic NoSQL Database Tuning with Causal Reinforcement LearningProceedings of the 15th Asia-Pacific Symposium on Internetware10.1145/3671016.3674809(269-278)Online publication date: 24-Jul-2024
https://dl.acm.org/doi/10.1145/3671016.3674809
Gao SLi YZhang XShen YShao YChen L(2024)SIMPLE: Efficient Temporal Graph Neural Network Training at Scale with Dynamic Data PlacementProceedings of the ACM on Management of Data10.1145/36549772:3(1-25)Online publication date: 30-May-2024
https://dl.acm.org/doi/10.1145/3654977
Wang YSachurengui SZhang DAziguli Wulamu Bao H(2023)A Factor Marginal Effect Analysis Approach and Its Application in E-Commerce Search SystemInternational Journal of Intelligent Systems10.1155/2023/69688542023Online publication date: 11-Oct-2023
https://dl.acm.org/doi/10.1155/2023/6968854
Zhu SXu QZeng JWang SSun YYang ZYang CPeng Z(2023)F3KM: Federated, Fair, and Fast k-meansProceedings of the ACM on Management of Data10.1145/36267281:4(1-25)Online publication date: 12-Dec-2023
https://dl.acm.org/doi/10.1145/3626728

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Media

Figures

Other

Tables

View Issue’s Table of Contents