[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
article

A meta-heuristic approach for RLE compression in a column store table

Published: 01 June 2019 Publication History

Abstract

Structured data are one of the most important segments in the realm of big data analysis that have undeniably prevailed over the years. In recent years, column-oriented design has become a frequent practice to organize structured data in analytical systems. The storage systems that organize data in a column-wise manner are often referred to as column stores. Column-oriented databases or warehouses and spreadsheet applications in particular have recently become a popular and a convenient tool for column-wise data processing and analysis. At the same time, the volume of data is increasing at an extreme rate, which despite the decrease in pricing of storage systems stresses the importance of data compression. Apart from resounding performance gain in large read-mostly data repositories, column-oriented data are easily compressible, which enables efficient query processing and pushes the peak of the overall performance. Many compression algorithms, including the Run Length Encoding (RLE), exploit the similarity among the column values, where repetitions of the same value form columnar runs that can be found in most database systems. This paper presents a comprehensive analysis and comparison of common and well-known meta-heuristics for columnar run minimization, based on standard implementations by using real datasets. We have analyzed genetic algorithms, simulated annealing, cuckoo search, particle swarm optimization, Tabu search, and the bat algorithm. The first three being the most efficient have undergone sensitivity analysis on synthetic datasets to fine-tune their parameters. These meta-heuristics were then tested on real datasets. The experiments show that the algorithms perform consistently well on both synthetic and real data, demonstrating higher run-reduction efficiency compared to existing approaches. Moreover, the results show that the applied meta-heuristics exhibit quick convergence to nearly optimal solutions, accompanied by an insignificant overhead. In addition, we provide comprehensive implementations of the heuristic RLE compression approaches based on common optimization methods. They are effective at physical compression to an extent that makes them suitable as everyday appliances. The experiments on real datasets also indicate that our implementations are able overcome the expected on-disk file compression ratio, in most cases being better than the respective reduction in runs.

References

[1]
Choosing sort order: best practices--vertica online documentation (2015). http://web.archive.org/web/20080207010024/; http://www.808multimedia.com/winnt/kernel.htm. Accessed 27 March 2016.
[2]
Abadi DJ, Madden SR, Ferreira MC (2006) Integrating compression and execution in column-oriented database systems. In: Proceedings of the ACM SIGMOD international conference on management of data, Chicago, USA, pp 671-682.
[3]
Abadi D, Boncz P, Harizopoulos S, Idreos S, Madden S (2013) The design and implementation of modern column-oriented database systems. Found Trends Databases 5(3):197-280.
[4]
Arsov N, Siljanoska Simons M, Jovanovski J (2014a) Genetic AlgorithmRLE_Cpp: genetic algorithm compression of tabular data. https://github.com/ninoarsov/GeneticAlgorithmRLE_Cpp. Accessed 20 Nov 2014.
[5]
Arsov N, Siljanoska Simons M, Jovanovski J (2014b) GeneticAlgorithmRLE_ LOExtension git hub repository: a libre office calc extension implementing a genetic algorithm for RLE compression of CSV data. https://github.com/ninoarsov/GeneticAlgorithmRLE_LOExtension. Accessed 10 June 2014.
[6]
Arulraj J, Pavlo A, Menon P (2016) Bridging the archipelago between row-stores and column-stores for hybrid workloads. In: Proceedings of the 2016 international conference on management of data, ACM, pp 583-598.
[7]
Bäck T, Fogel DB, Michalewicz Z (2000) Evolutionary computation 1: basic algorithms and operators. Institute of Physics Publishing, Bristol.
[8]
Chang CC, Chen YH, Lin CC (2009) A data embedding scheme for color images based on genetic algorithm and absolute moment block truncation coding. Soft Comput 13(4):321-331.
[9]
Copeland GP, Khoshafian S (1985) A decomposition storage model. In: Proceedings of the ACM SIGMOD conference on management of data, Austin, Texas, pp 268-279.
[10]
Dean J (2009) Challenges in building large-scale information retrieval systems. In: Keynote of the 2nd ACMinternational conference on web search and data mining (WSDM).
[11]
Deng W, Zhao H, Zou L, Li G, Yang X, Wu D (2017) A novel collaborative optimization algorithm in solving complex optimization problems. Soft Comput 21(15):4387-4398.
[12]
Dorigo M, Birattari M, Stutzle T (2006) Ant colony optimization. IEEE Comput Intell Mag 1(4):28-39.
[13]
Eavis T, Cueva D (2007) A hilbert space compression architecture for data warehouse environments. In: Song I, Eder J, Nguyen T (eds) Data warehousing and knowledge discovery, vol 4654. Lecture notes in computer science. Springer-Verlag, Berlin, Heidelberg, pp 1-12.
[14]
Edwards G (2010) Nova Scotia GenWeb Project, Cumberland County GenWeb. http://www.rootsweb.ancestry.com/~nscumber/sources.html. Accessed 10 June 2014.
[15]
Fenwick PM (1994) A new data structure for cumulative frequency tables. Softw Pract Exp 24(3):327-336.
[16]
Fisher RA, Yates F et al (1938) Statistical tables for biological, agricultural and medical research. Oliver and Boyd, Edinburgh.
[17]
Geem ZW, Kim JH, Loganathan GV (2001) A new heuristic optimization algorithm: harmony search. Simulation 76(2):60-68.
[18]
Glover F, Laguna M (2013) Tabu search. In: Pardalos PM, DuD-Z, Graham RL (eds) Handbook of combinatorial optimization. Springer, New York, 3261-3362.
[19]
Hoffer J, Severance D (1975) The use of cluster analysis in physical data base design. In: Proceedings of VLDB, Framingham, USA, pp 69-86.
[20]
Holsheimer M, Kersten ML (1994) Architectural support for data mining. In: Fayyad U, Uthurusamy R (eds) Knowledge discovery in databases: papers from the 1994 AAAI Workshop, Technical Report WS-94-03. AAAI Press, Seattle, Washington, USA, pp 217-228.
[21]
Houkjær L, Torp K, Wind R (2006) Simple and realistic data generation. In: Proceedings of the 32nd international conference on very large data bases, Seoul, Korea, pp 1243-1246.
[22]
Iyoda EM, Shibata T, Nobuhara H, Pedrycz W, Hirota K (2007) Image compression and reconstruction using pi t-sigma neural networks. Soft Comput 11(1):53-61.
[23]
Ji Z, Zhou J, Zhu Z, Chen S (2013) Self-configuration single particle optimizer for DNA sequence compression. Soft Comput 17(4):675-682.
[24]
Jovanovski J, Siljanoska M, Velinov G (2013) A genetic algorithm approach for minimizing the number of columnar runs in a column store table. In: 11th international conference on adaptive and natural computing algorithms, ICANNGA 2013, Lausanne, Switzerland, April 4-6, 2013. Proceedings, pp 485-494.
[25]
Kennedy J, Eberhart R (1995) Particle swarm optimization. In: IEEE international conference on neural networks, 1995, Proceedings, vol 4, pp 1942-1948.
[26]
Khachaturyan A, Semenovskaya S, Vainstein B (1979) A statistical-thermodynamic approach to determination of structure amplitude phases. Sov Phys Crystallogr 24:519-524.
[27]
Kirkpatrick S, Vecchi MP et al (1983) Optimization by simmulated annealing. Science 220(4598):671-680.
[28]
Knuth DE (1969) The art of computer programming, vot; ime 2: seminumerical algorithms. Addison-Wesley, Reading, Massachusetts.
[29]
Lamb A, Fuller M, Varadarajan R, Tran N, Vandiver B, Doshi L, Bear C (2012) The vertica analytic database: C-store 7 years later. Proc VLDB Endow 5(12):1790-1801.
[30]
Larson PÅ, Hanson EN, Price SL (2012) Columnar storage in SQL server 2012. IEEE Data Eng Bull 35(1):15-20.
[31]
Larson PÅ, Clinciu C, Fraser C, Hanson EN, Mokhtar M, Nowakiewicz M, Papadimos V, Price SL, Rangarajan S, Rusanu R et al (2013) Enhancements to sql server column stores. In: Proceedings of the 2013 ACMSIGMOD international conference on management of data, ACM, pp 1159-1168.
[32]
Lee KY, El-Sharkawi MA (2008) Modern heuristic optimization techniques: theory and applications to power systems, 3rd edn. Wiley-IEEE Press, Hoboken.
[33]
Lemire D, Kaser O (2011) Reordering columns for smaller indexes. Int J Inf Sci 181(12):2550-2570.
[34]
Lemire D, Kaser O, Gutarra E (2012) Reordering rows for better compression: beyond the lexicographic order. ACM Trans DB Syst 37(3):2550-2570.
[35]
Lichman M (2013) UCI machine learning repository. http://archive.ics.uci.edu/ml. Accessed 12 Feb 2018.
[36]
Liu J, Qiao S (2015) A image segmentation algorithm based on differential evolution particle swarm optimization fuzzy c-means clustering. Comput Sci Inf Syst 12(2):873-893.
[37]
López LFM, Blas NG, Albert AA (2017) Multidimensional knapsack problem optimization using a binary particle swarm model with genetic operations. Soft Comput 1-16.
[38]
Luo Q, Ma M, Zhou Y (2016) A novel animal migration algorithm for global numerical optimization. Comput Sci Inf Syst 13(1):259- 285.
[39]
Ma H, Ye S, Simon D, Fei M (2017) Conceptual and numerical comparisons of swarm intelligence optimization algorithms. Soft Comput 21(11):3081-3100.
[40]
Marz N, Warren J (2015) Big data: principles and best practices of scalable realtime data systems. Manning Publications Co, Greenwich.
[41]
Mati? D, Kratica J, Filipovi? V (2017) Variable neighborhood search for solving bandwidth coloring problem. Comput Sci Inf Syst 14(2):309-327.
[42]
Mirjalili S, Lewis A (2016) The whale optimization algorithm. Adv Eng Softw 95:51-67.
[43]
Mitchell M (1998) An introduction to genetic algorithms, 3rd edn. The MIT Press, Cambridge.
[44]
O'Neil P, Quass D (1997) Improved query performance with variant indexes. In: Proceedings of the ACM SIGMOD international conference on management of data, Tucson, USA, pp 38-49.
[45]
Osaba E, Yang XS, Diaz F, Lopez-Garcia P, Carballedo R (2016) An improved discrete bat algorithm for symmetric and asymmetric traveling salesman problems. Eng Appl Artif Intell 48:59-71.
[46]
Ouaarab A, Ahiod B, Yang XS (2014) Discrete cuckoo search algorithm for the travelling salesman problem. Neural Comput Appl 24(7- 8):1659-1669.
[47]
Ouaarab A, Ahiod B, Yang XS (2015) Random-key cuckoo search for the travelling salesman problem. Soft Comput 19(4):1099-1106.
[48]
Sanchez IAL, Vargas JM, Santos CA, Mendoza MG, Moctezuma CJM (2017) Solving binary cutting stock with matheuristics using particle swarm optimization and simulated annealing. Soft Comput.
[49]
Sivanandam SN, Deepa SN (2008) Introduction to genetic algorithms. Springer, Berlin.
[50]
Tan L, Sun J, Tong X (2014) A hybrid particle swarm optimization based memetic algorithm for DNA sequence compression. Soft Comput 19(5):1255-1268.
[51]
Tang D, Liu T, Lee R, Liu H, Li W (2015) A case study of optimizing big data analytical stacks using structured data shuffling. In: 2015 IEEE international conference on cluster computing (CLUSTER), IEEE, pp 70-73.
[52]
Wang GG, Deb S, Gandomi AH, Zhang Z, Alavi AH (2016) Chaotic cuckoo search. Soft Comput 20(9):3349-3362.
[53]
Yang XS (2010) A new metaheuristic bat-inspired algorithm. González JR, Pelta DA, Cruz C, Terrazas G, Krasnogor N (eds) Nature Inspired Cooperative Strategies for Optimization (NICSO 2010). Studies in Computational Intelligence, vol 284. Springer, Berlin, Heidelberg, pp 65-74.
[54]
Yang XS, Deb S (2009) Cuckoo search via lévy flights. In: World congress on nature and biologically inspired computing, 2009. NaBIC 2009. IEEE, pp 210-214.
[55]
Yimin C, Yixiao W, Qibin S, Longxiang S (1999) Digital image compression using a genetic algorithm. Real Time Imaging 5(6):379- 383.

Cited By

View all
  • (2024)Partition, Don't Sort! Compression Boosters for Cloud Data Ingestion PipelinesProceedings of the VLDB Endowment10.14778/3681954.368201317:11(3456-3469)Online publication date: 1-Jul-2024
  • (2023)Schema-based Column Reordering for Dremel-encoded DataProceedings of the International Workshop on Big Data in Emergent Distributed Environments10.1145/3579142.3594286(1-7)Online publication date: 18-Jun-2023
  • (2022)Ameliorating data compression and query performance through cracked ParquetProceedings of the International Workshop on Big Data in Emergent Distributed Environments10.1145/3530050.3532923(1-7)Online publication date: 12-Jun-2022
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Soft Computing - A Fusion of Foundations, Methodologies and Applications
Soft Computing - A Fusion of Foundations, Methodologies and Applications  Volume 23, Issue 12
June 2019
662 pages
ISSN:1432-7643
EISSN:1433-7479
Issue’s Table of Contents

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 01 June 2019

Author Tags

  1. Column stores
  2. Columnar runs
  3. Meta-heuristic optimization
  4. On-disk size compression
  5. RLE compression
  6. Run reduction

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 12 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Partition, Don't Sort! Compression Boosters for Cloud Data Ingestion PipelinesProceedings of the VLDB Endowment10.14778/3681954.368201317:11(3456-3469)Online publication date: 1-Jul-2024
  • (2023)Schema-based Column Reordering for Dremel-encoded DataProceedings of the International Workshop on Big Data in Emergent Distributed Environments10.1145/3579142.3594286(1-7)Online publication date: 18-Jun-2023
  • (2022)Ameliorating data compression and query performance through cracked ParquetProceedings of the International Workshop on Big Data in Emergent Distributed Environments10.1145/3530050.3532923(1-7)Online publication date: 12-Jun-2022
  • (2020)Column Partition and Permutation for Run Length Encoding in Columnar DatabasesProceedings of the 2020 ACM SIGMOD International Conference on Management of Data10.1145/3318464.3384413(2873-2874)Online publication date: 11-Jun-2020
  • (2020)An Algebraic Approach for the Search Space of Permutations with RepetitionEvolutionary Computation in Combinatorial Optimization10.1007/978-3-030-43680-3_2(18-34)Online publication date: 15-Apr-2020

View Options

View options

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media