[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
survey

A Study on Garbage Collection Algorithms for Big Data Environments

Published: 10 January 2018 Publication History

Abstract

The need to process and store massive amounts of data—Big Data—is a reality. In areas such as scientific experiments, social networks management, credit card fraud detection, targeted advertisement, and financial analysis, massive amounts of information are generated and processed daily to extract valuable, summarized information. Due to its fast development cycle (i.e., less expensive to develop), mainly because of automatic memory management, and rich community resources, managed object-oriented programming languages (e.g., Java) are the first choice to develop Big Data platforms (e.g., Cassandra, Spark) on which such Big Data applications are executed.
However, automatic memory management comes at a cost. This cost is introduced by the garbage collector, which is responsible for collecting objects that are no longer being used. Although current (classic) garbage collection algorithms may be applicable to small-scale applications, these algorithms are not appropriate for large-scale Big Data environments, as they do not scale in terms of throughput and pause times.
In this work, current Big Data platforms and their memory profiles are studied to understand why classic algorithms (which are still the most commonly used) are not appropriate, and also to analyze recently proposed and relevant memory management algorithms, targeted to Big Data environments. The scalability of recent memory management algorithms is characterized in terms of throughput (improves the throughput of the application) and pause time (reduces the latency of the application) when compared to classic algorithms. The study is concluded by presenting a taxonomy of the described works and some open problems, with regard to Big Data memory management, that could be addressed in future works.

References

[1]
Rajendra Akerkar. 2013. Big Data Computing. CRC Press, Boca Raton, FL.
[2]
Tyler Akidau, Alex Balikov, Kaya Bekiroğlu, Slava Chernyak, Josh Haberman, Reuven Lax, Sam McVeety, Daniel Mills, Paul Nordstrom, and Sam Whittle. 2013. MillWheel: Fault-tolerant stream processing at Internet scale. Proceedings of the VLDB Endowment 6, 11, 1033--1044.
[3]
Bowen Alpern, C. Richard Attanasio, John J. Barton, Michael G. Burke, Perry Cheng, J.-D. Choi, Anthony Cocchi, et al. 2000. The Jalapeno virtual machine. IBM Systems Journal 39, 1 (2000), 211--238.
[4]
Andrew W. Appel. 1989. Simple generational garbage collection and fast allocation. Software: Practice and Experience 19, 2, 171--183.
[5]
Henry G. Baker Jr. 1978. List processing in real time on a serial computer. Communications of the ACM 21, 4, 280--294.
[6]
Peter B. Bishop. 1977. Computer Systems With a Very Large Address Space and Garbage Collection. Technical Report. DTIC Document.
[7]
S. Blackburn, P. Cheng, and K. McKinley. 2004. Oil and water? High performance garbage collection in Java with MMTk. In Proceedings of the 26th International Conference on Software Engineering (ICSE’04). 137--146.
[8]
Stephen M. Blackburn and Kathryn S. McKinley. 2003. Ulterior reference counting: Fast garbage collection without a long wait. In Proceedings of the 18th Annual ACM SIGPLAN Conference on Object-Oriented Programing, Systems, Languages, and Applications (OOPSLA’03). ACM, New York, NY, 344--358.
[9]
Vinayak Borkar, Michael Carey, Raman Grover, Nicola Onose, and Rares Vernica. 2011. Hyracks: A flexible and extensible foundation for data-intensive computing. In Proceedings of the 2011 IEEE 27th International Conference on Data Engineering. IEEE, Los Alamitos, CA, 1151--1162.
[10]
Dhruba Borthakur, Jonathan Gray, Joydeep Sen Sarma, Kannan Muthukkaruppan, Nicolas Spiegelberg, Hairong Kuang, Karthik Ranganathan, et al. 2011. Apache Hadoop goes realtime at Facebook. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data. ACM, New York, NY, 1071--1080.
[11]
Don Box and Ted Pattison. 2002. Essential .NET: The Common Language Runtime. Addison-Wesley Longman Publishing Co., Inc.
[12]
Rodrigo Bruno, Luís Picciochi Oliveira, and Paulo Ferreira. 2017. NG2C: Pretenuring garbage collection with dynamic generations for hotspot big data applications. In Proceedings of the 2017 ACM SIGPLAN International Symposium on Memory Management (ISMM’17). ACM, New York, NY, 2--13.
[13]
Randal Bryant, Randy H. Katz, and Edward D. Lazowska. 2008. Big-Data Computing: Creating Revolutionary Breakthroughs in Commerce, Science, and Society. Computing Community Consortium.
[14]
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. 2008. Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems 26, 2, 4.
[15]
Kristina Chodorow. 2013. MongoDB: The Definitive Guide. O’Reilly Media, Inc.
[16]
Daniel Clifford, Hannes Payer, Michael Stanton, and Ben L. Titzer. 2015. Memento mori: Dynamic allocation-site-based optimizations. In Proceedings of the 2015 International Symposium on Memory Management (ISMM’15). ACM, New York, NY, 105--117.
[17]
Nachshon Cohen and Erez Petrank. 2015. Data structure aware garbage collector. ACM SIGPLAN Notices 50, 28--40.
[18]
George E. Collins. 1960. A method for overlapping and erasure of lists. Communications of the ACM 3, 12, 655--657.
[19]
Michael Cox and David Ellsworth. 1997. Application-controlled demand paging for out-of-core visualization. In Proceedings of the 8th Conference on Visualization’97 (VIS’97). IEEE, Los Alamitos, CA, 235--ff. http://dl.acm.org/citation.cfm?id=266989.267068
[20]
Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: Simplified data processing on large clusters. Communications of the ACM 51, 1, 107--113.
[21]
Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. 2007. Dynamo: Amazon’s highly available key-value store. ACM SIGOPS Operating Systems Review 41, 205--220.
[22]
David Detlefs, Christine Flood, Steve Heller, and Tony Printezis. 2004. Garbage-first garbage collection. In Proceedings of the 4th International Symposium on Memory Management. ACM, New York, NY, 37--48.
[23]
Edsger W. Dijkstra, Leslie Lamport, Alain J. Martin, Carel S. Scholten, and Elisabeth F. M. Steffens. 1978. On-the-fly garbage collection: An exercise in cooperation. Communications of the ACM 21, 11, 966--975.
[24]
R. Dimpsey, R. Arora, and K. Kuiper. 2000. Java server performance: A case study of building efficient, scalable JVMs. IBM Systems Journal 39, 1, 151--174.
[25]
Jens Dittrich and Jorge-Arnulfo Quiané-Ruiz. 2012. Efficient big data processing in Hadoop MapReduce. Proceedings of the VLDB Endowment 5, 12, 2014--2015.
[26]
David Gay and Bjarne Steensgaard. 2000. Fast escape analysis and stack allocation for object-based programs. In Compiler Construction. Springer, 82--93.
[27]
Lars George. 2011. HBase: The Definitive Guide. O’Reilly Media, Inc.
[28]
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. 2003. The Google File System. ACM SIGOPS Operating Systems Review 37, 29--43.
[29]
Lokesh Gidra, Gaël Thomas, Julien Sopena, and Marc Shapiro. 2013. A study of the scalability of stop-the-world garbage collectors on multicores. ACM SIGPLAN Notices 48, 229--240.
[30]
Lokesh Gidra, Gaël Thomas, Julien Sopena, Marc Shapiro, and Nhan Nguyen. 2015. NumaGiC: A garbage collector for Big Data on Big NUMA machines. ACM SIGARCH Computer Architecture News 43, 661--673.
[31]
Ionel Gog, Jana Giceva, Malte Schwarzkopf, Kapil Vaswani, Dimitrios Vytiniotis, Ganesan Ramalingam, Manuel Costa, Derek G. Murray, Steven Hand, and Michael Isard. 2015. Broom: Sweeping out garbage collection from Big Data systems. In Proceedings of the 15th Workshop on Hot Topics in Operating Systems (HotOS XV).
[32]
James Gosling. 2000. The Java Language Specification. Addison-Wesley Professional.
[33]
Herodotos Herodotou and Shivnath Babu. 2011. Profiling, what-if analysis, and cost-based optimization of MapReduce programs. Proceedings of the VLDB Endowment 4, 1, 1111--1122.
[34]
Richard L. Hudson and J. Eliot B. Moss. 1992. Incremental collection of mature objects. In Memory Management. Springer, 388--403.
[35]
R. John M. Hughes. 1982. A semi-incremental garbage collection algorithm. Software: Practice and Experience 12, 11, 1081--1082.
[36]
Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. 2007. Dryad: Distributed data-parallel programs from sequential building blocks. ACM SIGOPS Operating Systems Review 41, 59--72.
[37]
Richard Jones, Antony Hosking, and Eliot Moss. 2011. The Garbage Collection Handbook: The Art of Automatic Memory Management. Chapman 8 Hall/CRC.
[38]
Richard Jones and Andy C. King. 2005. A fast analysis for thread-local garbage collection with dynamic class loading. In Proceedings of the 5th IEEE International Workshop on Source Code Analysis and Manipulation. IEEE, Los Alamitos, CA, 129--138.
[39]
Richard Jones and Chris Ryder. 2006. Garbage collection should be lifetime aware. In Proceedings of the Implementation, Compilation, Optimization of Object-Oriented Languages, Programs, and Systems Conference (ICOOOLPS’06).
[40]
Richard E. Jones and Chris Ryder. 2008. A study of Java object demographics. In Proceedings of the 7th International Symposium on Memory Management. ACM, New York, NY, 121--130.
[41]
Kenneth C. Knowlton. 1965. A fast storage allocator. Communications of the ACM 8, 10, 623--624.
[42]
Aapo Kyrola, Guy Blelloch, and Carlos Guestrin. 2012. GraphChi: Large-scale graph computation on just a PC. In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI’12). 31--46.
[43]
Avinash Lakshman and Prashant Malik. 2010. Cassandra: A decentralized structured storage system. ACM SIGOPS Operating Systems Review 44, 2, 35--40.
[44]
Leslie Lamport. 1978. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM 21, 7, 558--565.
[45]
Henry Lieberman and Carl Hewitt. 1983. A real-time garbage collector based on the lifetimes of objects. Communications of the ACM 26, 6, 419--429.
[46]
Jimmy Lin and Dmitriy Ryaboy. 2013. Scaling big data mining infrastructure: The Twitter experience. ACM SIGKDD Explorations Newsletter 14, 2, 6--19.
[47]
Lu Lu, Xuanhua Shi, Yongluan Zhou, Xiong Zhang, Hai Jin, Cheng Pei, Ligang He, and Yuanzhen Geng. 2016. Lifetime-based memory management for distributed data processing systems. Proceedings of the VLDB Endowment 9, 12, 936--947.
[48]
Clifford Lynch. 2008. Big data: How do your data grow? Nature 455, 7209, 28--29.
[49]
Martin Maas, Krste Asanović, Tim Harris, and John Kubiatowicz. 2016. Taurus: A holistic language runtime system for coordinating distributed managed-language applications. In Proceedings of the 21st International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, New York, NY, 457--471.
[50]
Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. 2010. Pregel: A system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. ACM, New York, NY, 135--146.
[51]
John McCarthy. 1960. Recursive functions of symbolic expressions and their computation by machine, part I. Communications of the ACM 3, 4, 184--195.
[52]
David A. Moon. 1984. Garbage collection in a large LISP system. In Proceedings of the 1984 ACM Symposium on LISP and Functional Programming. ACM, New York, NY, 235--246.
[53]
Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martin Abadi. 2013. Naiad: A timely dataflow system. In Proceedings of the 24th ACM Symposium on Operating Systems Principles. ACM, New York, NY, 439--455.
[54]
Scott Nettles, James O’Toole, David Pierce, and Nicholas Haines. 1992. Replication-Based Incremental Copying Collection. Springer.
[55]
Khanh Nguyen, Kai Wang, Yingyi Bu, Lu Fang, Jianfei Hu, and Guoqing Xu. 2015. Facade: A compiler and runtime for (almost) object-bounded big data applications. ACM SIGPLAN Notices 50, 675--690.
[56]
Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. 2008. Pig Latin: A not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. ACM, New York, NY, 1099--1110.
[57]
H. Paz, D. F. Bacon, E. K. Kolodner, E. Petrank, and V. T. Rajan. 2007. An efficient on-the-fly cycle collection. ACM Transactions on Programming Languages and Systems 29, 4, Article No. 20.
[58]
Ian Robinson, Jim Webber, and Emil Eifrem. 2013. Graph Databases. O’Reilly Media, Inc.
[59]
Semih Salihoglu and Jennifer Widom. 2013. GPS: A graph processing system. In Proceedings of the 25th International Conference on Scientific and Statistical Database Management. ACM, New York, NY, 22.
[60]
Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. 2010. The Hadoop Distributed File System. In Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST’10). IEEE, Los Alamitos, CA, 1--10.
[61]
Sunil Soman, Chandra Krintz, and Laurent Daynès. 2008. MTM2: Scalable memory management for multi-tasking managed runtime environments. In ECOOP 2008—Object-Oriented Programming. Springer, 335--361.
[62]
C. J. Stephenson. 1983. New methods for dynamic storage allocation (fast fits). In Proceedings of the 9th ACM Symposium on Operating Systems Principles (SOSP’83). ACM, New York, NY, 30--32.
[63]
Roshan Sumbaly, Jay Kreps, and Sam Shah. 2013. The big data ecosystem at LinkedIn. In Proceedings of the 2013 International Conference on Management of Data. ACM, New York, NY, 1125--1134.
[64]
Andrew S. Tanenbaum. 2007. Modern Operating Systems. Prentice Hall Press.
[65]
Gil Tene, Balaji Iyengar, and Michael Wolf. 2011. C4: The Continuously Concurrent Compacting Collector. ACM SIGPLAN Notices 46, 11, 79--88.
[66]
Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. 2009. Hive: A warehousing solution over a Map-Reduce framework. Proceedings of the VLDB Endowment 2, 2, 1626--1629.
[67]
David Ungar. 1984. Generation scavenging: A non-disruptive high performance storage reclamation algorithm. ACM SIGPLAN Notices 19, 5, 157--167.
[68]
David Ungar and Frank Jackson. 1988. Tenuring policies for generation-based storage reclamation. ACM SIGPLAN Notices 23, 1--17.
[69]
Rik Van Bruggen. 2014. Learning Neo4j. Packt Publishing Ltd.
[70]
Tom White. 2009. Hadoop: The Definitive Guide. O’Reilly Media, Inc.
[71]
Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, and Jon Currey. 2008. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In Proceedings of the 8th Symposium on Operating Systems Design and Implementation (OSDI’08), Vol. 8. 1--14.
[72]
Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing. 10.

Cited By

View all
  • (2023)Sustainable Development of Business Economy Based on Big Data Algorithm under the Background of Low-Carbon EconomySustainability10.3390/su1507584015:7(5840)Online publication date: 28-Mar-2023
  • (2023)One-shot Garbage Collection for In-memory OLTP through Temporality-aware Version StorageProceedings of the ACM on Management of Data10.1145/35886991:1(1-25)Online publication date: 30-May-2023
  • (2023)An Empirical Study of Fault Triggers in Deep Learning FrameworksIEEE Transactions on Dependable and Secure Computing10.1109/TDSC.2022.315223920:4(2696-2712)Online publication date: 1-Jul-2023
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Computing Surveys
ACM Computing Surveys  Volume 51, Issue 1
January 2019
743 pages
ISSN:0360-0300
EISSN:1557-7341
DOI:10.1145/3177787
  • Editor:
  • Sartaj Sahni
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 January 2018
Accepted: 01 October 2017
Revised: 01 July 2017
Received: 01 November 2016
Published in CSUR Volume 51, Issue 1

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Big Data
  2. Big Data environment
  3. Garbage collection
  4. Java
  5. memory managed runtime
  6. processing platforms
  7. scalability
  8. storage platform

Qualifiers

  • Survey
  • Research
  • Refereed

Funding Sources

  • FCT scholarships
  • Fundação para a Ciência e a Tecnologia (FCT)

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)73
  • Downloads (Last 6 weeks)5
Reflects downloads up to 09 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2023)Sustainable Development of Business Economy Based on Big Data Algorithm under the Background of Low-Carbon EconomySustainability10.3390/su1507584015:7(5840)Online publication date: 28-Mar-2023
  • (2023)One-shot Garbage Collection for In-memory OLTP through Temporality-aware Version StorageProceedings of the ACM on Management of Data10.1145/35886991:1(1-25)Online publication date: 30-May-2023
  • (2023)An Empirical Study of Fault Triggers in Deep Learning FrameworksIEEE Transactions on Dependable and Secure Computing10.1109/TDSC.2022.315223920:4(2696-2712)Online publication date: 1-Jul-2023
  • (2023) RemOrphan : Object Storage Sustainability Through Rem oving Offline-Processed Orphan Garbage Data IEEE Access10.1109/ACCESS.2023.331921711(107049-107067)Online publication date: 2023
  • (2023)BestGC: An Automatic GC SelectorIEEE Access10.1109/ACCESS.2023.329439811(72357-72373)Online publication date: 2023
  • (2023)Evaluation of a data-driven intelligent waste classification system for scientific management of garbage recycling in a Chinese communityEnvironmental Science and Pollution Research10.1007/s11356-023-28639-x30:37(87913-87924)Online publication date: 11-Jul-2023
  • (2021)Effect of Garbage Collection on Streaming Big Data Workloads2021 2nd Global Conference for Advancement in Technology (GCAT)10.1109/GCAT52182.2021.9587808(1-6)Online publication date: 1-Oct-2021
  • (2021)Influencing Factors in the Scalability of Distributed Stream Processing JobsIEEE Access10.1109/ACCESS.2021.31026459(109413-109431)Online publication date: 2021
  • (2021)A Comparative Analysis of Garbage Collectors and Their Suitability for Big Data WorkloadsAdvances in Computing and Network Communications10.1007/978-981-33-6977-1_24(305-316)Online publication date: 21-Apr-2021
  • (2020)Modeling and Management Big Data in Databases—A Systematic Literature ReviewSustainability10.3390/su1202063412:2(634)Online publication date: 15-Jan-2020
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media