DOI: 10.1145/3412841.3442075
poster

Automation and prioritization of replica balancing in HDFS

Published: 22 April 2021

Abstract

The Hadoop Distributed File System (HDFS) is a reliable storage engine designed to run on commodity hardware. To provide reliability and read performance, HDFS uses a storage model based on data replication and works best when file blocks are evenly spread across the cluster. HDFS Balancer is an Apache Hadoop daemon created for replica balancing on the file system. However, the tool is not optimized to meet potential demands for reliability and availability during data redistribution, and it must be configured and triggered manually. In this work, we present a solution for replica balancing that combines a proactive and a reactive approach. The former is addressed through active monitoring of the computational environment by an agent-server structure. The latter is based on customizing the default operation policy of the HDFS Balancer. As shown by the evaluation results, the solution automates the use of the HDFS Balancer and allows it to run according to the reliability of the racks and the availability of the data stored in the cluster.
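
To make the notion of a balanced cluster concrete, here is a minimal Python sketch of the kind of check a monitoring component could run before deciding to trigger the HDFS Balancer. It is not the authors' implementation. It assumes the stock balancer criterion, where a DataNode counts as balanced when its utilization differs from the overall cluster utilization by no more than a threshold percentage (10 by default). The names `DataNode`, `unbalanced_nodes`, and `needs_balancing` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataNode:
    """Hypothetical snapshot of one DataNode's storage, in bytes."""
    name: str
    capacity: int
    used: int


def utilization(used: int, capacity: int) -> float:
    """Return used space as a percentage of capacity."""
    return 100.0 * used / capacity


def cluster_utilization(nodes: List[DataNode]) -> float:
    """Overall utilization: total used space over total capacity."""
    total_used = sum(n.used for n in nodes)
    total_capacity = sum(n.capacity for n in nodes)
    return utilization(total_used, total_capacity)


def unbalanced_nodes(nodes: List[DataNode], threshold: float = 10.0) -> List[str]:
    """Names of nodes whose utilization deviates from the cluster
    utilization by more than `threshold` percentage points."""
    avg = cluster_utilization(nodes)
    return [
        n.name
        for n in nodes
        if abs(utilization(n.used, n.capacity) - avg) > threshold
    ]


def needs_balancing(nodes: List[DataNode], threshold: float = 10.0) -> bool:
    """True if at least one node falls outside the threshold band."""
    return bool(unbalanced_nodes(nodes, threshold))


# Illustrative data only: three equally sized nodes at 90%, 50%, and 40%.
nodes = [
    DataNode("dn1", capacity=100, used=90),
    DataNode("dn2", capacity=100, used=50),
    DataNode("dn3", capacity=100, used=40),
]
print(unbalanced_nodes(nodes))  # dn1 and dn3 deviate by more than 10 points
print(needs_balancing(nodes))
```

In a setup like the one the abstract describes, a positive result from such a check could be the signal that starts a balancer run (for example via `hdfs balancer -threshold 10`) instead of waiting for an administrator to do so. How the authors' agent-server structure actually makes that decision, and how rack reliability and data availability enter the customized policy, is not specified here.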

Cited By

  • (2022) PRBP: A prioritized replica balancing policy for HDFS balancer. Software: Practice and Experience 53:3 (600-630). DOI: 10.1002/spe.3162. Online publication date: 15-Nov-2022

      Published In

      SAC '21: Proceedings of the 36th Annual ACM Symposium on Applied Computing
      March 2021
      2075 pages
      ISBN:9781450381048
      DOI:10.1145/3412841
      Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. data availability
      2. racks reliability
      3. replica balancing

      Qualifiers

      • Poster

      Conference

      SAC '21
      SAC '21: The 36th ACM/SIGAPP Symposium on Applied Computing
      March 22 - 26, 2021
      Virtual Event, Republic of Korea

      Acceptance Rates

      Overall Acceptance Rate 1,650 of 6,669 submissions, 25%
