Small files access efficiency in Hadoop distributed file system: a case study performed on British Library text files

Neeta Alange¹ &
P. Vidya Sagar¹

Abstract

Storing and handling large volumes of data in varied forms is a challenging task, and much of the data now being produced arrives as small files. Hadoop is a solution to the big data problem, apart from a few limitations. This work proposes an approach that handles small files better in terms of storage, access efficiency, and time. In contrast to current methods such as HDFS sequence files, HAR, and NHAR, a new architecture called VFS-HDFS is designed to optimize access to small files. In HDFS, when a user requests a file, the client contacts the NameNode, which replies with the file's metadata: the blocks that make up the file and their locations. With this metadata, the client contacts the DataNodes and reads the data sequentially. The proposed work introduces a cache that stores all the files. When a user requests an existing file, the data is retrieved from the cache itself, which avoids revisiting the NameNode and then the DataNodes and so reduces access time. Classification is used to group files by category, and a per-category bucket file table holds the metadata of each category's container. The proposed system wraps the existing HDFS architecture in a virtual file system layer without changing the HDFS architecture itself. With the proposed system, better results are obtained for the access efficiency of small files in HDFS. A case study is performed on British Library datasets of .txt and .rtf files. The proposed system can benefit a library whose catalogue is grouped by category into containers, reducing storage and improving access efficiency at the cost of memory.
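The read path described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' VFS-HDFS implementation: `FakeHdfs`, `CachedSmallFileLayer`, and `category_of` are hypothetical names, the in-memory dictionary stands in for a real cluster, and classifying files by extension is an assumption made only for illustration. The sketch shows a wrapper layer that serves repeated requests from a cache, so the NameNode and DataNodes are contacted only on the first read of each file.

```python
import posixpath
from dataclasses import dataclass, field


@dataclass
class FakeHdfs:
    """Hypothetical stand-in for HDFS that counts round trips."""
    files: dict
    namenode_calls: int = 0
    datanode_calls: int = 0

    def read(self, path: str) -> bytes:
        # Step 1: the client asks the NameNode for block metadata.
        self.namenode_calls += 1
        # Step 2: the client fetches the blocks from the DataNodes.
        self.datanode_calls += 1
        return self.files[path]


@dataclass
class CachedSmallFileLayer:
    """Hypothetical wrapper layer; the wrapped HDFS object is unchanged."""
    hdfs: FakeHdfs
    cache: dict = field(default_factory=dict)    # path -> file contents
    buckets: dict = field(default_factory=dict)  # category -> set of paths

    @staticmethod
    def category_of(path: str) -> str:
        # Assumption: the category is the file extension (e.g. txt, rtf).
        ext = posixpath.splitext(path)[1].lstrip(".").lower()
        return ext or "other"

    def read(self, path: str) -> bytes:
        if path in self.cache:
            # Cache hit: no NameNode or DataNode round trip.
            return self.cache[path]
        data = self.hdfs.read(path)
        self.cache[path] = data
        # Record the file in its per-category bucket table.
        self.buckets.setdefault(self.category_of(path), set()).add(path)
        return data


if __name__ == "__main__":
    hdfs = FakeHdfs({"/bl/a.txt": b"alpha", "/bl/b.rtf": b"beta"})
    layer = CachedSmallFileLayer(hdfs)
    layer.read("/bl/a.txt")
    layer.read("/bl/a.txt")  # second read is served from the cache
    print(hdfs.namenode_calls, sorted(layer.buckets))
```

Because the sketch keeps every file it has read in memory, it trades memory for access time, which matches the trade-off stated in the abstract; a real deployment would need a policy for bounding the cache.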

Data availability

The data used to support the findings of this study is available from the corresponding author upon request.

Acknowledgements

The authors are grateful for the characterization support provided to complete this research work.

Funding

The authors declare that no funding was received for this research or its publication.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Koneru Lakshmaiah Education Foundation, KL Deemed To Be University, Vaddeswaram, AP, India
Neeta Alange & P. Vidya Sagar

Contributions

Both authors contributed equally to every part of this work.

Corresponding author

Correspondence to Neeta Alange.

Ethics declarations

Conflict of interest

The authors declare that they have no conflicts of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Consent for publications

The authors claim that none of the material in the paper has been published or is under consideration for publication elsewhere.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Alange, N., Sagar, P.V. Small files access efficiency in hadoop distributed file system a case study performed on British library text files. Cluster Comput 26, 3381–3388 (2023). https://doi.org/10.1007/s10586-023-03992-1

Received: 07 January 2023
Revised: 10 February 2023
Accepted: 18 March 2023
Published: 07 April 2023
Issue Date: December 2023
DOI: https://doi.org/10.1007/s10586-023-03992-1
