Abstract
Storing and handling large volumes of data in varied forms is a challenging task. Data is being produced rapidly, and much of it arrives as small files. Hadoop is a solution to the big data problem, apart from a few limitations. This work proposes an approach that handles small files better in terms of storage, access efficiency, and time. In contrast to existing methods such as HDFS sequence files, HAR, and NHAR, a new architecture called VFS-HDFS is designed to address the small-file access problem. In HDFS, when a user requests a file, the client contacts the NameNode, which responds with the file's metadata. The metadata describes the file's blocks and their locations. Once the client has the metadata for a file, it contacts the DataNodes and reads the data sequentially. The proposed work introduces caching to store all the files. When a user requests a file that already exists in the cache, the data is retrieved from the cache itself, which avoids revisiting the NameNode and then the DataNodes and so reduces access time. Classification is used to group files by category, and a bucket-per-category file table holds the metadata of each category-wise container. The proposed system wraps the existing HDFS architecture in a virtual file system layer without changing the HDFS architecture itself. With the proposed system, better access efficiency is obtained for small files in HDFS. A case study is performed on .txt and .rtf files from the British Library datasets. The proposed system can benefit a library whose catalogue is grouped by category into containers, reducing storage and improving access efficiency at the cost of memory.
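The read path described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names SimulatedHdfs and VirtualFileLayer are hypothetical, HDFS is replaced by an in-memory stand-in, the cache is an unbounded dictionary, and classification by file extension is an assumption made only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SimulatedHdfs:
    """In-memory stand-in for the NameNode/DataNode read path (not a real HDFS client)."""
    files: dict[str, bytes]
    namenode_lookups: int = 0
    datanode_reads: int = 0

    def read(self, path: str) -> bytes:
        # Step 1: the client asks the NameNode for the file's metadata.
        self.namenode_lookups += 1
        if path not in self.files:
            raise FileNotFoundError(path)
        # Step 2: the client reads the blocks from the DataNodes.
        self.datanode_reads += 1
        return self.files[path]


@dataclass
class VirtualFileLayer:
    """Hypothetical wrapper layer: a cache plus a per-category file table."""
    backend: SimulatedHdfs
    cache: dict[str, bytes] = field(default_factory=dict)
    # category -> {path: size in bytes}; plays the role of the bucket file table
    bucket_table: dict[str, dict[str, int]] = field(default_factory=dict)

    @staticmethod
    def classify(path: str) -> str:
        # Assumption: the category is simply the lower-cased file extension.
        name = path.rsplit("/", 1)[-1]
        return name.rsplit(".", 1)[-1].lower() if "." in name else "other"

    def read(self, path: str) -> bytes:
        # Cache hit: no NameNode or DataNode round trip is needed.
        if path in self.cache:
            return self.cache[path]
        # Cache miss: fall through to the unchanged backend, then remember the result.
        data = self.backend.read(path)
        self.cache[path] = data
        self.bucket_table.setdefault(self.classify(path), {})[path] = len(data)
        return data


hdfs = SimulatedHdfs(files={"/bl/a.txt": b"alpha", "/bl/b.rtf": b"beta"})
vfs = VirtualFileLayer(hdfs)
first = vfs.read("/bl/a.txt")   # goes through the simulated NameNode and DataNodes
second = vfs.read("/bl/a.txt")  # served from the cache
```

After the two reads, the backend counters show a single NameNode lookup, which is the saving the abstract attributes to caching. The cache here grows without bound, which mirrors the abstract's remark that the gain comes at the cost of memory.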
Data availability
The data used to support the findings of this study is available from the corresponding author upon request.
Acknowledgements
The authors are grateful for the characterization support provided to complete this research work.
Funding
The authors declare that no funding was received for this research and publication.
Author information
Contributions
Each author contributed equally to every part of this work.
Ethics declarations
Conflict of interest
The authors declare that they have no conflicts of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Consent for publications
The authors claim that none of the material in the paper has been published or is under consideration for publication elsewhere.
About this article
Cite this article
Alange, N., Sagar, P.V. Small files access efficiency in hadoop distributed file system a case study performed on British library text files. Cluster Comput 26, 3381–3388 (2023). https://doi.org/10.1007/s10586-023-03992-1
DOI: https://doi.org/10.1007/s10586-023-03992-1