Small files access efficiency in Hadoop distributed file system: a case study performed on British Library text files

Cluster Computing

Abstract

Storing and handling large datasets in varied forms is a challenging task. Data is produced rapidly, and much of it consists of small files. Hadoop is a solution to the big data problem, apart from a few limitations. This work proposes an approach that handles small files better in terms of storage, access efficiency, and time. In contrast to existing methods such as HDFS sequence files, HAR, and NHAR, a new architecture called VFS-HDFS is developed with the goal of optimizing access to small files. In HDFS, when a user requests a file, the client contacts the NameNode, which replies with the file's metadata. The metadata describes the file's blocks and their locations. Once the client has this metadata, it communicates with the DataNodes and reads the data sequentially. The proposed work introduces a cache that stores all the files. When a user requests an existing file, the data is retrieved from the cache itself, which avoids revisiting the NameNode and then the DataNodes and so reduces access time and improves access efficiency. Classification groups the files by category, and a bucket-per-category file table holds the metadata of each category-wise container. The proposed development wraps the existing HDFS architecture in a virtual file system layer, so the work is done without changing the HDFS architecture itself. With the proposed system, better results are obtained for the access efficiency of small files in HDFS. A case study is performed on British Library datasets of .txt and .rtf files. The proposed system can be used to enhance a library whose catalogue is grouped by category into containers, reducing storage and improving access efficiency at the cost of memory.
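The read path described in the abstract can be illustrated with a small sketch. This is not the authors' VFS-HDFS implementation: `ToyHDFS`, `CachedCategoryLayer`, the use of the file extension as the category, and the LRU eviction policy are all assumptions made for illustration. The sketch only shows the general idea that a cache hit skips both the NameNode lookup and the DataNode read, while a per-category table records which files belong to which category.

```python
from collections import OrderedDict
from pathlib import PurePosixPath


class ToyHDFS:
    """Hypothetical stand-in for HDFS that counts NameNode and DataNode trips."""

    def __init__(self, files):
        self._files = dict(files)          # path -> file contents (bytes)
        self.namenode_lookups = 0
        self.datanode_reads = 0

    def read(self, path):
        # Step 1: the client asks the NameNode for the file's block metadata.
        self.namenode_lookups += 1
        if path not in self._files:
            raise FileNotFoundError(path)
        # Step 2: the client fetches the blocks from the DataNodes.
        self.datanode_reads += 1
        return self._files[path]


class CachedCategoryLayer:
    """Hypothetical virtual layer: a category table plus an LRU cache."""

    def __init__(self, backend, capacity=1024):
        self.backend = backend
        self.capacity = capacity           # assumption: bounded cache size
        self.cache = OrderedDict()         # path -> bytes, kept in LRU order
        self.category_table = {}           # category -> set of file paths

    @staticmethod
    def categorize(path):
        # Assumption: the file extension stands in for the file's category.
        return PurePosixPath(path).suffix.lstrip(".").lower() or "other"

    def read(self, path):
        if path in self.cache:
            self.cache.move_to_end(path)   # mark as most recently used
            return self.cache[path]
        data = self.backend.read(path)     # cache miss: full HDFS round trip
        self.category_table.setdefault(self.categorize(path), set()).add(path)
        self.cache[path] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used file
        return data


hdfs = ToyHDFS({"/bl/a.txt": b"alpha", "/bl/b.rtf": b"beta"})
layer = CachedCategoryLayer(hdfs, capacity=2)
layer.read("/bl/a.txt")
layer.read("/bl/a.txt")                    # served from the cache
layer.read("/bl/b.rtf")
print(hdfs.namenode_lookups, hdfs.datanode_reads)   # prints: 2 2
print(sorted(layer.category_table))                 # prints: ['rtf', 'txt']
```

Three reads cost only two NameNode lookups because the repeated request is answered from memory. The cache holds file contents in RAM, which matches the abstract's remark that the gain in access efficiency comes at the cost of memory.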


Data availability

The data used to support the findings of this study is available from the corresponding author upon request.

Acknowledgements

The authors are grateful for the characterization support provided to complete this research work.

Funding

The authors declare that no funding was received for this research and publication.

Author information

Contributions

All authors contributed equally to every part of this work.

Corresponding author

Correspondence to Neeta Alange.

Ethics declarations

Conflict of interest

The authors declare that they have no conflicts of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Consent for publication

The authors claim that none of the material in the paper has been published or is under consideration for publication elsewhere.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Alange, N., Sagar, P.V. Small files access efficiency in hadoop distributed file system a case study performed on British library text files. Cluster Comput 26, 3381–3388 (2023). https://doi.org/10.1007/s10586-023-03992-1

  • DOI: https://doi.org/10.1007/s10586-023-03992-1
