DOI: 10.1145/3603287.3651184
short-paper

Optimum Deep Learning Method for Document Layout Analysis in Low Resource Languages

Published: 27 April 2024

Abstract

Document Layout Analysis (DLA) has become a crucial step in digitizing documents. It is increasingly important to understand a digital document properly in order to gain insight into its structure and contents. DLA combines techniques from image processing, computer vision, and natural language processing to support tasks such as character recognition, document classification, information retrieval, content summarization, and document restructuring. Accurate insight into a document's layout is needed to identify each element and its relationships to other elements. Several major deep learning-based DLA algorithms introduced recently have obtained impressive results on publicly available datasets in high-resource languages such as English. However, little information is available on the effectiveness of deep learning-based DLA approaches for low-resource languages. This paper investigates three state-of-the-art deep learning-based DLA approaches, DiT, LayoutLMv3, and YOLOv8 [9], to find the optimal approach for low-resource, grapheme-based languages such as Bengali. We found that YOLOv8 [9] performs best, with an IoU score 8.95% better than DiT and 38.48% better than LayoutLMv3 on the DLA task in a low-resource, grapheme-based language.
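The comparison above is reported in terms of IoU (Intersection over Union), the ratio of the overlap between a predicted region and its ground-truth region to the area of their union. The abstract does not say whether IoU was computed over bounding boxes or segmentation masks, so the following is only a minimal sketch that assumes axis-aligned boxes given as (x_min, y_min, x_max, y_max). The function name `iou` and the sample coordinates are hypothetical and are not taken from the paper.

```python
from typing import Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Return the Intersection over Union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (if the boxes overlap at all).
    ix_min = max(a[0], b[0])
    iy_min = max(a[1], b[1])
    ix_max = min(a[2], b[2])
    iy_max = min(a[3], b[3])

    # Clamp to zero so that disjoint boxes yield an empty intersection.
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Guard against degenerate (zero-area) inputs.
    return inter / union if union > 0 else 0.0


# Hypothetical predicted and ground-truth regions for one layout element.
predicted = (10.0, 10.0, 110.0, 60.0)
ground_truth = (20.0, 10.0, 120.0, 60.0)
print(f"IoU = {iou(predicted, ground_truth):.4f}")
```

A score of 1.0 means the prediction matches the ground truth exactly, and 0.0 means the two regions do not overlap. A mask-based variant would count overlapping pixels instead of multiplying box sides.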

References

[1]
Galal M Binmakhashen and Sabri A Mahmoud. 2019. Document Layout Analysis: a Comprehensive Survey. ACM Computing Surveys (CSUR) 52, 6 (2019), 1--36.
[2]
Samuele Capobianco, Leonardo Scommegna, and Simone Marinai. 2018. Historical Handwritten Document Segmentation by Using a Weighted Loss. In IAPR Workshop on Artificial Neural Networks in Pattern Recognition. Springer, 395--406.
[3]
Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. 2010. The Pascal Visual Object Classes (VOC) Challenge. International Journal of Computer Vision 88 (2010), 303--338.
[4]
Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. 2015. The KITTI Vision Benchmark Suite. http://www.cvlibs.net/datasets/kitti
[5]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[6]
Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. Mobilenets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv preprint arXiv:1704.04861 (2017).
[7]
Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. 2017. Densely Connected Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[8]
Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. 2022. Layoutlmv3: Pre-training for document ai with unified text and image masking. In Proceedings of the 30th ACM International Conference on Multimedia. 4083--4091.
[9]
Glenn Jocher, Ayush Chaurasia, and Jing Qiu. 2023. Ultralytics YOLO. https://github.com/ultralytics/ultralytics
[10]
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. Imagenet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems 25 (2012).
[11]
Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. 2020. The Open Images Dataset v4: Unified Image Classification, Object Detection, and Visual Relationship Detection at Scale. International Journal of Computer Vision 128, 7 (2020), 1956--1981.
[12]
Junlong Li, Yiheng Xu, Tengchao Lv, Lei Cui, Cha Zhang, and Furu Wei. 2022. Dit: Self-supervised Pre-training for Document Image Transformer. In Proceedings of the 30th ACM International Conference on Multimedia. 3530--3539.
[13]
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In Computer Vision-ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V. Springer International Publishing.
[14]
Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. 2016. SSD: Single Shot MultiBox Detector. In Computer Vision-ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part I, Vol. 14. Springer International Publishing.
[15]
Simone Marinai, Marco Gori, and Giovanni Soda. 2005. Artificial Neural Networks for Document Analysis and Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 1 (2005), 23--35.
[16]
Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. You Only Look Once: Unified, Real-time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[17]
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems. Vol. 28.
[18]
Frank Y. Shih and Shy-Shyan Chen. 1996. Adaptive Document Block Segmentation and Classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 26, 5 (1996), 797--802.
[19]
Md Istiak Hossain Shihab, Md Rakibul Hasan, Mahfuzur Rahman Emon, Syed Mobassir Hossen, Md Nazmuddoha Ansary, Intesur Ahmed, Fazle Rabbi Rakib, Shahriar Elahi Dhruvo, Souhardya Saha Dip, Akib Hasan Pavel, et al. 2023. Badlad: A Large Multi-domain Bengali Document Layout Analysis Dataset. In International Conference on Document Analysis and Recognition. Springer, 326--341.
[20]
Juan Terven and Diana Cordova-Esparza. 2023. A Comprehensive Review of YOLO: From YOLOv1 to YOLOv8 and Beyond. arXiv preprint arXiv:2304.00501 (2023).
[21]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All You Need. Advances in Neural Information Processing Systems 30 (2017).
[22]
Hao Wei, Micheal Baechler, Fouad Slimane, and Rolf Ingold. 2013. Evaluation of SVM, MLP and GMM Classifiers for Layout Analysis of Historical Documents. In The 12th International Conference on Document Analysis and Recognition. IEEE, 1220--1224.
[23]
Zhiheng Xu, Jie Tang, and Antonio J. Yepes. 2019. PubLayNet: Largest Dataset ever for Document Layout Analysis. arXiv preprint arXiv:1908.07836 (2019). https://doi.org/10.48550/arxiv.1908.07836
[24]
Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. 2017. Unpaired Image-to-image Translation Using Cycle-consistent Adversarial Networks. In Proceedings of the IEEE International Conference on Computer Vision. 2223--2232.
[25]
Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. 2018. Learning Transferable Architectures for Scalable Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

    Published In

    ACMSE '24: Proceedings of the 2024 ACM Southeast Conference
    April 2024
    337 pages
    ISBN:9798400702372
    DOI:10.1145/3603287

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Document Layout Analysis (DLA)
    2. Document Segmentation
    3. Optical Character Recognition (OCR)
    4. Region Segmentation
    5. Table Detection
    6. Text Detection

    Qualifiers

    • Short-paper
    • Research
    • Refereed limited

    Conference

    ACM SE '24
    ACM SE '24: 2024 ACM Southeast Conference
    April 18 - 20, 2024
    Marietta, GA, USA

    Acceptance Rates

    ACMSE '24 Paper Acceptance Rate 44 of 137 submissions, 32%;
    Overall Acceptance Rate 502 of 1,023 submissions, 49%
