A visual analysis approach for data transformation via domain knowledge and intelligent models

  • Regular Paper
  • Published in Multimedia Systems

Abstract

Industry benchmarking involves comparing a company’s performance against that of top-performing enterprises. PDF documents contain valuable corporate information, but their non-editable nature makes data extraction difficult. This study focuses on converting the periodic reports of listed companies from PDF into structured data, covering the tables, images, and text they contain, so that the result is suitable for analysis and decision-making. Current methods for PDF document conversion rely mainly on manual extraction, PDF converters, and artificial intelligence algorithms. However, they are often restricted to a single modality, handle complex structured tables poorly, or cannot achieve the accuracy required in practice. We propose a unified framework that extracts tables, images, and text by parsing PDF documents into their constituent objects. We introduce three bespoke algorithms for processing complex structured tables and develop a prototype visual analysis system that combines AI for automated data extraction with the domain knowledge of human experts for auditing. Quantitative and qualitative experiments are conducted to validate the methodology’s superiority in efficiency, quality, and user-friendliness.
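
The workflow outlined in the abstract (parse a PDF into objects, handle each modality separately, and let human experts audit the automated output) might be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the `PdfObject` record, its `confidence` field, the 0.9 threshold, and both helper functions are hypothetical names and values introduced here for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class PdfObject:
    """One parsed page object; the fields are illustrative, not the paper's schema."""
    page: int
    kind: str          # "table", "image", or "text"
    content: str
    confidence: float  # hypothetical score from an automated extractor, in [0, 1]


def group_by_modality(objects):
    """Bucket parsed objects by kind so each modality can get its own extractor."""
    buckets = defaultdict(list)
    for obj in objects:
        buckets[obj.kind].append(obj)
    return dict(buckets)


def split_for_audit(objects, threshold=0.9):
    """Separate auto-accepted objects from those routed to a human expert.

    The threshold is an assumption made for this sketch.
    """
    accepted = [o for o in objects if o.confidence >= threshold]
    needs_review = [o for o in objects if o.confidence < threshold]
    return accepted, needs_review


# Made-up sample objects standing in for the output of a PDF parser.
objects = [
    PdfObject(1, "text", "Revenue rose in 2022.", 0.97),
    PdfObject(1, "table", "Item | 2022 | 2021", 0.72),
    PdfObject(2, "image", "chart_page2.png", 0.95),
]

buckets = group_by_modality(objects)
accepted, needs_review = split_for_audit(objects)

print({kind: len(items) for kind, items in buckets.items()})
print([o.kind for o in needs_review])
```

In this sketch only the low-scoring table is sent for expert review, which mirrors the idea of reserving human effort for the cases where automated extraction is least reliable.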

Data availability

The tagged data set used in this article is available on request from the corresponding author.

Notes

  1. https://github.com/jsvine/pdfplumber.

  2. https://github.com/pdfminer/pdfminer.six.

Acknowledgements

This work was supported in part by the Key R&D “Pioneer” Tackling Plan Program of Zhejiang Province, China (No. 2023C01119), in part by the “Ten Thousand Talents Plan” Science and Technology Innovation Leading Talent Program of Zhejiang Province, China (No. 2022R52044), and in part by the Major Standardization Pilot Projects for the Digital Economy (Digital Trade Sector) of Zhejiang Province, China (No. SJ-BZ/2023053). We thank Wenxuan Zhang, Jucai Lin, Heng Jin, Yu Chen, Zixuan Wang and Lingqian Zhu for their assistance and support in the writing of this article.

Author information

Contributions

Haiyang Zhu and Wei Chen wrote the main manuscript; Jun Yin, Chengcan Chu, Minfeng Zhu, Yating Wei, Jiacheng Pan and Dongming Han optimized it; Haiyang Zhu, Chengcan Chu, and Xuwei Tan were responsible for system development and data collection. All the authors read the manuscript.

Corresponding author

Correspondence to Wei Chen.

Ethics declarations

Conflict of interest

The authors declare that there are no conflicts of interest related to the content of this article.

Additional information

Communicated by B. Bao.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhu, H., Yin, J., Chu, C. et al. A visual analysis approach for data transformation via domain knowledge and intelligent models. Multimedia Systems 30, 126 (2024). https://doi.org/10.1007/s00530-024-01331-x

  • DOI: https://doi.org/10.1007/s00530-024-01331-x