Abstract
Industry benchmarking involves comparing and analyzing a company’s performance against that of top-performing enterprises. PDF documents contain valuable corporate information, but their non-editable nature makes data extraction difficult. This study addresses the conversion of unstructured data in PDF documents, including tables, images, and text, into a structured format suitable for analysis and decision-making. Current approaches to PDF conversion rely mainly on manual extraction, PDF converters, and artificial intelligence algorithms; however, they are often restricted to a single modality, struggle with complex table structures, or fall short of the accuracy required in practice. We focus on converting the periodic reports of listed companies from PDF format into structured data. We propose a unified framework for extracting tables, images, and text by parsing PDF documents into their constituent objects, introduce three bespoke algorithms for processing complex structured tables, and develop a visual analysis prototype system that combines AI-based automated data extraction with the domain knowledge of human experts for auditing. Quantitative and qualitative experiments validate the superiority of the methodology in terms of efficiency, quality, and user-friendliness.
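The implementation details of the framework appear in the full text. As a rough illustration of what "parsing a PDF into constituent objects" can look like in practice, the following minimal Python sketch uses the open-source pdfplumber library to split each page of a report into text, table, and image objects; the library choice and file name are assumptions for illustration and not the authors' actual pipeline.

```python
# Minimal sketch: split a PDF report into per-page text, table, and image objects.
# Assumes the open-source pdfplumber library; not the authors' actual pipeline.
import json
import pdfplumber

def parse_report(path: str) -> list[dict]:
    pages = []
    with pdfplumber.open(path) as pdf:
        for i, page in enumerate(pdf.pages, start=1):
            pages.append({
                "page": i,
                # Raw running text of the page
                "text": page.extract_text() or "",
                # Each table is a list of rows; complex merged-cell layouts still
                # require dedicated table-structure algorithms such as those in the paper
                "tables": page.extract_tables(),
                # Bounding boxes of embedded images, e.g. for later chart analysis
                "images": [
                    {"x0": im["x0"], "top": im["top"], "x1": im["x1"], "bottom": im["bottom"]}
                    for im in page.images
                ],
            })
    return pages

if __name__ == "__main__":
    # "annual_report.pdf" is a hypothetical input file name
    print(json.dumps(parse_report("annual_report.pdf")[:1], ensure_ascii=False, indent=2))
```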
Data availability
The tagged data set used in this article is available on request from the corresponding author.
Acknowledgements
This work was supported in part by the Key R&D “Pioneer” Tackling Plan Program of Zhejiang Province, China (No. 2023C01119), in part by the “Ten Thousand Talents Plan” Science and Technology Innovation Leading Talent Program of Zhejiang Province, China (No. 2022R52044), and in part by the Major Standardization Pilot Projects for the Digital Economy (Digital Trade Sector) of Zhejiang Province, China (No. SJ-BZ/2023053). We thank Wenxuan Zhang, Jucai Lin, Heng Jin, Yu Chen, Zixuan Wang, and Lingqian Zhu for their assistance and support in the writing of this article.
Author information
Authors and Affiliations
Contributions
Haiyang Zhu and Wei Chen wrote the main manuscript text; Jun Yin, Chengcan Chu, Minfeng Zhu, Yating Wei, Jiacheng Pan, and Dongming Han revised and improved it; Haiyang Zhu, Chengcan Chu, and Xuwei Tan were responsible for system development and data collection. All authors read the manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that there are no conflicts of interest related to the content of this article.
Additional information
Communicated by B. Bao.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhu, H., Yin, J., Chu, C. et al. A visual analysis approach for data transformation via domain knowledge and intelligent models. Multimedia Systems 30, 126 (2024). https://doi.org/10.1007/s00530-024-01331-x