research-article

Data and Process Quality Evaluation in a Textual Big Data Archiving System

Authors:

Mariagrazia Fugini,

Jacopo Finocchi

ACM Journal on Computing and Cultural Heritage (JOCCH), Volume 15, Issue 1

Article No.: 2, Pages 1 - 19

https://doi.org/10.1145/3461015

Published: 20 March 2022

Abstract

This article presents a textual Big Data analytics solution developed in a real-world setting as part of a high-capacity document digitization and storage system. Software based on machine learning techniques performs automated extraction and processing of textual content. The work focuses on evaluating performance and data confidence, describes an approach to computing a set of indicators for textual data quality, and presents experimental results.

Cited By

D. Ali, T. Blyau, N. Van de Weghe, and S. Verstockt. 2022. Context-aware querying, geolocalization, and rephotography of historical newspaper images. Applied Sciences 12, 21 (2022), 11063. Online publication date: 1 November 2022.
https://doi.org/10.3390/app122111063

Index Terms

Data and Process Quality Evaluation in a Textual Big Data Archiving System
1. Applied computing
  1. Computers in other domains
    1. Digital libraries and archives
  2. Document management and text processing
    1. Document capture
      1. Document analysis
2. Computing methodologies
  1. Machine learning

Information & Contributors

Information

Published In

Journal on Computing and Cultural Heritage Volume 15, Issue 1

February 2022

348 pages

ISSN:1556-4673

EISSN:1556-4711

DOI:10.1145/3505194

Editor:
Franco Niccolucci
VAST-LAB at PIN, University of Florence, Italy

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 March 2022

Accepted: 01 April 2021

Received: 01 November 2020

Published in JOCCH Volume 15, Issue 1

Qualifiers

Research-article
Refereed

Funding Sources

Seamless Project
Regione Lombardia (Italy)
EU Horizon 2020 Research and Innovation Programme
Working Age (Smart Working Environments for All Ages)

