research-article

Data and Process Quality Evaluation in a Textual Big Data Archiving System

Published: 20 March 2022

Abstract

The article presents a textual Big Data analytics solution developed in a real setting as part of a high-capacity document digitization and storage system. Software based on machine learning techniques performs automated extraction and processing of textual content. The work focuses on evaluating performance and data confidence, and describes an approach to computing a set of indicators for textual data quality. It then presents experimental results.

Cited By

  • (2022) Context-Aware Querying, Geolocalization, and Rephotography of Historical Newspaper Images. Applied Sciences 12:21 (11063). DOI: 10.3390/app122111063. Online publication date: 1-Nov-2022

Published In

Journal on Computing and Cultural Heritage, Volume 15, Issue 1
February 2022
348 pages
ISSN:1556-4673
EISSN:1556-4711
DOI:10.1145/3505194

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 March 2022
Accepted: 01 April 2021
Received: 01 November 2020
Published in JOCCH Volume 15, Issue 1

Author Tags

  1. Big Data analytics
  2. unstructured Big Data
  3. text analytics
  4. machine learning
  5. data quality
  6. content management

Qualifiers

  • Research-article
  • Refereed

Funding Sources

  • Seamless Project
  • Regione Lombardia (Italy)
  • EU Horizon 2020 Research and Innovation Programme
  • Working Age (Smart Working Environments for All Ages)

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 44
  • Downloads (last 6 weeks): 4

Reflects downloads up to 13 Dec 2024
