
Improving Automatic Text Recognition with Language Models in the PyLaia Open-Source Library

  • Conference paper
  • First Online:
Document Analysis and Recognition - ICDAR 2024 (ICDAR 2024)

Abstract

PyLaia is one of the most popular open-source libraries for Automatic Text Recognition (ATR), delivering strong performance in terms of speed and accuracy. In this paper, we outline our recent contributions to the PyLaia library, focusing on the incorporation of reliable confidence scores and the integration of statistical language modeling during decoding. Our implementation provides an easy way to combine PyLaia with n-gram language models at different levels. One of the highlights of this work is that language models are completely auto-tuned: they can be built and used easily without any expert knowledge, and without requiring any additional data. To demonstrate the significance of our contribution, we evaluate PyLaia’s performance on twelve datasets, both with and without language modeling. The results show that decoding with small language models reduces the Word Error Rate by 13% and the Character Error Rate by 12% on average. Additionally, we conduct an analysis of confidence scores and highlight the importance of calibration techniques. Our implementation is publicly available in the official PyLaia repository (https://gitlab.teklia.com/atr/pylaia), and twelve open-source models are released on Hugging Face (https://huggingface.co/collections/Teklia/pylaia-65f16e9ae0aa03690e9e9f80).
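To make the decoding strategy concrete, the sketch below shows how an n-gram language model can be plugged into CTC beam-search decoding using torchaudio's ctc_decoder (the decoder referenced in note 25). It is a minimal illustration rather than PyLaia's own interface: the lexicon, token, and ARPA file names, the alphabet size, and the beam-search hyper-parameters are all placeholders.

```python
import torch
from torchaudio.models.decoder import ctc_decoder

# An n-gram model can be estimated beforehand from the training transcriptions,
# e.g. with KenLM (note 7): lmplz -o 6 < train_transcriptions.txt > language_model.arpa

decoder = ctc_decoder(
    lexicon="lexicon.txt",      # one entry per word, e.g. "hello h e l l o |"
    tokens="tokens.txt",        # the optical model's output alphabet, one token per line
    lm="language_model.arpa",   # KenLM ARPA or binary n-gram model
    nbest=1,
    beam_size=100,
    lm_weight=1.5,              # balances the language-model score against the optical score
    word_score=0.0,
)

# emissions: CPU float32 tensor of shape (batch, frames, num_tokens) holding
# per-frame log-probabilities produced by the optical model (random here).
emissions = torch.randn(1, 200, 60).log_softmax(dim=-1)
hypotheses = decoder(emissions)

best = hypotheses[0][0]         # best hypothesis for the first image
print(" ".join(best.words), best.score)
```

Passing lexicon=None together with a character-level ARPA model switches the same decoder to lexicon-free, character-level rescoring, which loosely mirrors the "different levels" mentioned above; lm_weight is the main knob that any auto-tuning procedure would have to set.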


Notes

  1. https://readcoop.eu/transkribus/.
  2. http://www.transkriptorium.com/.
  3. https://escriptorium.fr/.
  4. https://doc.arkindex.org/.
  5. https://gitlab.teklia.com/atr/pylaia.
  6. http://www.speech.sri.com/projects/srilm/.
  7. https://github.com/kpu/kenlm.
  8. https://atr.pages.teklia.com/pylaia/.
  9. https://huggingface.co/Teklia.
  10. https://github.com/kaldi-asr/kaldi.
  11. https://github.com/mittagessen/kraken.
  12. https://gitlab.teklia.com/atr/pylaia.
  13. https://github.com/arthurflor23/handwritten-text-recognition.
  14. https://black.readthedocs.io/en/stable/.
  15. https://pycqa.github.io/isort/.
  16. https://docs.astral.sh/ruff/.
  17. https://docs.pytest.org/en/7.4.x/.
  18. https://tox.wiki/en/latest/index.html.
  19. https://atr.pages.teklia.com/pylaia/.
  20. https://mkdocs.readthedocs.io/en/stable/.
  21. https://pypi.org/project/pylaia/.
  22. https://gitlab.teklia.com/atr/pylaia/-/releases.
  23. https://atr.pages.teklia.com/pylaia/releases/.
  24. https://www.docker.com/.
  25. https://pytorch.org/audio/main/generated/torchaudio.models.decoder.ctc_decoder.html.
  26. https://atr.pages.teklia.com/pylaia/usage.
  27. https://atr.pages.teklia.com/pylaia/get_started/development/.
  28. https://huggingface.co/Teklia/.
  29. https://parquet.apache.org.
  30. https://demo.arkindex.org/browse/5000e248-a624-4df1-8679-1b34679817ef?top_level=true&folder=true.


Acknowledgement

We thank Joan Puigcerver and Carlos Mocholí for implementing PyLaia and for allowing us to contribute. We also thank Stefan Weil for his recent contributions to PyLaia. This work was supported by the Research Council of Norway through the 328598 IKTPLUSS HuginMunin project.

Author information


Corresponding author

Correspondence to Solène Tarride.


Appendix

7.1 Impact of Language Models on Speed

During inference, computations benefit from GPU acceleration up to the language model decoder, which is limited to CPU execution. As a result, including a language model in the decoding process improves recognition quality but significantly slows down prediction. Specifically, PyLaia's decoding speed drops roughly tenfold when a language model is applied, as detailed in Table 6. We therefore recommend using language models for batch processing of documents, and caution against their use in real-time scenarios due to the associated computational overhead.

Table 6. Impact of language models on decoding time (s/image). Experiments are carried out on an NVIDIA GeForce RTX 3080 Ti GPU.
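To reproduce this kind of measurement, one could time greedy (best-path) decoding against n-gram beam-search decoding on the same batch of network outputs, as in the sketch below. This is a simplified harness, not the benchmarking code behind Table 6; the emissions tensor, file paths, and decoder settings are placeholders.

```python
import time
import torch
from torchaudio.models.decoder import ctc_decoder

device = "cuda" if torch.cuda.is_available() else "cpu"
batch = 16
# Placeholder per-frame log-probabilities standing in for the optical model's output.
emissions = torch.randn(batch, 200, 60, device=device).log_softmax(dim=-1)

# Greedy decoding: a single argmax per frame, fully GPU-accelerated.
start = time.perf_counter()
greedy_paths = emissions.argmax(dim=-1)
if device == "cuda":
    torch.cuda.synchronize()
greedy_time = time.perf_counter() - start

# Language-model decoding: the beam-search decoder only runs on CPU.
lm_decoder = ctc_decoder(
    lexicon="lexicon.txt",
    tokens="tokens.txt",
    lm="language_model.arpa",
    beam_size=100,
    lm_weight=1.5,
)
start = time.perf_counter()
hypotheses = lm_decoder(emissions.cpu())
lm_time = time.perf_counter() - start

print(f"greedy: {greedy_time / batch:.4f} s/image, with LM: {lm_time / batch:.4f} s/image")
```

The gap between the two timings grows with the beam size and the size of the language model, which is why batch processing is recommended above.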


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Tarride, S., Schneider, Y., Generali-Lince, M., Boillet, M., Abadie, B., Kermorvant, C. (2024). Improving Automatic Text Recognition with Language Models in the PyLaia Open-Source Library. In: Barney Smith, E.H., Liwicki, M., Peng, L. (eds) Document Analysis and Recognition - ICDAR 2024. ICDAR 2024. Lecture Notes in Computer Science, vol 14808. Springer, Cham. https://doi.org/10.1007/978-3-031-70549-6_23

  • DOI: https://doi.org/10.1007/978-3-031-70549-6_23

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-70548-9

  • Online ISBN: 978-3-031-70549-6

  • eBook Packages: Computer Science, Computer Science (R0)
