
Improving Automatic Text Recognition with Language Models in the PyLaia Open-Source Library

  • Conference paper
  • First Online:
Document Analysis and Recognition - ICDAR 2024 (ICDAR 2024)

Abstract

PyLaia is one of the most popular open-source libraries for Automatic Text Recognition (ATR), delivering strong performance in terms of speed and accuracy. In this paper, we outline our recent contributions to the PyLaia library, focusing on the incorporation of reliable confidence scores and the integration of statistical language modeling during decoding. Our implementation provides an easy way to combine PyLaia with n-gram language models at different levels. One of the highlights of this work is that language models are completely auto-tuned: they can be built and used easily without any expert knowledge, and without requiring any additional data. To demonstrate the significance of our contribution, we evaluate PyLaia’s performance on twelve datasets, both with and without language modeling. The results show that decoding with small language models reduces the Word Error Rate by 13% and the Character Error Rate by 12% on average. Additionally, we conduct an analysis of confidence scores and highlight the importance of calibration techniques. Our implementation is publicly available in the official PyLaia repository (https://gitlab.teklia.com/atr/pylaia), and twelve open-source models are released on Hugging Face (https://huggingface.co/collections/Teklia/pylaia-65f16e9ae0aa03690e9e9f80).
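To make the decoding strategy concrete, the sketch below shows how an n-gram language model can be plugged into CTC beam-search decoding using torchaudio's ctc_decoder (the decoder referenced in note 25). It is a minimal illustration rather than PyLaia's own interface: the lexicon, token, and ARPA file names, the alphabet size, and the beam-search hyper-parameters are all placeholders.

```python
import torch
from torchaudio.models.decoder import ctc_decoder

# An n-gram model can be estimated beforehand from the training transcriptions,
# e.g. with KenLM (note 7): lmplz -o 6 < train_transcriptions.txt > language_model.arpa

decoder = ctc_decoder(
    lexicon="lexicon.txt",      # one entry per word, e.g. "hello h e l l o |"
    tokens="tokens.txt",        # the optical model's output alphabet, one token per line
    lm="language_model.arpa",   # KenLM ARPA or binary n-gram model
    nbest=1,
    beam_size=100,
    lm_weight=1.5,              # balances the language-model score against the optical score
    word_score=0.0,
)

# emissions: CPU float32 tensor of shape (batch, frames, num_tokens) holding
# per-frame log-probabilities produced by the optical model (random here).
emissions = torch.randn(1, 200, 60).log_softmax(dim=-1)
hypotheses = decoder(emissions)

best = hypotheses[0][0]         # best hypothesis for the first image
print(" ".join(best.words), best.score)
```

Passing lexicon=None together with a character-level ARPA model switches the same decoder to lexicon-free, character-level rescoring, which loosely mirrors the "different levels" mentioned above; lm_weight is the main knob that any auto-tuning procedure would have to set.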


Notes

  1. https://readcoop.eu/transkribus/.
  2. http://www.transkriptorium.com/.
  3. https://escriptorium.fr/.
  4. https://doc.arkindex.org/.
  5. https://gitlab.teklia.com/atr/pylaia.
  6. http://www.speech.sri.com/projects/srilm/.
  7. https://github.com/kpu/kenlm.
  8. https://atr.pages.teklia.com/pylaia/.
  9. https://huggingface.co/Teklia.
  10. https://github.com/kaldi-asr/kaldi.
  11. https://github.com/mittagessen/kraken.
  12. https://gitlab.teklia.com/atr/pylaia.
  13. https://github.com/arthurflor23/handwritten-text-recognition.
  14. https://black.readthedocs.io/en/stable/.
  15. https://pycqa.github.io/isort/.
  16. https://docs.astral.sh/ruff/.
  17. https://docs.pytest.org/en/7.4.x/.
  18. https://tox.wiki/en/latest/index.html.
  19. https://atr.pages.teklia.com/pylaia/.
  20. https://mkdocs.readthedocs.io/en/stable/.
  21. https://pypi.org/project/pylaia/.
  22. https://gitlab.teklia.com/atr/pylaia/-/releases.
  23. https://atr.pages.teklia.com/pylaia/releases/.
  24. https://www.docker.com/.
  25. https://pytorch.org/audio/main/generated/torchaudio.models.decoder.ctc_decoder.html.
  26. https://atr.pages.teklia.com/pylaia/usage.
  27. https://atr.pages.teklia.com/pylaia/get_started/development/.
  28. https://huggingface.co/Teklia/.
  29. https://parquet.apache.org.
  30. https://demo.arkindex.org/browse/5000e248-a624-4df1-8679-1b34679817ef?top_level=true&folder=true.


Acknowledgement

We thank Joan Puigcerver and Carlos Mocholí for implementing PyLaia and for allowing us to contribute. We also thank Stefan Weil for his recent contributions to PyLaia. This work was supported by the Research Council of Norway through the 328598 IKTPLUSS HuginMunin project.

Author information


Corresponding author

Correspondence to Solène Tarride.


Appendix

7.1 Impact of Language Models on Speed

During inference, computations benefit from GPU acceleration up to the language model decoder, which is limited to CPU execution. As a result, including a language model in the decoding process improves recognition quality but significantly slows down prediction. Specifically, PyLaia's decoding speed drops roughly tenfold when a language model is applied, as detailed in Table 6. We therefore recommend using language models for batch processing of documents, and caution against their use in real-time scenarios due to the associated computational overhead.

Table 6. Impact of language models on decoding time (s/image). Experiments are carried out on an NVIDIA GeForce RTX 3080 Ti GPU.
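To reproduce this kind of measurement, one could time greedy (best-path) decoding against n-gram beam-search decoding on the same batch of network outputs, as in the sketch below. This is a simplified harness, not the benchmarking code behind Table 6; the emissions tensor, file paths, and decoder settings are placeholders.

```python
import time
import torch
from torchaudio.models.decoder import ctc_decoder

device = "cuda" if torch.cuda.is_available() else "cpu"
batch = 16
# Placeholder per-frame log-probabilities standing in for the optical model's output.
emissions = torch.randn(batch, 200, 60, device=device).log_softmax(dim=-1)

# Greedy decoding: a single argmax per frame, fully GPU-accelerated.
start = time.perf_counter()
greedy_paths = emissions.argmax(dim=-1)
if device == "cuda":
    torch.cuda.synchronize()
greedy_time = time.perf_counter() - start

# Language-model decoding: the beam-search decoder only runs on CPU.
lm_decoder = ctc_decoder(
    lexicon="lexicon.txt",
    tokens="tokens.txt",
    lm="language_model.arpa",
    beam_size=100,
    lm_weight=1.5,
)
start = time.perf_counter()
hypotheses = lm_decoder(emissions.cpu())
lm_time = time.perf_counter() - start

print(f"greedy: {greedy_time / batch:.4f} s/image, with LM: {lm_time / batch:.4f} s/image")
```

The gap between the two timings grows with the beam size and the size of the language model, which is why batch processing is recommended above.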


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Tarride, S., Schneider, Y., Generali-Lince, M., Boillet, M., Abadie, B., Kermorvant, C. (2024). Improving Automatic Text Recognition with Language Models in the PyLaia Open-Source Library. In: Barney Smith, E.H., Liwicki, M., Peng, L. (eds) Document Analysis and Recognition - ICDAR 2024. ICDAR 2024. Lecture Notes in Computer Science, vol 14808. Springer, Cham. https://doi.org/10.1007/978-3-031-70549-6_23

  • DOI: https://doi.org/10.1007/978-3-031-70549-6_23

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-70548-9

  • Online ISBN: 978-3-031-70549-6

  • eBook Packages: Computer Science, Computer Science (R0)
