Abstract
PyLaia is one of the most popular open-source software libraries for Automatic Text Recognition (ATR), delivering strong performance in terms of speed and accuracy. In this paper, we outline our recent contributions to the PyLaia library, focusing on the incorporation of reliable confidence scores and the integration of statistical language modeling during decoding. Our implementation provides an easy way to combine PyLaia with n-gram language models at different levels. One of the highlights of this work is that language models are completely auto-tuned: they can be built and used easily without any expert knowledge, and without requiring any additional data. To demonstrate the significance of our contribution, we evaluate PyLaia’s performance on twelve datasets, both with and without language modeling. The results show that decoding with small language models improves the Word Error Rate by 13% and the Character Error Rate by 12% on average. Additionally, we conduct an analysis of confidence scores and highlight the importance of calibration techniques. Our implementation is publicly available in the official PyLaia repository (https://gitlab.teklia.com/atr/pylaia), and twelve open-source models are released on Hugging Face (https://huggingface.co/collections/Teklia/pylaia-65f16e9ae0aa03690e9e9f80).
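For readers unfamiliar with the two metrics, the following is a minimal sketch of how Character Error Rate and Word Error Rate are commonly computed from a Levenshtein distance, together with a relative improvement. It is not PyLaia's evaluation code: all function names are hypothetical, and treating the reported 13% and 12% gains as relative reductions is an assumption, since the abstract does not say whether they are relative or absolute.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            # deletion, insertion, substitution (or match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, tokenize):
    """Total edit distance divided by total reference length, in percent."""
    errors = sum(edit_distance(tokenize(r), tokenize(h)) for r, h in zip(refs, hyps))
    length = sum(len(tokenize(r)) for r in refs)
    return 100.0 * errors / length


def cer(refs, hyps):
    """Character Error Rate: tokens are individual characters."""
    return error_rate(refs, hyps, list)


def wer(refs, hyps):
    """Word Error Rate: tokens are whitespace-separated words."""
    return error_rate(refs, hyps, str.split)


def relative_improvement(baseline, improved):
    """Relative reduction of an error rate, in percent (assumed definition)."""
    return 100.0 * (baseline - improved) / baseline
```

Under this assumed definition, a hypothetical drop in CER from 10.0% to 8.8% would count as a 12% relative improvement.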
Acknowledgement
We thank Joan Puigcerver and Carlos Mocholí for implementing PyLaia and for allowing us to contribute. We also thank Stefan Weil for his recent contributions to PyLaia. This work was supported by the Research Council of Norway through the 328598 IKTPLUSS HuginMunin project.
Appendix
7.1 Impact of Language Models on Speed
During inference, computations can take advantage of GPU acceleration up to the language model decoder, which is limited to CPU execution. As a result, including a language model in the decoding process improves recognition results but significantly slows down prediction. Specifically, the decoding speed of PyLaia drops tenfold when a language model is applied, as detailed in Table 6. We therefore recommend using language models for batch processing of documents, and caution against their use in real-time scenarios because of the associated computational overhead.
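To make the trade-off concrete, the sketch below estimates batch processing time with and without a language model. Only the tenfold slowdown factor comes from the text above. The function name and the baseline throughput in the usage note are hypothetical and are not taken from Table 6.

```python
def estimated_hours(num_lines, lines_per_second, slowdown=1.0):
    """Estimate wall-clock hours needed to decode `num_lines` text lines.

    `lines_per_second` is the decoding throughput without a language model;
    `slowdown` is the factor by which decoding gets slower when a language
    model is applied (10.0 for the tenfold reduction reported above).
    """
    if lines_per_second <= 0 or slowdown <= 0:
        raise ValueError("lines_per_second and slowdown must be positive")
    seconds = num_lines * slowdown / lines_per_second
    return seconds / 3600.0


# Hypothetical baseline of 100 lines per second on a corpus of 360,000 lines.
without_lm = estimated_hours(360_000, 100.0)
with_lm = estimated_hours(360_000, 100.0, slowdown=10.0)
```

With these assumed numbers, the corpus takes about one hour without a language model and about ten hours with one. That is acceptable for an overnight batch job but not for interactive use.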
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Tarride, S., Schneider, Y., Generali-Lince, M., Boillet, M., Abadie, B., Kermorvant, C. (2024). Improving Automatic Text Recognition with Language Models in the PyLaia Open-Source Library. In: Barney Smith, E.H., Liwicki, M., Peng, L. (eds) Document Analysis and Recognition - ICDAR 2024. ICDAR 2024. Lecture Notes in Computer Science, vol 14808. Springer, Cham. https://doi.org/10.1007/978-3-031-70549-6_23
DOI: https://doi.org/10.1007/978-3-031-70549-6_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70548-9
Online ISBN: 978-3-031-70549-6
eBook Packages: Computer Science, Computer Science (R0)