A Comparison of Character-Based Neural Machine Translations Techniques Applied to Spelling Normalization

Miguel Domingo &
Francisco Casacuberta

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12667)

Included in the following conference series:

International Conference on Pattern Recognition

Abstract

The lack of spelling conventions and the natural evolution of human language create a linguistic barrier inherent in historical documents. This barrier has always been a concern for scholars in the humanities. To tackle this problem, spelling normalization aims to adapt a document’s orthography to modern standards. In this work, we evaluate several character-based neural machine translation approaches to normalization, using modern documents to enrich the neural models. We evaluate these approaches on several datasets from different languages and time periods, and conclude that each approach is better suited to a different set of documents.
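As a rough illustration of what "character-based" means in this setting, the sketch below splits a sentence into character tokens so that a sequence-to-sequence model could map historical spellings to modern ones one character at a time. It is a minimal sketch, not the authors' preprocessing: the function names, the `<space>` boundary marker, and the example sentence are all assumptions introduced here.

```python
# Hypothetical marker standing in for the original whitespace between words.
# Assumption: this string never occurs in the text being segmented.
WORD_BOUNDARY = "<space>"


def to_char_tokens(sentence: str, boundary: str = WORD_BOUNDARY) -> str:
    """Split a sentence into space-separated characters.

    Word boundaries are kept as an explicit token so the original
    spacing can be restored after translation.
    """
    words = sentence.split()
    return f" {boundary} ".join(" ".join(word) for word in words)


def from_char_tokens(tokens: str, boundary: str = WORD_BOUNDARY) -> str:
    """Invert to_char_tokens: glue characters back into words."""
    words = tokens.split(f" {boundary} ")
    return " ".join(word.replace(" ", "") for word in words)


if __name__ == "__main__":
    # Illustrative input only; not taken from the paper's datasets.
    historical = "vn cauallero"
    segmented = to_char_tokens(historical)
    print(segmented)
    print(from_char_tokens(segmented))
```

Working at the character level keeps the vocabulary tiny and lets a model handle spellings it has never seen as whole words, at the cost of much longer input and output sequences.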

Acknowledgments

The research leading to these results has received funding from the European Union through Programa Operativo del Fondo Europeo de Desarrollo Regional (FEDER) from Comunitat Valenciana (2014–2020) under project IDIFEDER/2018/025; from Ministerio de Economía y Competitividad under project PGC2018-096212-B-C31; and from Generalitat Valenciana (GVA) under project PROMETEO/2019/121. We gratefully acknowledge the support of NVIDIA Corporation with the donation of a GPU used for part of this research.

Author information

Authors and Affiliations

PRHLT Research Center, Universitat Politècnica de València, Valencia, Spain
Miguel Domingo & Francisco Casacuberta

Corresponding author

Correspondence to Miguel Domingo.

Editor information

Editors and Affiliations

Dipartimento di Ingegneria dell’Informazione, University of Firenze, Firenze, Italy
Alberto Del Bimbo
Dipartimento di Ingegneria “Enzo Ferrari”, Università di Modena e Reggio Emilia, Modena, Italy
Rita Cucchiara
Department of Computer Science, Boston University, Boston, MA, USA
Stan Sclaroff
Dipartimento di Matematica e Informatica, University of Catania, Catania, Italy
Giovanni Maria Farinella
Cloud & AI, JD.COM, Beijing, China
Tao Mei
Dipartimento di Ingegneria dell’Informazione, University of Firenze, Firenze, Italy
Marco Bertini
Computational Sciences Department, National Institute of Astrophysics, Optics and Electronics (INAOE), Tonantzintla, Puebla, Mexico
Hugo Jair Escalante
Dipartimento di Ingegneria “Enzo Ferrari”, Università di Modena e Reggio Emilia, Modena, Italy
Roberto Vezzani

About this paper

Cite this paper

Domingo, M., Casacuberta, F. (2021). A Comparison of Character-Based Neural Machine Translations Techniques Applied to Spelling Normalization. In: Del Bimbo, A., et al. Pattern Recognition. ICPR International Workshops and Challenges. ICPR 2021. Lecture Notes in Computer Science, vol 12667. Springer, Cham. https://doi.org/10.1007/978-3-030-68787-8_24

DOI: https://doi.org/10.1007/978-3-030-68787-8_24
Published: 21 February 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-68786-1
Online ISBN: 978-3-030-68787-8
eBook Packages: Computer Science, Computer Science (R0)
