Pre-Training on Mixed Data for Low-Resource Neural Machine Translation
Figure 1. Examples of the cross-lingual language model (XLM) and of our method based on XLM. (a) XLM randomly masks some tokens and then predicts the masked tokens from the unmasked tokens; (b) our method based on XLM replaces some unmasked tokens with their translation tokens.
Figure 2. Examples of masked sequence to sequence (MASS) and of our method based on MASS. (a) MASS randomly masks contiguous tokens and then predicts the masked tokens from the unmasked tokens; (b) our method based on MASS replaces some unmasked tokens with their translation tokens.
Figure 3. The architecture of the word translation model. The embedding layer and the linear layer are initialized with the cross-lingual word embedding, and their parameters are frozen during the word translation task.
Figure 4. Perplexity (PPL) of the different pre-training models on the validation set as a function of the pre-training epoch. (a) Uyghur results of XLM and our method based on XLM; (b) Chinese results of XLM and our method based on XLM; (c) Uyghur results of MASS and our method based on MASS; (d) Chinese results of MASS and our method based on MASS.
Abstract
1. Introduction
- We propose a new pre-training method that exploits both monolingual and bilingual data to train the pre-training model;
- We conduct experiments on a Uyghur-Chinese corpus, which represents a realistic low-resource scenario. The results show that the model pre-trained with our method generalizes better and improves the performance of the translation model;
- We propose a word translation model to measure the alignment knowledge contained in the embeddings of other models;
- Using this word translation model, we show that our method allows translation models to learn more alignment knowledge.
2. Related Works
3. Our Method
4. Word Translation Model
5. Experiments and Results
5.1. Datasets and Preprocessing
5.2. Systems and Model Configurations
- No pre-training: All parameters of the translation model are initialized randomly, as in a traditional translation model;
- XLM [19]: We train the pre-training model with masked language modeling, as described in XLM;
- XLM + our method: Building on XLM, we train the pre-training model while additionally replacing unmasked words according to our method described above;
- MASS [20]: We train the pre-training model with masked sequence to sequence modeling, as described in MASS;
- MASS + our method: Building on MASS, we train the pre-training model while additionally replacing unmasked words according to our method described above.
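The replacement step that separates the two "+ our method" systems from their baselines can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `mask_and_replace`, the toy dictionary `lexicon`, and the rates `mask_prob` and `replace_prob` are all hypothetical, and the sketch does not model how translation tokens are actually obtained or which rates were used.

```python
import random

MASK = "[MASK]"  # placeholder mask symbol (assumption)


def mask_and_replace(tokens, lexicon, mask_prob=0.15, replace_prob=0.15, seed=None):
    """Build one training example as (inputs, targets).

    Step 1 (XLM-style): mask tokens at random; the model predicts them.
    Step 2 (the addition): among tokens left unmasked, swap some for a
    translation looked up in `lexicon`, a hypothetical bilingual dictionary.
    """
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            targets.append(tok)           # to be predicted from the context
        elif tok in lexicon and rng.random() < replace_prob:
            inputs.append(lexicon[tok])   # context token shown in the other language
            targets.append(None)          # not a prediction target
        else:
            inputs.append(tok)
            targets.append(None)
    return inputs, targets
```

For the MASS-based variant, step 1 would mask a contiguous span rather than independent positions, while step 2 would stay the same. Setting `replace_prob=0.0` recovers plain masking, which makes the two configurations easy to compare in this toy setting.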
5.3. Results
5.4. Analysis
6. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
- Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; Klingner, J.; et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv 2016, arXiv:1609.08144.
- Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; Dauphin, Y. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, Sydney, Australia, 6–11 August 2017.
- Hassan, H.; Aue, A.; Chen, C.; Chowdhary, V.; Clark, J.; Federmann, C.; Huang, X.; Junczys-Dowmunt, M.; Lewis, W.; Li, M. Achieving human parity on automatic Chinese to English news translation. arXiv 2018, arXiv:1803.05567.
- Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Montréal, QC, Canada, 8–13 December 2014.
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv 2014, arXiv:1409.0473.
- Luong, M.T.; Pham, H.; Manning, C.D. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
- Zoph, B.; Yuret, D.; May, J.; Knight, K. Transfer learning for low-resource neural machine translation. arXiv 2016, arXiv:1604.02201.
- Nguyen, T.Q.; Chiang, D. Transfer Learning across Low-Resource, Related Languages for Neural Machine Translation. arXiv 2017, arXiv:1708.09803.
- Gulcehre, C.; Firat, O.; Xu, K.; Cho, K.; Barrault, L.; Lin, H.; Bougares, F.; Schwenk, H.; Bengio, Y. On using monolingual corpora in neural machine translation. arXiv 2015, arXiv:1503.03535.
- Sennrich, R.; Haddow, B.; Birch, A. Improving Neural Machine Translation Models with Monolingual Data. Available online: https://arxiv.org/abs/1511.06709 (accessed on 18 March 2021).
- Zhang, J.; Zong, C. Exploiting source-side monolingual data in neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016.
- Xia, Y.; He, D.; Qin, T.; Wang, L.; Yu, N.; Liu, T.Y.; Ma, W.Y. Dual learning for machine translation. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016.
- Cheng, Y.; Xu, W.; He, Z.; He, W.; Wu, H.; Sun, M.; Liu, Y. Semi-Supervised Learning for Neural Machine Translation. Available online: https://link.springer.com/chapter/10.1007/978-981-32-9748-7_3 (accessed on 18 March 2021).
- Gu, J.; Wang, Y.; Chen, Y.; Cho, K.; Li, V.O.K. Meta-Learning for Low-Resource Neural Machine Translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9.
- Lample, G.; Conneau, A. Cross-lingual language model pretraining. arXiv 2019, arXiv:1901.07291.
- Song, K.; Tan, X.; Qin, T.; Lu, J.; Liu, T.Y. MASS: Masked sequence to sequence pre-training for language generation. arXiv 2019, arXiv:1905.02450.
- Brown, P.F.; Cocke, J.; Della Pietra, S.A. A statistical approach to machine translation. Comput. Linguist. 1990, 16, 79–85.
- Skorokhodov, I.; Rykachevskiy, A.; Emelyanenko, D.; Slotin, S.; Ponkratov, A. Semi-Supervised Neural Machine Translation with Language Models. Available online: https://www.aclweb.org/anthology/W18-2205.pdf (accessed on 18 March 2021).
- Zhu, J.; Xia, Y.; Wu, L. Incorporating BERT into neural machine translation. arXiv 2020, arXiv:2002.06823.
- Guo, J.; Zhang, Z.; Xu, L. Incorporating BERT into Parallel Sequence Decoding with Adapters. arXiv 2020, arXiv:2010.06138.
- Brown, P.F.; Della Pietra, S.A.; Della Pietra, V.J.; Mercer, R.L. The mathematics of statistical machine translation: Parameter estimation. Comput. Linguist. 1993, 19, 263–311.
- Dyer, C.; Chahuneau, V.; Smith, N.A. A simple, fast, and effective reparameterization of IBM Model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, GA, USA, 9–14 June 2013; pp. 644–648.
- Koehn, P.; Hoang, H.; Birch, A.; Callison-Burch, C.; Federico, M.; Bertoldi, N.; Cowan, B.; Shen, W.; Moran, C.; Zens, R.; et al. Moses: Open Source Toolkit for Statistical Machine Translation. Available online: https://www.aclweb.org/anthology/P07-2.pdf (accessed on 18 March 2021).
- Sennrich, R.; Haddow, B.; Birch, A. Neural machine translation of rare words with subword units. arXiv 2015, arXiv:1508.07909.
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
- Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A Method for Automatic Evaluation of Machine Translation. Available online: https://www.aclweb.org/anthology/P02-1040.pdf (accessed on 18 March 2021).
Dataset | Training Size | Validation Size | Test Size |
---|---|---|---|
Uyghur-Chinese para | 0.17 M | 1 k | 1 k |
Uyghur mono | 2.45 M | - | - |
Chinese mono | 2.62 M | - | - |

Training Data | System | uy-zh | zh-uy |
---|---|---|---|
bilingual data | no pre-training | 27.50 | 22.28 |
bilingual + monolingual data | XLM | 31.31 | 24.80 |
bilingual + monolingual data | XLM + our method | 31.83 | 25.51 |
bilingual + monolingual data | MASS | 31.34 | 24.58 |
bilingual + monolingual data | MASS + our method | 31.61 | 25.27 |

Total Num | Training Num | Validation Num | Test Num |
---|---|---|---|
35,159 | 30,000 | 1,159 | 4,000 |

Embedding | fastText | XLM | XLM + Our Method | MASS | MASS + Our Method |
---|---|---|---|---|---|
PPL | 19,610 | 31.57 | 5.88 | 52.93 | 10.67 |

NMT | No Pre-Training | XLM | XLM + Our Method | MASS | MASS + Our Method |
---|---|---|---|---|---|
PPL | 52.80 | 14.63 | 4.99 | 25.11 | 8.23 |
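
The PPL rows above report perplexity, where lower is better. As a reminder of how such a score is conventionally computed, the following minimal sketch takes the probabilities a model assigns to the reference tokens and returns the exponential of their average negative log-likelihood. The function name `perplexity` and the toy inputs are illustrative assumptions; this sketch does not reproduce how the scores were aggregated over the evaluation sets.

```python
import math


def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood of the reference tokens)."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)
```

Under this definition, a model that spreads its probability uniformly over four candidates for every token has a perplexity of 4, so a PPL near 5 can be read loosely as the model hesitating between about five equally likely options per word.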
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Share and Cite
Zhang, W.; Li, X.; Yang, Y.; Dong, R. Pre-Training on Mixed Data for Low-Resource Neural Machine Translation. Information 2021, 12, 133. https://doi.org/10.3390/info12030133