
Cross-lingual Sentence Embedding for Low-resource Chinese-Vietnamese Based on Contrastive Learning

Published: 16 June 2023

Abstract

Cross-lingual sentence embedding aims to map sentences with similar semantics in different languages close together in the representation space, while keeping semantically dissimilar sentences farther apart. It underlies many downstream tasks such as cross-lingual document matching and cross-lingual summary extraction. Existing work on cross-lingual sentence embedding focuses mainly on languages with large-scale corpora, whereas low-resource pairs such as Chinese-Vietnamese lack sentence-level parallel corpora and clear cross-lingual supervision signals, so these methods perform poorly on them. We therefore propose a cross-lingual sentence embedding method based on contrastive learning that effectively fine-tunes a powerful pre-trained model by constructing sentence-level positive and negative samples, avoiding the catastrophic forgetting that arises when a pre-trained model is fine-tuned only on a small set of aligned positive samples. First, we construct the training data by taking parallel Chinese-Vietnamese sentence pairs as positive examples and non-parallel sentence pairs as negative examples. Second, we build a siamese network that takes the positive and negative samples as input, computes a contrastive loss, and uses it to fine-tune our model. Experimental results show that our method effectively improves the semantic alignment accuracy of cross-lingual sentence embeddings for Chinese and Vietnamese.
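
The abstract describes a siamese encoder fine-tuned with a contrastive loss over parallel (positive) and non-parallel (negative) Chinese-Vietnamese sentence pairs. The sketch below is a minimal illustration of how such a setup could look; the mBERT checkpoint name, mean pooling, the temperature value, and the use of in-batch negatives are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch: siamese contrastive fine-tuning of a multilingual encoder on
# parallel Chinese-Vietnamese sentence pairs. Checkpoint, pooling, temperature,
# and in-batch negatives are assumptions, not the authors' exact configuration.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")
optimizer = torch.optim.AdamW(encoder.parameters(), lr=2e-5)
temperature = 0.05  # assumed value

def embed(sentences):
    """Mean-pool the last hidden states into fixed-size sentence vectors."""
    batch = tokenizer(sentences, padding=True, truncation=True,
                      max_length=128, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state           # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # (B, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # (B, H)

def contrastive_step(zh_sentences, vi_sentences):
    """One fine-tuning step: aligned (zh_i, vi_i) pairs are positives;
    every other pairing in the batch acts as a negative (InfoNCE-style)."""
    zh = F.normalize(embed(zh_sentences), dim=-1)
    vi = F.normalize(embed(vi_sentences), dim=-1)
    logits = zh @ vi.T / temperature       # pairwise cosine similarities
    labels = torch.arange(logits.size(0))  # diagonal entries are the positives
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with two parallel pairs; real training iterates over batches
# drawn from the parallel corpus.
loss = contrastive_step(
    ["他正在看书。", "今天天气很好。"],
    ["Anh ấy đang đọc sách.", "Hôm nay thời tiết rất đẹp."],
)
```

Because the same encoder embeds both the Chinese and the Vietnamese side, the two branches share weights, which is what makes the network siamese; the contrastive objective pulls the aligned pair together and pushes the non-parallel pairings apart.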




    Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 6
    June 2023
    635 pages
    ISSN: 2375-4699
    EISSN: 2375-4702
    DOI: 10.1145/3604597

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 16 June 2023
    Online AM: 18 April 2023
    Accepted: 13 March 2023
    Revised: 01 February 2023
    Received: 22 September 2022
    Published in TALLIP Volume 22, Issue 6


    Author Tags

    1. Chinese-Vietnamese
    2. low-resource language
    3. cross-lingual sentence embedding
    4. siamese network
    5. mBERT

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
    • Yunnan Provincial Major Science and Technology Special Plan Projects
    • General Projects of Basic Research in Yunnan Province
    • Kunming University of Science and Technology “double first-class” joint project


Cited By

    • (2025) Large Language Model Enhanced Logic Tensor Network for Stance Detection. Neural Networks 183, 106956. DOI: 10.1016/j.neunet.2024.106956. Online publication date: Mar-2025.
    • (2024) Integrating Event Elements for Chinese-Vietnamese Cross-Lingual Event Retrieval. IEICE Transactions on Information and Systems E107.D, 10, 1353–1361. DOI: 10.1587/transinf.2024EDP7055. Online publication date: 1-Oct-2024.
    • (2024) Commonsense-based adversarial learning framework for zero-shot stance detection. Neurocomputing 563, C. DOI: 10.1016/j.neucom.2023.126943. Online publication date: 1-Jan-2024.
    • (2024) Chain of Stance: Stance Detection with Large Language Models. Natural Language Processing and Chinese Computing, 82–94. DOI: 10.1007/978-981-97-9443-0_7. Online publication date: 2-Nov-2024.
