
Cross-lingual Sentence Embedding for Low-resource Chinese-Vietnamese Based on Contrastive Learning

Published: 16 June 2023

Abstract

Cross-lingual sentence embedding aims to map sentences with similar semantics in different languages close together in the representation space, while keeping semantically dissimilar sentences farther apart. It underlies many downstream tasks such as cross-lingual document matching and cross-lingual summary extraction. Existing work on cross-lingual sentence embedding focuses mainly on languages with large-scale corpora, whereas low-resource pairs such as Chinese-Vietnamese lack sentence-level parallel corpora and clear cross-lingual supervision signals, so these methods perform poorly on them. We therefore propose a cross-lingual sentence embedding method based on contrastive learning that effectively fine-tunes a powerful pre-trained model by constructing sentence-level positive and negative samples, avoiding the catastrophic forgetting that arises when a pre-trained model is fine-tuned only on a small set of aligned positive samples. First, we construct the training data by taking parallel Chinese-Vietnamese sentence pairs as positive examples and non-parallel sentence pairs as negative examples. Second, we build a siamese network that takes the positive and negative samples as input, computes a contrastive loss, and uses it to fine-tune our model. Experimental results show that our method effectively improves the semantic alignment accuracy of cross-lingual sentence embeddings for Chinese and Vietnamese.
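
The abstract describes a siamese encoder fine-tuned with a contrastive loss over parallel (positive) and non-parallel (negative) Chinese-Vietnamese sentence pairs. The sketch below is a minimal illustration of how such a setup could look; the mBERT checkpoint name, mean pooling, the temperature value, and the use of in-batch negatives are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch: siamese contrastive fine-tuning of a multilingual encoder on
# parallel Chinese-Vietnamese sentence pairs. Checkpoint, pooling, temperature,
# and in-batch negatives are assumptions, not the authors' exact configuration.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")
optimizer = torch.optim.AdamW(encoder.parameters(), lr=2e-5)
temperature = 0.05  # assumed value

def embed(sentences):
    """Mean-pool the last hidden states into fixed-size sentence vectors."""
    batch = tokenizer(sentences, padding=True, truncation=True,
                      max_length=128, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state           # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # (B, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # (B, H)

def contrastive_step(zh_sentences, vi_sentences):
    """One fine-tuning step: aligned (zh_i, vi_i) pairs are positives;
    every other pairing in the batch acts as a negative (InfoNCE-style)."""
    zh = F.normalize(embed(zh_sentences), dim=-1)
    vi = F.normalize(embed(vi_sentences), dim=-1)
    logits = zh @ vi.T / temperature       # pairwise cosine similarities
    labels = torch.arange(logits.size(0))  # diagonal entries are the positives
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with two parallel pairs; real training iterates over batches
# drawn from the parallel corpus.
loss = contrastive_step(
    ["他正在看书。", "今天天气很好。"],
    ["Anh ấy đang đọc sách.", "Hôm nay thời tiết rất đẹp."],
)
```

Because the same encoder embeds both the Chinese and the Vietnamese side, the two branches share weights, which is what makes the network siamese; the contrastive objective pulls the aligned pair together and pushes the non-parallel pairings apart.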




    Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 6
    June 2023
    635 pages
    ISSN: 2375-4699
    EISSN: 2375-4702
    DOI: 10.1145/3604597

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 16 June 2023
    Online AM: 18 April 2023
    Accepted: 13 March 2023
    Revised: 01 February 2023
    Received: 22 September 2022
    Published in TALLIP Volume 22, Issue 6


    Author Tags

    1. Chinese-Vietnamese
    2. low-resource language
    3. cross-lingual sentence embedding
    4. siamese network
    5. mBERT

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
    • Yunnan Provincial Major Science and Technology Special Plan Projects
    • General Projects of Basic Research in Yunnan Province
    • Kunming University of Science and Technology “double first-class” joint project


Cited By

    • (2025) Large Language Model Enhanced Logic Tensor Network for Stance Detection. Neural Networks 183, 106956. DOI: 10.1016/j.neunet.2024.106956. Online publication date: Mar-2025.
    • (2024) Integrating Event Elements for Chinese-Vietnamese Cross-Lingual Event Retrieval. IEICE Transactions on Information and Systems E107.D, 10, 1353–1361. DOI: 10.1587/transinf.2024EDP7055. Online publication date: 1-Oct-2024.
    • (2024) Commonsense-based adversarial learning framework for zero-shot stance detection. Neurocomputing 563, C. DOI: 10.1016/j.neucom.2023.126943. Online publication date: 1-Jan-2024.
    • (2024) Chain of Stance: Stance Detection with Large Language Models. Natural Language Processing and Chinese Computing, 82–94. DOI: 10.1007/978-981-97-9443-0_7. Online publication date: 2-Nov-2024.
