research-article

The Contribution of Selected Linguistic Markers for Unsupervised Arabic Verb Sense Disambiguation

Published: 24 August 2023

Abstract

Word sense disambiguation (WSD) is the task of automatically determining the meaning of a polysemous word in a specific context. Word sense induction is the unsupervised clustering of a word's usages across different contexts in order to distinguish its senses and perform unsupervised WSD. Most studies treat function words as stop words and delete them during pre-processing. However, function words can encode meaningful information that helps improve the performance of WSD approaches. In this work, we propose a novel approach to Arabic verb sense disambiguation: a preposition-based classification feeds an automatic word sense induction step, which builds the sense inventories used to disambiguate Arabic verbs.
Moreover, in the wake of the success of neural language models, recent work has obtained encouraging results using pre-trained BERT models for English WSD. Hence, we use contextualized word embeddings for unsupervised Arabic WSD (AWSD) based on linguistic markers and pre-trained Sentence-BERT Transformer models; this yields encouraging results that outperform other existing unsupervised neural AWSD approaches.
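
The idea of using a preposition as a linguistic marker for sense induction can be illustrated with a minimal sketch. This is not the authors' implementation: the transliterated preposition set, the toy sentences, the English glosses, and the function names `marker_after_verb` and `induce_inventory` are all assumptions made for illustration, and the Sentence-BERT embedding and clustering stage is omitted entirely.

```python
from collections import defaultdict

# Hypothetical closed set of transliterated Arabic prepositions (illustrative only).
PREPOSITIONS = {"fi", "ila", "ala", "an", "min", "bi", "li"}

# Toy transliterated usages of the verb "raghiba"; glosses are approximate.
SENTENCES = [
    "raghiba al-rajul fi al-safar",   # roughly "the man desired to travel"
    "raghiba al-talib an al-dars",    # roughly "the student turned away from the lesson"
    "raghiba al-walad fi al-ilm",     # roughly "the boy desired knowledge"
]


def marker_after_verb(tokens, verb):
    """Return the first preposition occurring after `verb`, or None."""
    if verb not in tokens:
        return None
    start = tokens.index(verb) + 1
    for token in tokens[start:]:
        if token in PREPOSITIONS:
            return token
    return None


def induce_inventory(sentences, verb):
    """Group usages of `verb` by the preposition that follows it.

    Each group is a candidate sense cluster; usages with no
    preposition are collected under the key "<none>".
    """
    inventory = defaultdict(list)
    for sentence in sentences:
        tokens = sentence.split()
        if verb not in tokens:
            continue  # sentence does not contain the target verb
        marker = marker_after_verb(tokens, verb)
        inventory[marker or "<none>"].append(sentence)
    return dict(inventory)


inventory = induce_inventory(SENTENCES, "raghiba")
for marker, usages in inventory.items():
    print(marker, len(usages))
```

In this toy setting the usages split into two groups keyed by `fi` and `an`, which is the kind of contrast a stop-word filter would erase. A fuller pipeline would presumably embed each usage with a pre-trained sentence encoder and cluster within or across these groups, but that step depends on details the abstract does not give.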

    Information & Contributors

    Information

    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 8
    August 2023
    373 pages
    ISSN:2375-4699
    EISSN:2375-4702
    DOI:10.1145/3615980

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 24 August 2023
    Online AM: 11 July 2023
    Accepted: 09 June 2023
    Revised: 07 February 2023
    Received: 01 July 2021
    Published in TALLIP Volume 22, Issue 8

    Author Tags

    1. Natural language processing
    2. word sense disambiguation
    3. word sense induction
    4. Arabic language
    5. linguistic markers
    6. clustering
    7. contextualized word embeddings
    8. SBERT

    Qualifiers

    • Research-article

    Funding Sources

    • Research Center for Scientific and Technical Information (CERIST)
    • Laboratory of Research in Artificial Intelligence (LRIA)
    • Algerian Directorate General for Scientific Research and Technology Development (DGRSTD)
