
Urdu Short Paraphrase Detection at Sentence Level

Published: 12 April 2023

Abstract

Paraphrase detection systems uncover the relationship between two text fragments and classify them as paraphrased when they convey the same idea, and as non-paraphrased otherwise. Previously, researchers have mainly focused on developing paraphrase detection resources for the English language, and there have been very few efforts for South Asian languages. In particular, no research has been conducted on sentence-level paraphrase detection in Urdu, a low-resourced language, mainly because no corpora focusing on the sentence level are available; the existing studies on Urdu address only text reuse detection at the passage and document levels. Therefore, this study aims to develop a large-scale, manually annotated benchmark corpus for Urdu paraphrase detection at the sentence level, based on real cases from journalism. The proposed Urdu Sentential Paraphrases (USP) corpus contains 4,900 sentences (2,941 paraphrased and 1,959 non-paraphrased), manually collected from Urdu newspapers. Moreover, as a secondary contribution, several techniques were proposed, developed, and compared, including Word Embedding (WE), Sentence Transformers (ST), and feature-fusion techniques, with N-gram treated as the baseline. The experimental results indicate that the proposed feature-fusion technique is the most suitable for the Urdu paraphrase detection task, and that performance increases when the features of the proposed ST technique and the N-gram baseline are combined for classification. In addition, the proposed techniques were applied to the UPPC corpus to assess their performance at the document level, where the best result was obtained with the feature-fusion technique (F1 = 0.855). Our corpus is freely available to download for research purposes.
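
To make the described approach concrete, below is a minimal illustrative sketch (in Python) of sentence-level paraphrase classification with feature fusion, combining word n-gram overlap scores with a sentence-transformer cosine similarity and feeding both to a classifier. This is not the authors' implementation: the model checkpoint, the exact feature set, the classifier, and the toy sentence pairs are assumptions chosen only to illustrate the idea, and the sketch relies on the sentence-transformers and scikit-learn libraries.

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

def ngram_overlap(a: str, b: str, n: int) -> float:
    # Containment-style word n-gram overlap between two sentences.
    def grams(text):
        tokens = text.split()
        return {tuple(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 0))}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / min(len(ga), len(gb)) if ga and gb else 0.0

# A multilingual sentence-transformer checkpoint that covers Urdu (an assumption;
# any multilingual ST model could be substituted here).
st_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def fused_features(pairs):
    # Fuse the ST cosine similarity with word 1/2/3-gram overlap scores.
    emb1 = st_model.encode([a for a, _ in pairs], normalize_embeddings=True)
    emb2 = st_model.encode([b for _, b in pairs], normalize_embeddings=True)
    cosine = np.sum(emb1 * emb2, axis=1, keepdims=True)
    overlaps = np.array([[ngram_overlap(a, b, n) for n in (1, 2, 3)] for a, b in pairs])
    return np.hstack([cosine, overlaps])

# Toy usage with hypothetical labelled pairs (1 = paraphrased, 0 = non-paraphrased);
# in practice the features would be computed over the USP sentence pairs.
train_pairs = [("he arrived late to the meeting", "he was late for the meeting"),
               ("the market closed higher today", "the festival starts next week")]
classifier = LogisticRegression().fit(fused_features(train_pairs), [1, 0])
print(classifier.predict(fused_features([("she left early", "she departed early")])))

Under this sketch, the N-gram baseline corresponds to using only the overlap columns, the ST technique to using only the cosine similarity, and feature fusion to concatenating both, which is the combination the abstract reports as performing best.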

References

[1]
Sawood Alam, Fateh ud din B. Mehmood, and Michael L. Nelson. 2015. Improving accessibility of archived raster dictionaries of complex script languages. In Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL’15). ACM, 47–56. DOI:
[2]
Faisal Alvi, El-Sayed M. El-Alfy, Wasfi G. Al-Khatib, and Radwan E. Abdel-Aal. 2012. Analysis and extraction of sentence-level paraphrase sub-corpus in CS education. In Proceedings of the 13th Annual Conference on Information Technology Education (SIGITE’12). Association for Computing Machinery, New York, NY, 49–54. DOI:
[3]
Alberto Barrón-Cedeño, Marta Vila, M. Antònia Martí, and Paolo Rosso. 2013. Plagiarism meets paraphrasing: Insights for the next generation in automatic plagiarism detection. Comput. Linguist. 39, 4 (2013), 917–947.
[4]
Regina Barzilay and Lillian Lee. 2003. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology. Association for Computational Linguistics, 16–23.
[5]
Vuk Batanović, Bojan Furlan, and Boško Nikolić. 2011. A software system for determining the semantic similarity of short texts in Serbian. In Proceedings of the 19th Telecommunications Forum (TELFOR). IEEE, 1249–1252.
[6]
Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 632–642. DOI:
[7]
Alexander Budanitsky and Graeme Hirst. 2006. Evaluating WordNet-based measures of lexical semantic relatedness. Comput. Linguist. 32, 1 (Mar. 2006), 13–47. DOI:
[8]
Steven Burrows, Martin Potthast, and Benno Stein. 2013. Paraphrase acquisition via crowdsourcing and machine learning. ACM Trans. Intell. Syst. Technol. 4, 3 (2013), 43.
[9]
Paul Clough and Rob Gaizauskas. 2009. Corpora and Text Re-use. De Gruyter Mouton, 1249–1271. DOI:
[10]
Paul Clough, Robert Gaizauskas, Scott S. L. Piao, and Yorick Wilks. 2002. Meter: Measuring text reuse. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 152–159.
[11]
Paul Clough and Mark Stevenson. 2011. Developing a corpus of plagiarised short answers. Lang. Resour. Eval. 45, 1 (2011), 5–24.
[12]
Trevor Cohn, Chris Callison-Burch, and Mirella Lapata. 2008. Constructing corpora for the development and evaluation of paraphrase systems. Comput. Linguist. 34, 4 (2008), 597–614.
[13]
Ali Daud, Wahab Khan, and Dunren Che. 2017. Urdu language processing: A survey. Artif. Intell. Rev. 47, 3 (2017), 279–311.
[14]
Seniz Demir, Ilknur Durgar El-Kahlout, Erdem Unal, and Hamza Kaya. 2012. Turkish paraphrase corpus. In Proceedings of the 8th International Conference on Language Resources and Evaluation, LREC 2012, Istanbul, Turkey, May 23-25, 2012, Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Ugur Dogan, Bente Maegaard, Joseph Mariani, Jan Odijk, and Stelios Piperidis (Eds.). European Language Resources Association (ELRA), 4087–4091. Retrieved from http://www.lrec-conf.org/proceedings/lrec2012/summaries/968.html.
[15]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 4171–4186. DOI:
[16]
William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the 3rd International Workshop on Paraphrasing. 9–16.
[17]
Asif Ekbal, Sriparna Saha, and Gaurav Choudhary. 2012. Plagiarism detection in text using vector space model. In Proceedings of the 12th International Conference on Hybrid Intelligent Systems (HIS). 366–371.
[18]
Mohamed I. El Desouki, Wael H. Gomaa, and Hawaf Abdalhakim. 2019. A hybrid model for paraphrase detection combines pros of text similarity with deep learning. Int. J. Comput. Appl. 975 (2019), 8887.
[19]
Anthony Fader, Luke Zettlemoyer, and Oren Etzioni. 2013. Paraphrase-driven learning for open question answering. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics. 1608–1618.
[20]
Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. 2022. Language-agnostic BERT sentence embedding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 878–891. DOI:
[21]
Alena Fenogenova. 2021. Russian paraphrasers: Paraphrase with transformers. In Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing. 11–19.
[22]
Samuel Fernando and Mark Stevenson. 2008. A semantic similarity approach to paraphrase detection. In Proceedings of the 11th Annual Research Colloquium of the UK Special Interest Group for Computational Linguistics.
[23]
Chris Forsythe, Michael L. Bernard, and Timothy E. Goldsmith. 2006. Cognitive Systems: Human Cognitive Models in Systems Design. Psychology Press.
[24]
Veena Gangadharan, Deepa Gupta, L. Amritha, and T. A. Athira. 2020. Paraphrase detection using deep neural network-based word embedding techniques. In Proceedings of the 4th International Conference on Trends in Electronics and Informatics (ICOEI). 517–521. DOI:
[25]
Sahar Ghannay, Benoit Favre, Yannick Esteve, and Nathalie Camelin. 2016. Word embedding evaluation and combination. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC’16). 300–305.
[26]
Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC’18). European Language Resources Association (ELRA). Retrieved from https://aclanthology.org/L18-1550.
[27]
Vadim Gudkov, Olga Mitrofanova, and Elizaveta Filippskikh. 2020. Automatically ranked Russian paraphrase corpus for text generation. In Proceedings of the 4th Workshop on Neural Generation and Translation. Association for Computational Linguistics, 54–59. DOI:
[28]
Xiao Guo, Hengameh Mirzaalian, Ekraam Sabir, Ayush Jaiswal, and Wael Abd-Almageed. 2020. CORD19STS: COVID-19 Semantic Textual Similarity Dataset. arxiv:cs.CL/2007.02461.
[29]
Yaakov HaCohen-Kerner, Zuriel Gross, and Asaf Masa. 2005. Automatic extraction and learning of keyphrases from scientific articles. Lect. Notes Comput. Sci. 3406 (2005), 657–669. DOI:
[30]
Samar Haider. 2018. Urdu word embeddings. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC’18). European Language Resources Association (ELRA). Retrieved from https://aclanthology.org/L18-1155.
[31]
Mena Hany and Wael H. Gomaa. 2022. A hybrid approach to paraphrase detection based on text similarities and machine learning classifiers. In Proceedings of the 2nd International Mobile, Intelligent, and Ubiquitous Computing Conference (MIUCC). 343–348. DOI:
[32]
Hangfeng He, Qiang Ning, and Dan Roth. 2020. QuASE: Question-answer driven sentence encoding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 8743–8758. DOI:
[33]
Ethan Hunt, Ritvik Janamsetty, Chanana Kinares, Chanel Koh, Alexis Sanchez, Felix Zhan, Murat Ozdemir, Shabnam Waseem, Osman Yolcu, Binay Dahal, Justin Zhan, Laxmi Gewali, and Paul Oh. 2019. Machine learning models for paraphrase identification and its applications on plagiarism detection. In Proceedings of the IEEE International Conference on Big Knowledge (ICBK). 97–104. DOI:
[34]
Safia Kanwal, Kamran Malik, Khurram Shahzad, Faisal Aslam, and Zubair Nawaz. 2019. Urdu named entity recognition: Corpus generation and deep learning applications. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 19, 1 (2019), 1–13.
[35]
Pei Ke, Haozhe Ji, Siyang Liu, Xiaoyan Zhu, and Minlie Huang. 2020. SentiLARE: Linguistic knowledge enhanced language representation for sentiment analysis. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). 6975–6988.
[36]
Khadijeh Khoshnavataher, Vahid Zarrabi, Salar Mohtaj, and Habibollah Asghari. 2015. Developing monolingual Persian corpus for extrinsic plagiarism detection using artificial obfuscation: Notebook for PAN at CLEF 2015. In Working Notes of CLEF 2015 - Conference and Labs of the Evaluation Forum, Toulouse, France, September 8-11, 2015 (CEUR Workshop Proceedings), Vol. 1391. CEUR-WS.org. Retrieved from http://ceur-ws.org/Vol-1391/146-CR.pdf.
[37]
Alfirna Rizqi Lahitani, Adhistya Erna Permanasari, and Noor Akhmad Setiawan. 2016. Cosine similarity to determine similarity measure: Study case in online essay assessment. In Proceedings of the 4th International Conference on Cyber and IT Service Management. IEEE, 1–6.
[38]
Arthur Malajyan, Karen Avetisyan, and Tsolak Ghukasyan. 2020. ARPA: Armenian Paraphrase Detection Corpus and Models. arxiv:cs.CL/2009.12615.
[39]
Riccardo Massidda. 2020. rmassidda@DaDoEval: Document dating using sentence embeddings at EVALITA 2020. In Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA’20).
[40]
Tony McEnery, Paul Baker, and Lou Burnard. 2000. Corpus resources and minority language engineering. In Proceedings of the International Conference on Language Resources and Evaluation.
[41]
Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic regularities in continuous space word representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 746–751.
[42]
Shervin Minaee, Nal Kalchbrenner, Erik Cambria, Narjes Nikzad, Meysam Chenaghlu, and Jianfeng Gao. 2021. Deep learning based text classification: A comprehensive review. ACM Comput. Surv. 54, 3 (2021), 1–40.
[43]
Yusuke Mori, Hiroaki Yamane, Yusuke Mukuta, and Tatsuya Harada. 2020. Finding and generating a missing part for story completion. In Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature. 156–166.
[44]
Stanislav Naumov, Grigory Yaroslavtsev, and Dmitrii Avdiukhin. 2021. Objective-based hierarchical clustering of deep embedding vectors. In Proceedings of the AAAI Conference on Artificial Intelligence. 9055–9063.
[45]
Jakob Navrozidis and Hannes Jansson. 2020. Using Natural Language Processing to Identify Similar Patent Documents. LU-CS-EX (2020).
[46]
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). 1532–1543.
[47]
Lidia Pivovarova, Ekaterina Pronoza, Elena Yagunova, and Anton Pronoza. 2017. ParaPhraser: Russian paraphrase corpus and shared task. In Proceedings of the Conference on Artificial Intelligence and Natural Language. Springer, 211–225.
[48]
Martin Potthast, Benno Stein, Alberto Barrón-Cedeño, and Paolo Rosso. 2010. An evaluation framework for plagiarism detection. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters. Association for Computational Linguistics, 997–1005.
[49]
Ekaterina Pronoza, Elena Yagunova, and Anton Pronoza. 2016. Construction of a Russian paraphrase corpus: Unsupervised paraphrase extraction. In Information Retrieval. Springer, 146–157.
[50]
Ricardo Rei, Craig Stewart, Ana C. Farinha, and Alon Lavie. 2020. COMET: A neural framework for MT evaluation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2685–2702. DOI:
[51]
Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 3982–3992.
[52]
Nils Reimers and Iryna Gurevych. 2020. Making monolingual sentence embeddings multilingual using knowledge distillation. arxiv:cs.CL/2004.09813.
[53]
Philip Resnik, Olivia Buzek, Chang Hu, Yakov Kronrod, Alex Quinn, and Benjamin B. Bederson. 2010. Improving translation via targeted paraphrasing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 127–137.
[54]
Reyhaneh Sadeghi, Hamed Karbasi, and Ahmad Akbari. 2022. ExaPPC: A large-scale Persian paraphrase detection corpus. In Proceedings of the 8th International Conference on Web Research (ICWR). 168–175. DOI:
[55]
Sara Sameen, Muhammad Sharjeel, Rao Muhammad Adeel Nawab, Paul Rayson, and Iqra Muneer. 2018. Measuring short text reuse for the Urdu language. IEEE Access 6, 1 (2018), 7412–7421. DOI:
[56]
Hassan Shahmohammadi, MirHossein Dezfoulian, and Muharram Mansoorizadeh. 2021. Paraphrase detection using LSTM networks and handcrafted features. Multim. Tools Applic. 80, 4 (2021), 6479–6492.
[57]
Muhammad Sharjeel, Rao Muhammad Adeel Nawab, and Paul Rayson. 2017. COUNTER: Corpus of Urdu news text reuse. Lang. Resour. Eval. 51, 3 (Sept. 2017), 777–803. DOI:
[58]
Muhammad Sharjeel, Paul Rayson, and Rao Muhammad Adeel Nawab. 2016. UPPC-Urdu paraphrase plagiarism corpus. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC’16). 1832–1836.
[59]
Yusuke Shinyama and Satoshi Sekine. 2003. Paraphrase acquisition for information extraction. In Proceedings of the 2nd International Workshop on Paraphrasing. Association for Computational Linguistics, 65–71.
[60]
Nandan Thakur, Nils Reimers, Johannes Daxenberger, and Iryna Gurevych. 2021. Augmented SBERT: Data augmentation method for improving bi-encoders for pairwise sentence scoring tasks. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 296–310. DOI:
[61]
Shyam Upadhyay, Manaal Faruqui, Chris Dyer, and Dan Roth. 2016. Cross-lingual models of word embeddings: An empirical comparison. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 1661–1670. DOI:
[62]
M. K. Vijaymeena and K. Kavitha. 2016. A survey on similarity measures in text mining. Mach. Learn. Applic. Int. J. 3, 2 (2016), 19–28.
[63]
Marta Vila, Horacio Rodríguez, and M. Antònia Martí. 2015. Relational paraphrase acquisition from Wikipedia: The WRPA method and corpus. Nat. Lang. Eng. 21, 3 (2015), 355–389.
[64]
Tedo Vrbanec and Ana Meštrović. 2020. Corpus-based paraphrase detection experiments and review. Information 11, 5 (2020), 241.
[65]
Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 1112–1122. DOI:
[66]
Andrew Yates, Rodrigo Nogueira, and Jimmy Lin. 2021. Pretrained transformers for text ranking: BERT and beyond. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining. 1154–1156.

Cited By

  • (2024) Application of the transformer model algorithm in Chinese word sense disambiguation: A case study in Chinese language. Scientific Reports 14, 1. DOI: 10.1038/s41598-024-56976-5. Online publication date: 15-Mar-2024.
  • (2024) Mono-lingual text reuse detection for the Urdu language at lexical level. Engineering Applications of Artificial Intelligence 136, PB. DOI: 10.1016/j.engappai.2024.109003. Online publication date: 1-Oct-2024.
  • (2023) Urdu text reuse detection at phrasal level using a sentence transformer-based approach. Expert Systems with Applications: An International Journal 234, C. DOI: 10.1016/j.eswa.2023.121063. Online publication date: 30-Dec-2023.


    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 4
    April 2023
    682 pages
    ISSN:2375-4699
    EISSN:2375-4702
    DOI:10.1145/3588902

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 12 April 2023
    Online AM: 28 February 2023
    Accepted: 24 February 2023
    Revised: 22 February 2023
    Received: 15 October 2022
    Published in TALLIP Volume 22, Issue 4


    Author Tags

    1. Urdu paraphrase detection
    2. Urdu paraphrases
    3. Urdu corpus generation
    4. Urdu language

    Qualifiers

    • Research-article
