
Urdu Short Paraphrase Detection at Sentence Level

Published: 12 April 2023

Abstract

Paraphrase detection systems uncover the relationship between two text fragments and classify them as paraphrased when they convey the same idea, and as non-paraphrased otherwise. Previously, researchers have mainly focused on developing paraphrase detection resources for the English language, and there have been very few efforts for South Asian languages. In particular, no research has been conducted on sentence-level paraphrase detection in Urdu, a low-resourced language, mainly because no corpora focusing on the sentence level are available; the existing studies on Urdu address only text reuse detection at the passage and document levels. Therefore, this study aims to develop a large-scale, manually annotated benchmark corpus for Urdu paraphrase detection at the sentence level, based on real cases from journalism. The proposed Urdu Sentential Paraphrases (USP) corpus contains 4,900 sentences (2,941 paraphrased and 1,959 non-paraphrased), manually collected from Urdu newspapers. Moreover, as a secondary contribution, several techniques were proposed, developed, and compared, including Word Embedding (WE), Sentence Transformers (ST), and feature-fusion techniques, with N-gram treated as the baseline. The experimental results indicate that the proposed feature-fusion technique is the most suitable for the Urdu paraphrase detection task, and that performance increases when the features of the proposed ST technique and the N-gram baseline are combined for classification. In addition, the proposed techniques were applied to the UPPC corpus to assess their performance at the document level, where the best result was obtained with the feature-fusion technique (F1 = 0.855). Our corpus is freely available to download for research purposes.
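
To make the described approach concrete, below is a minimal illustrative sketch (in Python) of sentence-level paraphrase classification with feature fusion, combining word n-gram overlap scores with a sentence-transformer cosine similarity and feeding both to a classifier. This is not the authors' implementation: the model checkpoint, the exact feature set, the classifier, and the toy sentence pairs are assumptions chosen only to illustrate the idea, and the sketch relies on the sentence-transformers and scikit-learn libraries.

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

def ngram_overlap(a: str, b: str, n: int) -> float:
    # Containment-style word n-gram overlap between two sentences.
    def grams(text):
        tokens = text.split()
        return {tuple(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 0))}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / min(len(ga), len(gb)) if ga and gb else 0.0

# A multilingual sentence-transformer checkpoint that covers Urdu (an assumption;
# any multilingual ST model could be substituted here).
st_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def fused_features(pairs):
    # Fuse the ST cosine similarity with word 1/2/3-gram overlap scores.
    emb1 = st_model.encode([a for a, _ in pairs], normalize_embeddings=True)
    emb2 = st_model.encode([b for _, b in pairs], normalize_embeddings=True)
    cosine = np.sum(emb1 * emb2, axis=1, keepdims=True)
    overlaps = np.array([[ngram_overlap(a, b, n) for n in (1, 2, 3)] for a, b in pairs])
    return np.hstack([cosine, overlaps])

# Toy usage with hypothetical labelled pairs (1 = paraphrased, 0 = non-paraphrased);
# in practice the features would be computed over the USP sentence pairs.
train_pairs = [("he arrived late to the meeting", "he was late for the meeting"),
               ("the market closed higher today", "the festival starts next week")]
classifier = LogisticRegression().fit(fused_features(train_pairs), [1, 0])
print(classifier.predict(fused_features([("she left early", "she departed early")])))

Under this sketch, the N-gram baseline corresponds to using only the overlap columns, the ST technique to using only the cosine similarity, and feature fusion to concatenating both, which is the combination the abstract reports as performing best.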

References

[1]
Sawood Alam, Fateh ud din B. Mehmood, and Michael L. Nelson. 2015. Improving accessibility of archived raster dictionaries of complex script languages. In Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL’15). ACM, 47–56. DOI:
[2]
Faisal Alvi, El-Sayed M. El-Alfy, Wasfi G. Al-Khatib, and Radwan E. Abdel-Aal. 2012. Analysis and extraction of sentence-level paraphrase sub-corpus in CS education. In Proceedings of the 13th Annual Conference on Information Technology Education (SIGITE’12). Association for Computing Machinery, New York, NY, 49–54. DOI:
[3]
Alberto Barrón-Cedeño, Marta Vila, M. Antònia Martí, and Paolo Rosso. 2013. Plagiarism meets paraphrasing: Insights for the next generation in automatic plagiarism detection. Comput. Linguist. 39, 4 (2013), 917–947.
[4]
Regina Barzilay and Lillian Lee. 2003. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology. Association for Computational Linguistics, 16–23.
[5]
Vuk Batanović, Bojan Furlan, and Boško Nikolić. 2011. A software system for determining the semantic similarity of short texts in Serbian. In Proceedings of the 19th Telecommunications Forum (TELFOR). IEEE, 1249–1252.
[6]
Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 632–642. DOI:
[7]
Alexander Budanitsky and Graeme Hirst. 2006. Evaluating WordNet-based measures of lexical semantic relatedness. Comput. Linguist. 32, 1 (Mar. 2006), 13–47. DOI:
[8]
Steven Burrows, Martin Potthast, and Benno Stein. 2013. Paraphrase acquisition via crowdsourcing and machine learning. ACM Trans. Intell. Syst. Technol. 4, 3 (2013), 43.
[9]
Paul Clough and Rob Gaizauskas. 2009. Corpora and Text Re-use. De Gruyter Mouton, 1249–1271. DOI:
[10]
Paul Clough, Robert Gaizauskas, Scott S. L. Piao, and Yorick Wilks. 2002. Meter: Measuring text reuse. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 152–159.
[11]
Paul Clough and Mark Stevenson. 2011. Developing a corpus of plagiarised short answers. Lang. Resour. Eval. 45, 1 (2011), 5–24.
[12]
Trevor Cohn, Chris Callison-Burch, and Mirella Lapata. 2008. Constructing corpora for the development and evaluation of paraphrase systems. Comput. Linguist. 34, 4 (2008), 597–614.
[13]
Ali Daud, Wahab Khan, and Dunren Che. 2017. Urdu language processing: A survey. Artif. Intell. Rev. 47, 3 (2017), 279–311.
[14]
Seniz Demir, Ilknur Durgar El-Kahlout, Erdem Unal, and Hamza Kaya. 2012. Turkish paraphrase corpus. In Proceedings of the 8th International Conference on Language Resources and Evaluation, LREC 2012, Istanbul, Turkey, May 23-25, 2012, Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Ugur Dogan, Bente Maegaard, Joseph Mariani, Jan Odijk, and Stelios Piperidis (Eds.). European Language Resources Association (ELRA), 4087–4091. Retrieved from http://www.lrec-conf.org/proceedings/lrec2012/summaries/968.html.
[15]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 4171–4186. DOI:
[16]
William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the 3rd International Workshop on Paraphrasing. 9–16.
[17]
Asif Ekbal, Sriparna Saha, and Gaurav Choudhary. 2012. Plagiarism detection in text using vector space model. In Proceedings of the 12th International Conference on Hybrid Intelligent Systems (HIS). 366–371.
[18]
Mohamed I. El Desouki, Wael H. Gomaa, and Hawaf Abdalhakim. 2019. A hybrid model for paraphrase detection combines pros of text similarity with deep learning. Int. J. Comput. Appl. 975 (2019), 8887.
[19]
Anthony Fader, Luke Zettlemoyer, and Oren Etzioni. 2013. Paraphrase-driven learning for open question answering. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics. 1608–1618.
[20]
Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. 2022. Language-agnostic BERT sentence embedding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 878–891. DOI:
[21]
Alena Fenogenova. 2021. Russian paraphrasers: Paraphrase with transformers. In Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing. 11–19.
[22]
Samuel Fernando and Mark Stevenson. 2008. A semantic similarity approach to paraphrase detection. In Proceedings of the 11th Annual Research Colloquium of the UK Special Interest Group for Computational Linguistics.
[23]
Chris Forsythe, Michael L. Bernard, and Timothy E. Goldsmith. 2006. Cognitive Systems: Human Cognitive Models in Systems Design. Psychology Press.
[24]
Veena Gangadharan, Deepa Gupta, L. Amritha, and T. A. Athira. 2020. Paraphrase detection using deep neural network-based word embedding techniques. In Proceedings of the 4th International Conference on Trends in Electronics and Informatics (ICOEI). 517–521. DOI:
[25]
Sahar Ghannay, Benoit Favre, Yannick Esteve, and Nathalie Camelin. 2016. Word embedding evaluation and combination. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC’16). 300–305.
[26]
Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC’18). European Language Resources Association (ELRA). Retrieved from https://aclanthology.org/L18-1550.
[27]
Vadim Gudkov, Olga Mitrofanova, and Elizaveta Filippskikh. 2020. Automatically ranked Russian paraphrase corpus for text generation. In Proceedings of the 4th Workshop on Neural Generation and Translation. Association for Computational Linguistics, 54–59. DOI:
[28]
Xiao Guo, Hengameh Mirzaalian, Ekraam Sabir, Ayush Jaiswal, and Wael Abd-Almageed. 2020. CORD19STS: COVID-19 Semantic Textual Similarity Dataset. arxiv:cs.CL/2007.02461.
[29]
Yaakov HaCohen-Kerner, Zuriel Gross, and Asaf Masa. 2005. Automatic extraction and learning of keyphrases from scientific articles. Lect. Notes Comput. Sci. 3406 (2005), 657–669. DOI:
[30]
Samar Haider. 2018. Urdu word embeddings. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC’18). European Language Resources Association (ELRA). Retrieved from https://aclanthology.org/L18-1155.
[31]
Mena Hany and Wael H. Gomaa. 2022. A hybrid approach to paraphrase detection based on text similarities and machine learning classifiers. In Proceedings of the 2nd International Mobile, Intelligent, and Ubiquitous Computing Conference (MIUCC). 343–348. DOI:
[32]
Hangfeng He, Qiang Ning, and Dan Roth. 2020. QuASE: Question-answer driven sentence encoding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 8743–8758. DOI:
[33]
Ethan Hunt, Ritvik Janamsetty, Chanana Kinares, Chanel Koh, Alexis Sanchez, Felix Zhan, Murat Ozdemir, Shabnam Waseem, Osman Yolcu, Binay Dahal, Justin Zhan, Laxmi Gewali, and Paul Oh. 2019. Machine learning models for paraphrase identification and its applications on plagiarism detection. In Proceedings of the IEEE International Conference on Big Knowledge (ICBK). 97–104. DOI:
[34]
Safia Kanwal, Kamran Malik, Khurram Shahzad, Faisal Aslam, and Zubair Nawaz. 2019. Urdu named entity recognition: Corpus generation and deep learning applications. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 19, 1 (2019), 1–13.
[35]
Pei Ke, Haozhe Ji, Siyang Liu, Xiaoyan Zhu, and Minlie Huang. 2020. SentiLARE: Linguistic knowledge enhanced language representation for sentiment analysis. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). 6975–6988.
[36]
Khadijeh Khoshnavataher, Vahid Zarrabi, Salar Mohtaj, and Habibollah Asghari. 2015. Developing monolingual Persian corpus for extrinsic plagiarism detection using artificial obfuscation: Notebook for PAN at CLEF 2015. In Working Notes of CLEF 2015 - Conference and Labs of the Evaluation Forum, Toulouse, France, September 8-11, 2015 (CEUR Workshop Proceedings), Vol. 1391. CEUR-WS.org. Retrieved from http://ceur-ws.org/Vol-1391/146-CR.pdf.
[37]
Alfirna Rizqi Lahitani, Adhistya Erna Permanasari, and Noor Akhmad Setiawan. 2016. Cosine similarity to determine similarity measure: Study case in online essay assessment. In Proceedings of the 4th International Conference on Cyber and IT Service Management. IEEE, 1–6.
[38]
Arthur Malajyan, Karen Avetisyan, and Tsolak Ghukasyan. 2020. ARPA: Armenian Paraphrase Detection Corpus and Models. arxiv:cs.CL/2009.12615.
[39]
Riccardo Massidda. 2020. rmassidda@DaDoEval: Document dating using sentence embeddings at EVALITA 2020. In Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA’20).
[40]
Tony McEnery, Paul Baker, and Lou Burnard. 2000. Corpus resources and minority language engineering. In Proceedings of the International Conference on Language Resources and Evaluation.
[41]
Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic regularities in continuous space word representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 746–751.
[42]
Shervin Minaee, Nal Kalchbrenner, Erik Cambria, Narjes Nikzad, Meysam Chenaghlu, and Jianfeng Gao. 2021. Deep learning based text classification: A comprehensive review. ACM Comput. Surv. 54, 3 (2021), 1–40.
[43]
Yusuke Mori, Hiroaki Yamane, Yusuke Mukuta, and Tatsuya Harada. 2020. Finding and generating a missing part for story completion. In Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature. 156–166.
[44]
Stanislav Naumov, Grigory Yaroslavtsev, and Dmitrii Avdiukhin. 2021. Objective-based hierarchical clustering of deep embedding vectors. In Proceedings of the AAAI Conference on Artificial Intelligence. 9055–9063.
[45]
Jakob Navrozidis and Hannes Jansson. 2020. Using Natural Language Processing to Identify Similar Patent Documents. LU-CS-EX (2020).
[46]
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). 1532–1543.
[47]
Lidia Pivovarova, Ekaterina Pronoza, Elena Yagunova, and Anton Pronoza. 2017. ParaPhraser: Russian paraphrase corpus and shared task. In Proceedings of the Conference on Artificial Intelligence and Natural Language. Springer, 211–225.
[48]
Martin Potthast, Benno Stein, Alberto Barrón-Cedeño, and Paolo Rosso. 2010. An evaluation framework for plagiarism detection. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters. Association for Computational Linguistics, 997–1005.
[49]
Ekaterina Pronoza, Elena Yagunova, and Anton Pronoza. 2016. Construction of a Russian paraphrase corpus: Unsupervised paraphrase extraction. In Information Retrieval. Springer, 146–157.
[50]
Ricardo Rei, Craig Stewart, Ana C. Farinha, and Alon Lavie. 2020. COMET: A neural framework for MT evaluation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2685–2702. DOI:
[51]
Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 3982–3992.
[52]
Nils Reimers and Iryna Gurevych. 2020. Making monolingual sentence embeddings multilingual using knowledge distillation. arxiv:cs.CL/2004.09813.
[53]
Philip Resnik, Olivia Buzek, Chang Hu, Yakov Kronrod, Alex Quinn, and Benjamin B. Bederson. 2010. Improving translation via targeted paraphrasing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 127–137.
[54]
Reyhaneh Sadeghi, Hamed Karbasi, and Ahmad Akbari. 2022. ExaPPC: A large-scale Persian paraphrase detection corpus. In Proceedings of the 8th International Conference on Web Research (ICWR). 168–175. DOI:
[55]
Sara Sameen, Muhammad Sharjeel, Rao Muhammad Adeel Nawab, Paul Rayson, and Iqra Muneer. 2018. Measuring short text reuse for the Urdu language. IEEE Access 6, 1 (2018), 7412–7421. DOI:
[56]
Hassan Shahmohammadi, MirHossein Dezfoulian, and Muharram Mansoorizadeh. 2021. Paraphrase detection using LSTM networks and handcrafted features. Multim. Tools Applic. 80, 4 (2021), 6479–6492.
[57]
Muhammad Sharjeel, Rao Muhammad Adeel Nawab, and Paul Rayson. 2017. COUNTER: Corpus of Urdu news text reuse. Lang. Resour. Eval. 51, 3 (Sept. 2017), 777–803. DOI:
[58]
Muhammad Sharjeel, Paul Rayson, and Rao Muhammad Adeel Nawab. 2016. UPPC-Urdu paraphrase plagiarism corpus. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC’16). 1832–1836.
[59]
Yusuke Shinyama and Satoshi Sekine. 2003. Paraphrase acquisition for information extraction. In Proceedings of the 2nd International Workshop on Paraphrasing. Association for Computational Linguistics, 65–71.
[60]
Nandan Thakur, Nils Reimers, Johannes Daxenberger, and Iryna Gurevych. 2021. Augmented SBERT: Data augmentation method for improving bi-encoders for pairwise sentence scoring tasks. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 296–310. DOI:
[61]
Shyam Upadhyay, Manaal Faruqui, Chris Dyer, and Dan Roth. 2016. Cross-lingual models of word embeddings: An empirical comparison. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 1661–1670. DOI:
[62]
M. K. Vijaymeena and K. Kavitha. 2016. A survey on similarity measures in text mining. Mach. Learn. Applic. Int. J. 3, 2 (2016), 19–28.
[63]
Marta Vila, Horacio Rodríguez, and M. Antònia Martí. 2015. Relational paraphrase acquisition from Wikipedia: The WRPA method and corpus. Nat. Lang. Eng. 21, 3 (2015), 355–389.
[64]
Tedo Vrbanec and Ana Meštrović. 2020. Corpus-based paraphrase detection experiments and review. Information 11, 5 (2020), 241.
[65]
Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 1112–1122. DOI:
[66]
Andrew Yates, Rodrigo Nogueira, and Jimmy Lin. 2021. Pretrained transformers for text ranking: BERT and beyond. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining. 1154–1156.

Cited By

  • (2024) Application of the transformer model algorithm in Chinese word sense disambiguation: A case study in Chinese language. Scientific Reports 14, 1. DOI: 10.1038/s41598-024-56976-5. Online publication date: 15-Mar-2024.
  • (2024) Mono-lingual text reuse detection for the Urdu language at lexical level. Engineering Applications of Artificial Intelligence 136, PB. DOI: 10.1016/j.engappai.2024.109003. Online publication date: 1-Oct-2024.
  • (2023) Urdu text reuse detection at phrasal level using a sentence transformer-based approach. Expert Systems with Applications: An International Journal 234, C. DOI: 10.1016/j.eswa.2023.121063. Online publication date: 30-Dec-2023.


    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 4
    April 2023
    682 pages
    ISSN:2375-4699
    EISSN:2375-4702
    DOI:10.1145/3588902

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 12 April 2023
    Online AM: 28 February 2023
    Accepted: 24 February 2023
    Revised: 22 February 2023
    Received: 15 October 2022
    Published in TALLIP Volume 22, Issue 4


    Author Tags

    1. Urdu paraphrase detection
    2. Urdu paraphrases
    3. Urdu corpus generation
    4. Urdu language

    Qualifiers

    • Research-article
