research-article

Detection of Offensive Language and ITS Severity for Low Resource Language

Published: 17 June 2023

Abstract

The continuous proliferation of hate speech in different languages on social media has drawn significant attention from researchers in the past decade. Detecting hate speech is indispensable irrespective of how widely a language is used, as hate speech inflicts huge harm on society. This work presents a first resource for classifying the severity of hate speech in addition to classifying offensive and hate speech content. Current research mostly limits hate speech classification to its primary categories, such as racism, sexism, and hatred of religions. However, hate speech targeted at different protected characteristics also manifests in different forms and intensities. It is important to understand the varying severity levels of hate speech so that the most harmful cases can be identified and dealt with earlier than the less harmful ones. In this work, we focus on detecting offensive speech, hate speech, and multiple levels of hate speech in the Urdu language. We investigate three primary target categories of hate speech: religion, racism, and national origin. We further divide these categories into levels based on the severity of the hate conveyed. The severity levels are referred to as symbolization, insult, and attribution. We collect and annotate a corpus of more than 20,000 tweets covering these hate speech categories and severity levels. We then evaluate traditional as well as deep learning-based models to examine their impact on hate speech detection. The highest macro-averaged F-score for detecting offensive speech is 86%, while the highest F-scores for detecting hate speech with respect to ethnicity, national origin, and religious affiliation are 80%, 81%, and 72%, respectively. These results are encouraging and provide a starting point for further investigation in this domain.
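The abstract reports results as macro-averaged F-scores. As a minimal sketch of what that metric measures, the Python below computes a per-class F1 score and averages the classes with equal weight, so that a rare class counts as much as a frequent one. The function name `macro_f1`, the label names, and the toy data are illustrative assumptions; they are not taken from the paper or its corpus.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gets a false positive
            fn[truth] += 1   # true class gets a false negative
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with hypothetical labels (not data from the paper).
y_true = ["neutral", "neutral", "offensive", "hate", "hate", "hate"]
y_pred = ["neutral", "offensive", "offensive", "hate", "hate", "neutral"]
print(round(macro_f1(y_true, y_pred), 4))
```

Because every class contributes equally to the average, a model that ignores a small class is penalized even if its overall accuracy is high, which is one common reason to report macro-averaged scores on imbalanced datasets.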




    Information & Contributors

    Information

    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 22, Issue 6
    June 2023
    635 pages
    ISSN:2375-4699
    EISSN:2375-4702
    DOI:10.1145/3604597

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 17 June 2023
    Online AM: 19 January 2023
    Accepted: 06 January 2023
    Revised: 06 November 2022
    Received: 27 April 2022
    Published in TALLIP Volume 22, Issue 6


    Author Tags

    1. Hate speech
    2. long short-term memory
    3. Urdu NLP
    4. convolutional neural network
    5. BERT

    Qualifiers

    • Research-article


    Cited By

    • (2025) A Survey on Online Aggression: Content Detection and Behavioral Analysis on Social Media. ACM Computing Surveys. DOI: 10.1145/3711125. 57:7 (1-36). Online publication date: 21-Feb-2025
    • (2025) A deep dive into automated sexism detection using fine-tuned deep learning and large language models. Engineering Applications of Artificial Intelligence. DOI: 10.1016/j.engappai.2025.110167. 145 (110167). Online publication date: Apr-2025
    • (2024) Investigating Offensive Language Detection in a Low-Resource Setting with a Robustness Perspective. Big Data and Cognitive Computing. DOI: 10.3390/bdcc8120170. 8:12 (170). Online publication date: 25-Nov-2024
    • (2024) HumourHindiNet: Humour detection in Hindi web series using word embedding and convolutional neural network. ACM Transactions on Asian and Low-Resource Language Information Processing. DOI: 10.1145/3661306. 23:7 (1-21). Online publication date: 26-Jun-2024
    • (2024) Hate Speech Detection in Roman Urdu using Machine Learning Techniques. 2024 5th International Conference on Advancements in Computational Sciences (ICACS). DOI: 10.1109/ICACS60934.2024.10473250. (1-7). Online publication date: 19-Feb-2024
    • (2024) Hate Speech and Target Community Detection in Nastaliq Urdu Using Transfer Learning Techniques. IEEE Access. DOI: 10.1109/ACCESS.2024.3444188. 12 (116875-116890). Online publication date: 2024
    • (2024) Leveraging Transfer Learning for Hate Speech Detection in Portuguese Social Media Posts. IEEE Access. DOI: 10.1109/ACCESS.2024.3430848. 12 (101374-101389). Online publication date: 2024
    • (2024) A comprehensive review on automatic hate speech detection in the age of the transformer. Social Network Analysis and Mining. DOI: 10.1007/s13278-024-01361-3. 14:1. Online publication date: 9-Oct-2024
    • (2024) Detecting Offensive Language in Tamil YouTube Comments. Computing and Machine Learning. DOI: 10.1007/978-981-97-7571-2_31. (407-420). Online publication date: 25-Dec-2024
    • (2023) Contextual Embeddings based on Fine-tuned Urdu-BERT for Urdu threatening content and target identification. Journal of King Saud University - Computer and Information Sciences. DOI: 10.1016/j.jksuci.2023.101606. 35:7. Online publication date: 1-Jul-2023
