Hate Speech Detection on Code-Mixed Dataset Using a Fusion of Custom and Pre-trained Models with Profanity Vector Augmentation

Abstract

With the increase in user-generated content on social media networks, hate speech and offensive language are also increasing. From a computer science perspective, automatic detection of such content is an interesting problem, and the natural language processing community has taken steps to address it through automated hate speech and offensive content detection. Because hate speech is generated mostly on social media, automatic detection faces many challenges arising from non-standard spelling and grammar variations. In a multilingual community in particular, hateful content tends to appear in code-mixed form, which makes the task more challenging still. In this article, we propose a model for code-mixed hate speech detection. The model embeds knowledge from both user-trained and multilingual pre-trained models. The proposed method also builds a profanity word list and augments the model with a profanity vector derived from it. Experimental results on code-mixed hate speech and offensive language detection benchmarks show that our method outperforms the existing baselines.
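
The abstract gives no implementation detail, so the following is only a minimal sketch of one way profanity vector augmentation could work, not the authors' method. It assumes the profanity vector holds one count per lexicon entry and that fusion is plain concatenation of the two model embeddings with that vector. The lexicon entries, the function names `profanity_vector` and `fuse`, and the embedding values are all hypothetical placeholders.

```python
from typing import List, Sequence

# Hypothetical placeholder lexicon; the paper's actual word list is not shown here.
PROFANITY_LEXICON: List[str] = ["badword", "gaali", "slur"]


def profanity_vector(tokens: Sequence[str], lexicon: Sequence[str]) -> List[float]:
    """Count case-insensitive occurrences of each lexicon entry in the tokens."""
    counts = {word: 0 for word in lexicon}
    for token in tokens:
        key = token.lower()
        if key in counts:
            counts[key] += 1
    return [float(counts[word]) for word in lexicon]


def fuse(custom_vec: Sequence[float],
         pretrained_vec: Sequence[float],
         prof_vec: Sequence[float]) -> List[float]:
    """Concatenate both embeddings and the profanity vector (assumed fusion rule)."""
    return list(custom_vec) + list(pretrained_vec) + list(prof_vec)


# Stand-in embeddings; a real system would obtain these from the two encoders.
custom_embedding = [0.1, -0.3]
pretrained_embedding = [0.7, 0.2, -0.5]

tokens = ["You", "GAALI", "gaali", "ok"]
features = fuse(custom_embedding, pretrained_embedding,
                profanity_vector(tokens, PROFANITY_LEXICON))
print(features)
```

The fused feature list would then be passed to a classifier. Concatenation is chosen here only because it is the simplest way to combine the three sources; the article itself may use a different scheme.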

Notes

https://github.com/suman101112/Hate-Speech-Detection-on-Code-Mixed-Dataset-using-a-Fusion-of-Custom-and-Pre-Trained-models-with-Pro.
https://www.nltk.org/.
https://pypi.org/project/googletrans/.
If the source and target languages are set to the same value, the googletrans API acts as a transliterator, i.e., it transliterates the given text into the one specified by the source/target language.
https://pypi.org/project/symspellpy/.
https://huggingface.co/blog/how-to-train.
https://www.nltk.org/api/nltk.sentiment.html.
https://www.cs.cmu.edu/~biglou/resources/bad-words.txt.
All the models from the shared task are evaluated using the weighted F1 score.
https://pypi.org/project/ai4bharat-transliteration/.
https://pypi.org/project/emoji/.
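
The weighted F1 score mentioned in the notes above averages the per-class F1 scores, weighting each class by its share of the true labels. The sketch below works through that definition in plain Python; the function name `weighted_f1` and the toy labels are illustrative assumptions, not data from the article.

```python
from collections import Counter
from typing import Sequence


def weighted_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Average per-class F1, weighted by each class's support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label in sorted(support):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        score += support[label] / total * f1
    return score


# Toy example with an imbalanced label distribution (made-up labels).
y_true = ["hate", "hate", "none", "none", "none", "none"]
y_pred = ["hate", "none", "none", "none", "none", "hate"]
print(round(weighted_f1(y_true, y_pred), 4))
```

In this toy case the "hate" class has F1 = 0.5 with support 2 and the "none" class has F1 = 0.75 with support 4, so the weighted score is (2 × 0.5 + 4 × 0.75) / 6 = 2/3.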

Author information

Authors and Affiliations

LTRC, IIIT-Hyderabad, Hyderabad, Telangana, 500032, India
Suman Dowlagar & Radhika Mamidi

Corresponding author

Correspondence to Suman Dowlagar.

Ethics declarations

Conflict of interest

On behalf of all authors, Suman Dowlagar states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Machine Learning for Offensive and Highly Emotional Content on Social Media” guest edited by Bharathi Raja Asoka Chakravarthi, Anand Kumar M, Sandip Modha, Thomas Mandl and Prasenjit Majumder.

About this article

Cite this article

Dowlagar, S., Mamidi, R. Hate Speech Detection on Code-Mixed Dataset Using a Fusion of Custom and Pre-trained Models with Profanity Vector Augmentation. SN COMPUT. SCI. 3, 306 (2022). https://doi.org/10.1007/s42979-022-01189-8

Received: 13 June 2021
Accepted: 04 May 2022
Published: 24 May 2022
DOI: https://doi.org/10.1007/s42979-022-01189-8
