Hate Speech Detection on Code-Mixed Dataset Using a Fusion of Custom and Pre-trained Models with Profanity Vector Augmentation

  • Original Research
  • SN Computer Science

Abstract

As user-generated content on social media grows, so does the volume of hate speech and offensive language. Automatically detecting such content is an active problem in computer science, and the natural language processing community has begun to address it through automated hate speech and offensive content detection. Because most of this content originates on social media, detection systems must cope with non-standard spelling and grammatical variation. In multilingual communities the content is often code-mixed, which makes the task harder still. In this article, we propose a model for code-mixed hate speech detection. The model combines knowledge from both user-trained and multilingual pre-trained models. The proposed method also builds a profanity vector from a profanity word list and augments the model with it. Experimental results on code-mixed hate speech and offensive language detection benchmarks show that our method outperforms the existing baselines.
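To make the fusion idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the fusion is a simple concatenation of two sentence vectors with a binary profanity indicator vector; the abstract does not specify the actual mechanism. The lexicon `PROFANITY`, the helper names `profanity_vector` and `fuse`, and the placeholder vectors are all hypothetical.

```python
# Hypothetical profanity lexicon; the article's actual word list is not reproduced here.
PROFANITY = ["idiot", "stupid", "moron"]

def profanity_vector(tokens, lexicon=PROFANITY):
    """Binary indicator vector: 1.0 at position i if lexicon[i] occurs in the text."""
    present = {t.lower() for t in tokens}
    return [1.0 if word in present else 0.0 for word in lexicon]

def fuse(custom_vec, pretrained_vec, tokens):
    """Assumed fusion: concatenate both sentence vectors with the profanity vector."""
    return list(custom_vec) + list(pretrained_vec) + profanity_vector(tokens)

# Placeholder sentence vectors standing in for the outputs of the two encoders.
custom_vec = [0.1, -0.3, 0.7, 0.2]
pretrained_vec = [0.5, 0.0, -0.2, 0.9, -0.4, 0.3]

fused = fuse(custom_vec, pretrained_vec, "you are an Idiot yaar".split())
print(len(fused))   # 13
print(fused[-3:])   # [1.0, 0.0, 0.0]
```

In a setup like this, the fused vector would be passed to a classifier layer; concatenation is chosen here only because it is the simplest way to combine representations of different sizes.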

Notes

  1. https://github.com/suman101112/Hate-Speech-Detection-on-Code-Mixed-Dataset-using-a-Fusion-of-Custom-and-Pre-Trained-models-with-Pro.

  2. https://www.nltk.org/.

  3. https://pypi.org/project/googletrans/.

  4. If the source and target languages are set to the same value, the googletrans API works as a transliterator, i.e., it transliterates the given text into the script of the specified language.

  5. https://pypi.org/project/symspellpy/.

  6. https://huggingface.co/blog/how-to-train.

  7. https://www.nltk.org/api/nltk.sentiment.html.

  8. https://www.cs.cmu.edu/~biglou/resources/bad-words.txt.

  9. All the models from the shared task are evaluated using the weighted-F1 score.

  10. https://pypi.org/project/ai4bharat-transliteration/.

  11. https://pypi.org/project/emoji/.
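
The weighted-F1 score mentioned in note 9 can be sketched as follows. This is an illustrative pure-Python version of the standard definition (per-class F1 averaged with weights proportional to each class's support), not code from the article; the labels `"hate"` and `"none"` and the toy predictions are hypothetical.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (n / total) * f1
    return score

# Toy example with hypothetical labels.
y_true = ["hate", "hate", "none", "none", "none", "none"]
y_pred = ["hate", "none", "none", "none", "none", "hate"]
print(round(weighted_f1(y_true, y_pred), 4))  # 0.6667
```

Here the "hate" class has F1 = 0.5 with support 2 and the "none" class has F1 = 0.75 with support 4, giving (2/6)(0.5) + (4/6)(0.75) = 2/3. scikit-learn's `f1_score` with `average="weighted"` computes the same quantity.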

Author information

Corresponding author

Correspondence to Suman Dowlagar.

Ethics declarations

Conflict of interest

On behalf of all authors, Suman Dowlagar states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Machine Learning for Offensive and Highly Emotional Content on Social Media” guest edited by Bharathi Raja Asoka Chakravarthi, Anand Kumar M, Sandip Modha, Thomas Mandl and Prasenjit Majumder.

About this article

Cite this article

Dowlagar, S., Mamidi, R. Hate Speech Detection on Code-Mixed Dataset Using a Fusion of Custom and Pre-trained Models with Profanity Vector Augmentation. SN COMPUT. SCI. 3, 306 (2022). https://doi.org/10.1007/s42979-022-01189-8

  • DOI: https://doi.org/10.1007/s42979-022-01189-8
