CatRevenge: towards effective revenge text detection in online social media with paragraph embedding and CATBoost

Sayani Ghosal^1,2 & Amita Jain^3

Abstract

Internet users produce and consume huge amounts of data, most of it in natural language, and they express their feelings, emotions and thoughts on social media. Social media providers are responsible for maintaining a healthy communication environment among users. Detecting revenge in social media text is very challenging because of long sentences in which the semantic relations between tokens dissolve. As a result, social media providers have not paid attention to identifying users who spread revenge. This article proposes a novel model, CatRevenge, which identifies both active and passive revenge. The model preprocesses text with the Slangzy internet slang meaning dictionary to detect revenge text more efficiently. CatRevenge assigns an impact weight to each part of speech in a sentence based on its relevance and the TF-IDF scores of the words. The model also uses a paragraph embedding model for contextual semantic analysis of revenge text. In addition, this research applies the gradient boosting CATBoost classifier with categorical features to reduce model overfitting. Its feature ranking method can reduce the dimensionality of the data by ranking the most significant features. The research uses an English-language dataset of revenge posts from the Reddit social media platform and evaluates the model on binary and multiclass classification. Results demonstrate a 6–10% increase in binary classification and a 2.5–5% increase in multiclass classification on the weighted F1 metric.
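The part-of-speech weighting step can be illustrated with a minimal sketch. It assumes the text is already tokenised and tagged, and it multiplies a smoothed TF-IDF score by a per-tag impact weight. The abstract does not give the weight values or the exact formula, so the names `POS_IMPACT`, `DEFAULT_IMPACT` and `pos_weighted_tfidf`, the weight values, and the smoothing are all assumptions made for illustration and are not the authors' implementation.

```python
import math
from collections import Counter

# Hypothetical impact weights per part of speech; the abstract does not
# state the values CatRevenge uses, so these are placeholders.
POS_IMPACT = {"VERB": 1.5, "ADJ": 1.2, "NOUN": 1.0}
DEFAULT_IMPACT = 0.5  # assumed weight for every other tag


def pos_weighted_tfidf(docs):
    """Return one {token: score} dict per document.

    docs is a list of documents, each a list of (token, pos_tag) pairs.
    score = term frequency * smoothed idf * impact weight of the tag.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token.
    df = Counter()
    for doc in docs:
        df.update({tok for tok, _ in doc})

    scored = []
    for doc in docs:
        counts = Counter(tok for tok, _ in doc)
        tags = {tok: pos for tok, pos in doc}  # last tag seen for a token wins
        total = len(doc)
        scores = {}
        for tok, count in counts.items():
            tf = count / total
            idf = math.log((1 + n_docs) / (1 + df[tok])) + 1  # smoothed idf
            impact = POS_IMPACT.get(tags[tok], DEFAULT_IMPACT)
            scores[tok] = tf * idf * impact
        scored.append(scores)
    return scored


# Two toy, pre-tagged posts (invented for this example).
docs = [
    [("i", "PRON"), ("will", "AUX"), ("destroy", "VERB"), ("him", "PRON")],
    [("i", "PRON"), ("like", "VERB"), ("him", "PRON")],
]
scores = pos_weighted_tfidf(docs)
```

In this toy input the verb "destroy" scores higher than the pronoun "i" in the first post, because it is both rarer across the posts and carries a larger assumed impact weight.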

Data availability

Available on request.

Code availability

Available on request.

Notes

https://github.com/ebsiegs/subreddit_nlp/blob/main/data/subreddit_data.csv

Funding

No funding was received for conducting this study.

Author information

Authors and Affiliations

NSUT East Campus (Erstwhile A.I.A.C.T.R.), Guru Gobind Singh Indraprastha University, Dwarka, New Delhi, India
Sayani Ghosal
KIET Group of Institutions, Ghaziabad, Ghaziabad Delhi-NCR, India
Sayani Ghosal
Netaji Subhas University of Technology, New Delhi, India
Amita Jain

Corresponding author

Correspondence to Amita Jain.

Ethics declarations

Competing interests

The authors have no financial or proprietary interests in any material discussed in this article.

Conflicts of interest/Competing interests

Not Applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ghosal, S., Jain, A. CatRevenge: towards effective revenge text detection in online social media with paragraph embedding and CATBoost. Multimed Tools Appl 83, 89607–89633 (2024). https://doi.org/10.1007/s11042-024-18791-y

Received: 14 May 2023
Revised: 19 September 2023
Accepted: 24 February 2024
Published: 01 April 2024
Issue Date: December 2024
DOI: https://doi.org/10.1007/s11042-024-18791-y
