Abstract
This article describes the development of an online tool for moderating content in social network groups. The core of the system is a machine learning classifier. Each message is represented by a feature set that combines content features extracted from the text with word embedding vectors. The authors ran a series of experiments to find the best combination of vector representation, content features, and classification method. Tests on a dataset of 11 thousand messages in Russian showed an accuracy of 87%. The article also proposes an architecture for a group moderator's web application that can automatically apply classification results to control users and the display of posts.
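The general idea of joining hand-crafted content features with a text vector and feeding the result to a classifier can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' pipeline: the three content features, the toy English lexicon and messages, the hashed bag-of-words vector (a stand-in for a trained word embedding), and the small logistic regression are all hypothetical choices made so that the example runs with the Python standard library alone.

```python
import math
import re
import zlib

DIM = 32  # size of the hashed text vector (stand-in for an embedding vector)
TOXIC_LEXICON = {"idiot", "stupid", "moron"}  # hypothetical toy lexicon


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def content_features(text):
    """Assumed content features: uppercase ratio, '!' count, lexicon hits."""
    letters = [c for c in text if c.isalpha()]
    upper_ratio = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    lexicon_hits = sum(t in TOXIC_LEXICON for t in tokenize(text))
    return [upper_ratio, float(text.count("!")), float(lexicon_hits)]


def hashed_vector(text):
    """Length-normalised hashed bag of words; NOT a real word embedding."""
    vec = [0.0] * DIM
    tokens = tokenize(text)
    for token in tokens:
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    n = len(tokens) or 1
    return [v / n for v in vec]


def featurize(text):
    """Concatenate content features with the text vector."""
    return content_features(text) + hashed_vector(text)


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def train(texts, labels, epochs=300, lr=0.1):
    """Fit a logistic regression with plain stochastic gradient descent."""
    rows = [featurize(t) for t in texts]
    weights, bias = [0.0] * len(rows[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            err = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            bias -= lr * err
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
    return weights, bias


def predict_proba(model, text):
    """Return the estimated probability that the message is toxic."""
    weights, bias = model
    x = featurize(text)
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Toy training data (invented for illustration): 1 = toxic, 0 = acceptable.
TEXTS = [
    "You are an idiot!!!",
    "STUPID moron, shut up!",
    "what a stupid idea, idiot",
    "Thanks for the helpful post",
    "See you at the meeting tomorrow",
    "Great photo, I like it",
]
LABELS = [1, 1, 1, 0, 0, 0]

if __name__ == "__main__":
    model = train(TEXTS, LABELS)
    for message in TEXTS:
        print(f"{predict_proba(model, message):.2f}  {message}")
```

In a real system the hashed vector would be replaced by one of the vector representations being compared, and the classifier by the method that performs best in the experiments; the concatenation step in `featurize` is the part this sketch is meant to show.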
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Dolgushin, M., Ismakova, D., Bidulya, Y., Krupkin, I., Barskaya, G., Lesiv, A. (2021). Toxic Comment Classification Service in Social Network. In: Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2021. Lecture Notes in Computer Science, vol 12997. Springer, Cham. https://doi.org/10.1007/978-3-030-87802-3_15
DOI: https://doi.org/10.1007/978-3-030-87802-3_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-87801-6
Online ISBN: 978-3-030-87802-3
eBook Packages: Computer Science (R0)