Abstract
Reddit has emerged as a leading platform for collecting microblogging data, offering valuable material for pattern and knowledge discovery. However, gathering and preparing this data presents significant challenges, particularly in ensuring its accuracy. Existing methodologies often yield an abundance of irrelevant posts, and datasets for relevance prediction on Reddit are rare. To overcome these obstacles, we propose a new semi-supervised framework that filters Reddit posts based on their topic relevance. Our approach combines annotated data from Twitter with weak labels generated from Wikipedia pages associated with relevant subreddits to automatically label Reddit posts. To improve the model’s generalization, we use a domain adversarial adaptation network to bridge the distribution gap between Twitter and Reddit data. The framework achieves an accuracy of 73% and an F1 score of 0.77, a significant improvement of 20% over baseline models. Additionally, we address research questions regarding the effectiveness of automatic labeling, the use of weakly labeled data, the contextual requirements for training domain adaptation models, and the optimal weak labeling method.
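To make the idea of weak labeling from a reference page concrete, the following is a minimal sketch only. The abstract does not specify how the weak labels are derived from Wikipedia, so the keyword-overlap heuristic, the function names (`top_keywords`, `weak_label`), the stopword list, and the `min_hits` threshold are all illustrative assumptions rather than the authors' method.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = frozenset({"the", "and", "for", "from", "with", "that", "this"})


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def top_keywords(reference_text, k=10):
    """Return up to k of the most frequent non-stopword tokens longer than two letters."""
    counts = Counter(
        tok for tok in tokenize(reference_text)
        if tok not in STOPWORDS and len(tok) > 2
    )
    return {word for word, _ in counts.most_common(k)}


def weak_label(post, keywords, min_hits=2):
    """Label a post 1 (relevant) if it shares at least min_hits keywords, else 0."""
    hits = keywords.intersection(tokenize(post))
    return 1 if len(hits) >= min_hits else 0


# Stand-in for the text of a Wikipedia page linked to a subreddit (made-up example).
reference = (
    "Vaccination is the administration of a vaccine to help the immune system "
    "develop immunity from a disease."
)
keywords = top_keywords(reference, k=50)
for post in ["Got my vaccine today, immune system feels fine",
             "Best pizza place downtown?"]:
    print(weak_label(post, keywords), post)
```

Labels produced this way are noisy by construction, which is why they are called weak labels and why, in the framework described above, they are combined with annotated Twitter data instead of being used alone.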
Acknowledgements
Research was sponsored by the DEVCOM Analysis Center and was accomplished under Cooperative Agreement Number W911NF-22-2-0001. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein. This work was also supported in part by the Temple University Office of the Vice President for Research 2022 Catalytic Collaborative Research Initiative Program, AI & ML Focus Area.
Copyright information
© 2024 IFIP International Federation for Information Processing
About this paper
Cite this paper
Gupta, S., Alshehri, J., Hai, A.A., Otudi, H., Obradovic, Z. (2024). From Tweets to Reddit: Leveraging Semi-supervised Domain Adaptation for Improving Data Filtering. In: Maglogiannis, I., Iliadis, L., Macintyre, J., Avlonitis, M., Papaleonidas, A. (eds) Artificial Intelligence Applications and Innovations. AIAI 2024. IFIP Advances in Information and Communication Technology, vol 714. Springer, Cham. https://doi.org/10.1007/978-3-031-63223-5_22
DOI: https://doi.org/10.1007/978-3-031-63223-5_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-63222-8
Online ISBN: 978-3-031-63223-5
eBook Packages: Computer Science, Computer Science (R0)