From Tweets to Reddit: Leveraging Semi-supervised Domain Adaptation for Improving Data Filtering

Part of the book series: IFIP Advances in Information and Communication Technology (IFIPAICT, volume 714)

Included in the following conference series:

IFIP International Conference on Artificial Intelligence Applications and Innovations

Abstract

Reddit has emerged as a leading platform for microblogging data collection, providing valuable insights into patterns and knowledge discovery. However, the process of gathering and preparing this data presents significant challenges, particularly when it comes to ensuring its accuracy. Existing methodologies often yield an abundance of irrelevant posts, and datasets for relevance prediction on Reddit are rare. To overcome these obstacles, we propose a new semi-supervised framework that filters Reddit posts based on their topic relevance. Our approach combines annotated data from Twitter with weak labels generated from Wikipedia pages associated with relevant subreddits to automatically label Reddit posts. To enhance the model’s generalization performance, we utilize a domain adversarial adaptation network to bridge the distribution gap between Twitter and Reddit data. Our novel framework achieves an accuracy of 73% and an F1 score of 0.77, which is a significant improvement of 20% compared to baseline models. Additionally, we address important research questions regarding the effectiveness of automatic labeling, the use of weakly labeled data, the contextual requirements for training domain adaptation models, and the optimal weak labeling method.
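
The weak-labeling idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the weak labels are derived from Wikipedia pages, so everything below is an assumption for illustration only: the keyword-overlap rule, the stopword list, the `min_hits` threshold, the sample reference text, and the function names `tokenize`, `top_keywords`, and `weak_label` are all hypothetical and are not the authors' method.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for",
             "on", "that", "with", "as", "by", "are", "it", "or"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def top_keywords(reference_text, k=10):
    """Return the k most frequent non-stopword tokens of a reference page."""
    counts = Counter(tokenize(reference_text))
    return {word for word, _ in counts.most_common(k)}


def weak_label(post, keywords, min_hits=2):
    """Weakly label a post as relevant (1) if it shares at least
    min_hits distinct keywords with the reference page, else 0."""
    hits = len(set(tokenize(post)) & keywords)
    return 1 if hits >= min_hits else 0


# Stand-in for the text of a Wikipedia page tied to a relevant subreddit.
reference_text = (
    "Climate change includes global warming driven by greenhouse gas "
    "emissions. Rising emissions raise global temperature and change "
    "climate patterns."
)
keywords = top_keywords(reference_text, k=8)

posts = [
    "Global emissions keep rising and the climate is warming fast",
    "Anyone have a good recipe for sourdough bread?",
]
labels = [weak_label(p, keywords) for p in posts]
print(labels)
```

Labels produced this way are noisy by design. In a framework like the one the abstract describes, they would presumably supplement the annotated Twitter data rather than replace it.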

Acknowledgements

Research was sponsored by the DEVCOM Analysis Center and was accomplished under Cooperative Agreement Number W911NF-22-2-0001. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein. This work was also supported in part by the Temple University Office of the Vice President for Research 2022 Catalytic Collaborative Research Initiative Program, AI & ML Focus Area.

Author information

Authors and Affiliations

Center for Data Analytics and Biomedical Informatics, Temple University, Philadelphia, PA, USA
Shelly Gupta, Jumanah Alshehri, Ameen Abdel Hai, Hussain Otudi & Zoran Obradovic
Imam Abdulrahman bin Faisal University, Dammam, Saudi Arabia
Jumanah Alshehri

Corresponding author

Correspondence to Shelly Gupta.

Editor information

Editors and Affiliations

University of Piraeus, Piraeus, Greece
Ilias Maglogiannis
Democritus University of Thrace, Xanthi, Greece
Lazaros Iliadis
University of Abertay, Dundee, UK
John Macintyre
Informatics, Ionian University, Corfu, Greece
Markos Avlonitis
Democritus University of Thrace, Xanthi, Greece
Antonios Papaleonidas

About this paper

Cite this paper

Gupta, S., Alshehri, J., Hai, A.A., Otudi, H., Obradovic, Z. (2024). From Tweets to Reddit: Leveraging Semi-supervised Domain Adaptation for Improving Data Filtering. In: Maglogiannis, I., Iliadis, L., Macintyre, J., Avlonitis, M., Papaleonidas, A. (eds) Artificial Intelligence Applications and Innovations. AIAI 2024. IFIP Advances in Information and Communication Technology, vol 714. Springer, Cham. https://doi.org/10.1007/978-3-031-63223-5_22

DOI: https://doi.org/10.1007/978-3-031-63223-5_22
Published: 21 June 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-63222-8
Online ISBN: 978-3-031-63223-5
eBook Packages: Computer Science, Computer Science (R0)
