
From Tweets to Reddit: Leveraging Semi-supervised Domain Adaptation for Improving Data Filtering

  • Conference paper
Artificial Intelligence Applications and Innovations (AIAI 2024)

Abstract

Reddit has emerged as a leading platform for microblogging data collection, providing valuable insights into patterns and knowledge discovery. However, the process of gathering and preparing this data presents significant challenges, particularly when it comes to ensuring its accuracy. Existing methodologies often yield an abundance of irrelevant posts, and datasets for relevance prediction on Reddit are rare. To overcome these obstacles, we propose a new semi-supervised framework that filters Reddit posts based on their topic relevance. Our approach combines annotated data from Twitter with weak labels generated from Wikipedia pages associated with relevant subreddits to automatically label Reddit posts. To enhance the model’s generalization performance, we utilize a domain adversarial adaptation network to bridge the distribution gap between Twitter and Reddit data. Our novel framework achieves an accuracy of 73% and an F1 score of 0.77, which is a significant improvement of 20% compared to baseline models. Additionally, we address important research questions regarding the effectiveness of automatic labeling, the use of weakly labeled data, the contextual requirements for training domain adaptation models, and the optimal weak labeling method.
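The abstract does not say how the weak labels are computed from the Wikipedia pages, so the following is only a minimal sketch of one plausible scheme: extract the most frequent non-stopword terms from a topic's reference text and mark a post as relevant when it shares enough of them. The function names, the stopword list, the `k` and `min_hits` thresholds, and the sample texts are all assumptions made for illustration, not the authors' method.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption; a real pipeline would use a fuller one).
STOPWORDS = {"the", "a", "an", "of", "and", "or", "to", "in", "is", "are",
             "for", "on", "with", "that", "this", "it", "as", "by", "my", "i"}


def tokenize(text):
    """Lowercase the text and return alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def top_keywords(reference_text, k=10):
    """Return the k most frequent non-stopword tokens of the reference text."""
    counts = Counter(tokenize(reference_text))
    return {word for word, _ in counts.most_common(k)}


def weak_label(post, keywords, min_hits=2):
    """Return 1 (relevant) if the post shares at least min_hits distinct keywords, else 0."""
    hits = set(tokenize(post)) & keywords
    return 1 if len(hits) >= min_hits else 0


# Hypothetical reference text standing in for a Wikipedia page tied to a subreddit.
reference = (
    "Climate change includes global warming driven by greenhouse gas emissions. "
    "Emissions of carbon dioxide raise global temperature. "
    "Climate policy aims to cut emissions."
)

# Hypothetical Reddit posts to label.
posts = [
    "New report on global emissions and climate targets",
    "Looking for a good pizza place in Rome",
]

keywords = top_keywords(reference, k=3)
labels = [weak_label(post, keywords) for post in posts]
print(labels)  # [1, 0]
```

Labels produced this way are noisy by construction, which is why the abstract pairs them with annotated Twitter data and a domain adversarial adaptation step rather than using them alone.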



Acknowledgements

Research was sponsored by the DEVCOM Analysis Center and was accomplished under Cooperative Agreement Number W911NF-22-2-0001. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein. This work was also supported in part by the Temple University Office of the Vice President for Research 2022 Catalytic Collaborative Research Initiative Program, AI & ML Focus Area.


Corresponding author

Correspondence to Shelly Gupta.


Copyright information

© 2024 IFIP International Federation for Information Processing

About this paper


Cite this paper

Gupta, S., Alshehri, J., Hai, A.A., Otudi, H., Obradovic, Z. (2024). From Tweets to Reddit: Leveraging Semi-supervised Domain Adaptation for Improving Data Filtering. In: Maglogiannis, I., Iliadis, L., Macintyre, J., Avlonitis, M., Papaleonidas, A. (eds) Artificial Intelligence Applications and Innovations. AIAI 2024. IFIP Advances in Information and Communication Technology, vol 714. Springer, Cham. https://doi.org/10.1007/978-3-031-63223-5_22


  • DOI: https://doi.org/10.1007/978-3-031-63223-5_22


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-63222-8

  • Online ISBN: 978-3-031-63223-5

  • eBook Packages: Computer Science, Computer Science (R0)
