DOI: 10.1145/2766462.2767701
research-article

HSpam14: A Collection of 14 Million Tweets for Hashtag-Oriented Spam Research

Published: 09 August 2015

Abstract

Hashtags facilitate information diffusion on Twitter by creating dynamic, virtual communities that aggregate information from all Twitter users. Because hashtags serve as additional channels through which a user's tweets can reach users other than their own followers, hashtags, particularly popular and trending ones, are targeted for spamming (e.g., hashtag hijacking). Although much effort has been devoted to fighting email and web spam, few studies address hashtag-oriented spam in tweets. In this paper, we collected 14 million tweets that matched some trending hashtags over a two-month period and then systematically annotated the tweets as spam or ham (i.e., non-spam). We name the annotated dataset HSpam14. Our annotation process includes four major steps: (i) heuristic-based selection to search for tweets that are more likely to be spam, (ii) near-duplicate-cluster-based annotation, which first groups similar tweets into clusters and then labels the clusters, (iii) reliable ham tweet detection to label tweets that are non-spam, and (iv) Expectation-Maximization (EM)-based label prediction to predict the labels of the remaining unlabeled tweets. One major contribution of this work is the creation of the HSpam14 dataset, which can be used for hashtag-oriented spam research in tweets. Another contribution is the set of observations made from a preliminary analysis of the HSpam14 dataset.
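
To make step (ii) more concrete, the following is a minimal sketch of near-duplicate clustering, not the authors' implementation. The abstract does not state the similarity measure, the shingle size, or the threshold, so the choices here (Jaccard similarity over word 3-shingles, a threshold of 0.6, and the function names `shingles`, `jaccard`, and `cluster_near_duplicates`) are illustrative assumptions.

```python
from itertools import combinations


def shingles(text, k=3):
    """Return the set of word k-grams (shingles) of a tweet."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_near_duplicates(tweets, threshold=0.6, k=3):
    """Group tweets whose shingle sets have Jaccard similarity >= threshold.

    Union-find merges chains of similar tweets into one cluster.
    The all-pairs comparison is O(n^2): acceptable for a demo, but a
    collection of millions of tweets would need hashing-based candidate
    generation (an assumption about scale, not a claim about the paper).
    """
    parent = list(range(len(tweets)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    sets = [shingles(t, k) for t in tweets]
    for i, j in combinations(range(len(tweets)), 2):
        if jaccard(sets[i], sets[j]) >= threshold:
            parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(tweets)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


if __name__ == "__main__":
    # Hypothetical tweets, made up for illustration only.
    sample = [
        "win a free iphone now click here #news",
        "win a free iphone now click here #sports",
        "great match tonight at the stadium #sports",
    ]
    print(cluster_near_duplicates(sample))
```

Once tweets are grouped this way, an annotator can assign one label per cluster and propagate it to every member, which is what makes cluster-level annotation cheaper than labeling tweets one by one.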

    Published In

    SIGIR '15: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval
    August 2015
    1198 pages
    ISBN:9781450336215
    DOI:10.1145/2766462
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. hashtag
    2. spam
    3. tweets
    4. twitter

    Qualifiers

    • Research-article

    Funding Sources

    • MOE Singapore

    Conference

    SIGIR '15

    Acceptance Rates

    SIGIR '15 Paper Acceptance Rate 70 of 351 submissions, 20%;
    Overall Acceptance Rate 792 of 3,983 submissions, 20%

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 24
    • Downloads (Last 6 weeks): 0
    Reflects downloads up to 12 Jan 2025

    Citations

    Cited By

    • (2024) Distributed Model Serving for Real-time Opinion Detection. 2024 IEEE International Conference on Service-Oriented System Engineering (SOSE), pages 64-73. DOI: 10.1109/SOSE62363.2024.00014. Online publication date: 15-Jul-2024
    • (2024) Diverse misinformation: impacts of human biases on detection of deepfakes on networks. npj Complexity, 1:1. DOI: 10.1038/s44260-024-00006-y. Online publication date: 18-May-2024
    • (2024) CGANS: a code-based GAN for spam detection in social media. Social Network Analysis and Mining, 14:1. DOI: 10.1007/s13278-024-01379-7. Online publication date: 17-Nov-2024
    • (2024) Supervised Machine Learning Based Anomaly Detection in Online Social Networks. Information Systems and Technologies, pages 85-91. DOI: 10.1007/978-3-031-45645-9_8. Online publication date: 14-Feb-2024
    • (2023) Markov-Driven Graph Convolutional Networks for Social Spammer Detection. IEEE Transactions on Knowledge and Data Engineering, 35:12, pages 12310-12322. DOI: 10.1109/TKDE.2022.3150669. Online publication date: 1-Dec-2023
    • (2023) SpADe: Multi-Stage Spam Account Detection for Online Social Networks. IEEE Transactions on Dependable and Secure Computing, 20:4, pages 3128-3143. DOI: 10.1109/TDSC.2022.3198830. Online publication date: 1-Jul-2023
    • (2023) A pre-protective objective in mining females social contents for identification of early signs of depression using soft computing deep framework. Scientific Reports, 13:1. DOI: 10.1038/s41598-023-40607-6. Online publication date: 9-Sep-2023
    • (2023) A hybrid Data-Driven framework for Spam detection in Online Social Network. Procedia Computer Science, 218:C, pages 124-132. DOI: 10.1016/j.procs.2022.12.408. Online publication date: 1-Jan-2023
    • (2023) Learning textual features for Twitter spam detection. Expert Systems with Applications: An International Journal, 228:C. DOI: 10.1016/j.eswa.2023.120366. Online publication date: 15-Oct-2023
    • (2023) Hybrid ensemble framework with self-attention mechanism for social spam detection on imbalanced data. Expert Systems with Applications: An International Journal, 217:C. DOI: 10.1016/j.eswa.2023.119594. Online publication date: 1-May-2023
