DOI: 10.1145/2381896.2381907
research-article

Robust detection of comment spam using entropy rate

Published: 19 October 2012

Abstract

In this work, we design a method for blog comment spam detection based on the assumption that spam is any kind of uninformative content. To measure the "informativeness" of a set of blog comments, we construct a language- and tokenization-independent metric, which we call content complexity, that provides a normalized answer to the informal question "how much information does this text contain?" We leverage this metric to create a small set of features well suited to comment spam detection by computing the content complexity over groupings of messages that share the same author, the same sender IP, the same included links, etc.
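As a rough illustration of the idea, the informativeness of a group of comments can be approximated by how well the group compresses. The sketch below is an assumption and not the paper's definition of content complexity: it uses Python's standard `lzma` module as the compressor, and the function names `content_complexity` and `complexity_by_group`, along with the `author` and `text` fields, are hypothetical.

```python
import lzma
from collections import defaultdict

# Raw LZMA2 stream (no container header), so small inputs are not
# dominated by fixed format overhead. This filter choice is an assumption.
_FILTERS = [{"id": lzma.FILTER_LZMA2, "preset": 6}]


def content_complexity(messages):
    """Return compressed size / raw size for a group of messages.

    Values near 0 suggest highly repetitive (uninformative) content;
    larger values suggest more varied text.
    """
    raw = "\n".join(messages).encode("utf-8")
    if not raw:
        return 0.0
    compressed = lzma.compress(raw, format=lzma.FORMAT_RAW, filters=_FILTERS)
    return len(compressed) / len(raw)


def complexity_by_group(comments, key):
    """Group comment dicts by `key` and score each group.

    `comments` is a list of dicts with a "text" field plus the grouping
    field (for example "author"); both field names are hypothetical.
    """
    groups = defaultdict(list)
    for comment in comments:
        groups[comment[key]].append(comment["text"])
    return {k: content_complexity(texts) for k, texts in groups.items()}
```

Working on raw bytes keeps the score independent of language and tokenization, and grouping by a different key (sender IP, included link) only requires changing the `key` argument.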
We evaluate our method against an exact set of tens of millions of comments collected over a four-month period and spanning a variety of websites, including blogs and news sites. The data was provided to us with an initial spam labeling from an industry-competitive source. Nevertheless, the initial spam labeling had unknown performance characteristics. To train a logistic regression on this dataset using our features, we derive a simple mislabeling-tolerant logistic regression algorithm based on expectation-maximization, which we show generally outperforms the plain version in precision-recall space.
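The abstract does not spell out the algorithm, so the following is only a minimal sketch of one common way to make logistic regression tolerant to mislabeling with expectation-maximization. It assumes each true label is flipped independently with two unknown rates, `fp` (clean comment labeled spam) and `fn` (spam labeled clean); these names, the function names, and the hyperparameters are all hypothetical rather than taken from the paper.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def predict(w, x):
    """P(true label = 1 | x); w[0] is the bias term."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


def clamp(v, lo=1e-3, hi=0.49):
    return min(max(v, lo), hi)


def fit_noise_tolerant(X, y, em_iters=20, gd_steps=50, lr=0.5):
    """EM for logistic regression with noisy binary labels y."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    fp, fn = 0.1, 0.1  # initial guesses for the flip rates (assumption)
    for _ in range(em_iters):
        # E-step: posterior that the true label is 1, given x and noisy y.
        q = []
        for xi, yi in zip(X, y):
            p = predict(w, xi)
            like1 = (1 - fn) if yi == 1 else fn   # P(y | true = 1)
            like0 = fp if yi == 1 else (1 - fp)   # P(y | true = 0)
            num = p * like1
            q.append(num / (num + (1 - p) * like0))
        # M-step, part 1: re-estimate the flip rates from the posteriors.
        sq = sum(q)
        fn = clamp(sum(qi for qi, yi in zip(q, y) if yi == 0) / max(sq, 1e-9))
        fp = clamp(sum(1 - qi for qi, yi in zip(q, y) if yi == 1)
                   / max(n - sq, 1e-9))
        # M-step, part 2: gradient ascent on the soft-target log-likelihood.
        for _ in range(gd_steps):
            grad = [0.0] * (d + 1)
            for xi, qi in zip(X, q):
                err = qi - predict(w, xi)
                grad[0] += err
                for j, xj in enumerate(xi):
                    grad[j + 1] += err * xj
            w = [wj + lr * gj / n for wj, gj in zip(w, grad)]
    return w, fp, fn


def make_noisy_data(n=300, miss_rate=0.3, seed=0):
    """Toy data: true label is x > 0; some positives are mislabeled as 0."""
    rng = random.Random(seed)
    X = [[rng.uniform(-3, 3)] for _ in range(n)]
    y = [1 if x[0] > 0 and rng.random() > miss_rate else 0 for x in X]
    return X, y
```

Calling `fit_noise_tolerant(*make_noisy_data())` returns the weights together with the estimated flip rates; the plain version corresponds to fixing both rates at zero, in which case the posteriors `q` equal the observed labels.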
By using a parsimonious hand-labeling strategy, we show that our method can operate at an arbitrarily high precision level, and that it significantly dominates the original labeling in terms of both precision and recall, despite being trained on it alone.
The content complexity metric, the use of a noise-tolerant logistic regression, and the evaluation methodology are thus the three central contributions of this work.


Published In

AISec '12: Proceedings of the 5th ACM workshop on Security and artificial intelligence
October 2012
116 pages
ISBN:9781450316644
DOI:10.1145/2381896
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. comment spam
  2. content complexity
  3. logistic regression
  4. noisy label
  5. spam filtering

Qualifiers

  • Research-article

Conference

CCS'12: the ACM Conference on Computer and Communications Security
October 19, 2012
Raleigh, North Carolina, USA

Acceptance Rates

AISec '12 Paper Acceptance Rate: 10 of 24 submissions, 42%
Overall Acceptance Rate: 94 of 231 submissions, 41%

Cited By

  • (2023) Youtube Spam Detection Scheme Using Stacked Ensemble Machine Learning Model. 2023 International Conference on Network, Multimedia and Information Technology (NMITCON), pp. 1-7. DOI: 10.1109/NMITCON58196.2023.10276002. Online publication date: 1-Sep-2023
  • (2023) Machine Learning based Spam Comments Detection on YouTube. 2023 7th International Conference on Intelligent Computing and Control Systems (ICICCS), pp. 1234-1239. DOI: 10.1109/ICICCS56967.2023.10142608. Online publication date: 17-May-2023
  • (2023) Evaluation of AI Techniques for Detecting Deceptive Reviews in Cyberspace: A Study of Pre- and Post-COVID-19 Trends. 2023 Second International Conference on Electronics and Renewable Systems (ICEARS), pp. 961-967. DOI: 10.1109/ICEARS56392.2023.10085689. Online publication date: 2-Mar-2023
  • (2021) A YouTube Spam Comments Detection Scheme Using Cascaded Ensemble Machine Learning Model. IEEE Access, 9, pp. 144121-144128. DOI: 10.1109/ACCESS.2021.3121508. Online publication date: 2021
  • (2021) Video Description Based YouTube Comment Classification. Applications of Artificial Intelligence in Engineering, pp. 667-678. DOI: 10.1007/978-981-33-4604-8_51. Online publication date: 11-May-2021
  • (2020) On the Complexity of Traffic Traces and Implications. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 4:1, pp. 1-29. DOI: 10.1145/3379486. Online publication date: 5-Jun-2020
  • (2020) A Survey on Machine Learning Techniques for Cyber Security in the Last Decade. IEEE Access, 8, pp. 222310-222354. DOI: 10.1109/ACCESS.2020.3041951. Online publication date: 2020
  • (2020) Analysis and Classification of User Comments on YouTube Videos. Procedia Computer Science, 177, pp. 593-598. DOI: 10.1016/j.procs.2020.10.084. Online publication date: 2020
  • (2019) Adversarial Machine Learning. DOI: 10.1017/9781107338548. Online publication date: 14-Mar-2019
  • (2018) N-Gram Assisted Youtube Spam Comment Detection. Procedia Computer Science, 132, pp. 174-182. DOI: 10.1016/j.procs.2018.05.181. Online publication date: 2018
